Hi, everyone seems to use the GMB microbenchmark output as the reference standard for kernel optimization. I was wondering why I do not get similar results in my own kernel, which does aligned, coalesced reads and writes.
When I look at the GMB kernel's code, I see that (in the linear read benchmark, for example) each address in memory is in fact read more than once: every work-item first reads the element at the index equal to its gid, then index + 1, and so on. This means that the work-item with local id X reads, in step N, from the same address that the work-item with local id X + 1 read in step N - 1. The result is therefore cached, and the read is much faster.
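To make the overlap concrete, here is a small Python sketch of the access pattern I mean (the constants STEPS and NUM_ITEMS are made up for illustration; they are not taken from the GMB source):

```python
# Sketch of the read pattern: work-item `gid` reads addresses
# gid, gid + 1, ..., gid + STEPS - 1 in successive steps.
STEPS = 4       # illustrative number of reads per work-item
NUM_ITEMS = 8   # illustrative number of work-items

def addresses(gid, steps=STEPS):
    """Addresses touched by work-item `gid`, one per step."""
    return [gid + n for n in range(steps)]

# Work-item X at step N touches the same address that work-item
# X + 1 touched at step N - 1, so the data is already in cache.
for x in range(NUM_ITEMS - 1):
    for n in range(1, STEPS):
        assert addresses(x)[n] == addresses(x + 1)[n - 1]
```

So every address except those at the edges of the range is read STEPS times in total, once per neighbouring work-item.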
However, the profiler shows a 0% cache hit rate for all kernels, which seems a little weird, doesn't it?
Am I understanding the code correctly? Should memory bandwidth really be measured this way, with cached results? I think we should read each element in the buffer only once to get a true picture of the system's performance.
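Here is why reading each element only once matters, as a hypothetical accounting sketch (all numbers are made up, not measurements from my card): if the benchmark counts every read, including re-reads that could be served from cache, the reported GB/s can overstate the actual DRAM traffic by the re-read factor.

```python
# Hypothetical bandwidth accounting (illustrative numbers only).
buffer_bytes = 64 * 1024 * 1024   # 64 MiB buffer, assumed size
steps = 4                         # each address read ~`steps` times
elapsed_s = 0.016                 # made-up kernel execution time

# Counting every read, cached or not:
reported_gbs = buffer_bytes * steps / elapsed_s / 1e9
# Counting each element once (the traffic the DRAM actually serves
# if the re-reads hit in cache):
actual_gbs = buffer_bytes / elapsed_s / 1e9
```

With cached re-reads, `reported_gbs` is exactly `steps` times `actual_gbs`, which is why the measurement method matters.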
For completeness: I use an ATI Radeon Mobile HD 4500, the ATI APP SDK 2.3, Catalyst 11.2, and an Intel i5-430M CPU. The benchmark reports results of about 17/10/10/12 GB/s.
Thanks. So, from what I understand, caching is only available for constant memory, and the results in fact aren't biased.
However, if caching were working, it would produce incorrect results, so the method is conceptually wrong, isn't it?
Yeah, caching is likely to happen in that case. But there is another kernel, called read_linear_uncached, which can be used as the benchmark instead.