Hi, everyone uses the output of the GMB microbenchmark as the reference point for kernel optimization. I was wondering why I do not get similar results in my own kernel, which does aligned, coalesced reads and writes.
When I looked at the GMB kernel code, I saw that (for example, in the linear read benchmark) each address in memory is read more than once: every work-item first reads the element at the index equal to its gid, then the element at index + 1, and so on. This means that in step N the work-item with local id X reads the same address that the work-item with local id X + 1 read in step N - 1. The data is therefore cached, and the reads are much faster.
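To make the overlap concrete, here is a small host-side sketch in Python. It is not the SDK's kernel code; it only models the access pattern as I described it above, and the function name and the sizes (8 work-items, 4 steps) are made up for illustration.

```python
from collections import Counter

def linear_read_addresses(num_work_items, steps):
    """Count reads per address when work-item gid reads gid, gid + 1, ..., gid + steps - 1.

    Hypothetical model of the pattern described in the post, not the GMB source.
    """
    reads = Counter()
    for gid in range(num_work_items):
        for step in range(steps):
            reads[gid + step] += 1
    return reads

reads = linear_read_addresses(num_work_items=8, steps=4)
total_reads = sum(reads.values())   # every read issued by every work-item
unique_addresses = len(reads)       # distinct elements actually touched
print(total_reads, unique_addresses)
```

With these toy sizes, 32 reads touch only 11 distinct addresses, so most reads hit an element that a neighbouring work-item has already fetched.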
However, the profiler reports a 0% cache hit rate for all kernels (which seems a bit odd, doesn't it?).
Am I understanding the code correctly? Should memory bandwidth really be measured this way, with cached results? I think we should read each element in the buffer only once to get an accurate picture of the system's performance.
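One way to read each element exactly once, sketched on the host side in Python, is to let every work-item advance by the total number of work-items in each step. This stride scheme is my own assumption about how such a benchmark could be indexed, not something taken from GMB, and the names and sizes are hypothetical.

```python
def single_read_addresses(num_work_items, steps):
    """List the addresses read when work-item gid reads gid + step * num_work_items.

    Hypothetical indexing scheme: within one step, neighbouring work-items
    read neighbouring addresses, and no address is visited twice.
    """
    return [gid + step * num_work_items
            for step in range(steps)
            for gid in range(num_work_items)]

addresses = single_read_addresses(num_work_items=8, steps=4)
every_element_read_once = sorted(addresses) == list(range(8 * 4))
print(every_element_read_once)
```

Within a single step the work-items still access consecutive addresses, so the reads stay coalesced while nothing is fetched a second time.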
For completeness: I use an ATI Radeon Mobile HD 4500, the ATI APP 2.3 SDK, Catalyst 11.2 and an Intel i5-430M CPU. The benchmark reports about 17/10/10/12 GB/s.