Hi All,
I was wondering if the cache hit statistics output by the AMD APP Profiler (v2.5) are reliable.
I was experimenting with a very simple kernel:
int iGID = get_global_id(0);
c[iGID] = a[iGID];
When I run the above kernel on an AMD 7970, I get a CacheHit as low as 0.4%.
Since the global memory accesses in this case are coalesced, I assumed I would get a very high cache hit rate, given that global memory accesses are cached.
When I run the following kernel:
int iGID = get_global_id(0);
c[iGID] = a[iGID] + a[iGID+1] + a[iGID+2] + a[iGID+3] + a[iGID+4]; // I make sure the last threads do not read past the end of the buffer
I get a CacheHit of 12%.
Again, I was expecting a very large CacheHit here.
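To put a rough number on that expectation, here is a minimal sketch that counts how many element reads within one work-group revisit an element that some thread in the group has already read. It is only an idealized per-element model with an unbounded cache; it is not how the profiler computes CacheHit, and real hardware works on cache lines and whole-wavefront requests. The function name ideal_reuse_fraction is made up for this illustration.

```python
def ideal_reuse_fraction(wg_size, reads_per_thread):
    """Fraction of element reads that revisit an element already read
    by some thread of the same work-group (idealized, unbounded cache)."""
    seen = set()
    total = 0
    repeats = 0
    for tid in range(wg_size):
        for k in range(reads_per_thread):
            idx = tid + k          # thread tid reads a[tid], ..., a[tid + reads_per_thread - 1]
            total += 1
            if idx in seen:
                repeats += 1
            else:
                seen.add(idx)
    return repeats / total

# First kernel: one read per thread. Second kernel: five consecutive reads per thread.
print(ideal_reuse_fraction(256, 1))
print(ideal_reuse_fraction(256, 5))
```

In this model the first kernel has no repeated reads at all, because every element is read exactly once, while the second kernel repeats roughly four out of every five reads. How that maps onto the profiler's counter depends on how it counts requests, which this sketch does not model.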
I have experimented with these kernels using 4096 threads and a work-group size of 256. The array 'a' should easily fit inside the 16KB L1 cache per CU.
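As a quick check of the sizes involved, the sketch below works out the memory footprint of 'a' for the whole launch and per work-group. It assumes 4-byte elements, which is an assumption since the element type is not stated above; the helper name footprint_bytes is also made up for illustration.

```python
# Assumption: 4-byte elements (e.g. int or float); the element type is not stated in the post.
ELEM_BYTES = 4
THREADS = 4096          # global work size from the post
WG_SIZE = 256           # work-group size from the post
L1_BYTES = 16 * 1024    # per-CU L1 size quoted in the post

def footprint_bytes(num_elements, elem_bytes=ELEM_BYTES):
    """Bytes occupied by num_elements consecutive array elements."""
    return num_elements * elem_bytes

work_groups = THREADS // WG_SIZE
whole_array = footprint_bytes(THREADS)
per_group_first = footprint_bytes(WG_SIZE)        # first kernel: a[iGID] only
per_group_second = footprint_bytes(WG_SIZE + 4)   # second kernel: a[iGID] .. a[iGID+4]

print("work-groups:", work_groups)
print("whole array (bytes):", whole_array, "fits in L1:", whole_array <= L1_BYTES)
print("per work-group, first kernel (bytes):", per_group_first)
print("per work-group, second kernel (bytes):", per_group_second)
```

Under the 4-byte assumption the full array is exactly 16KB, so it only just matches the L1 size, but each of the 16 work-groups touches only about 1KB of it.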
Are these numbers so low because the profiler only samples a particular block? If that block happens to be the first one, cache hits would be low due to cold misses (I know this cannot really account for these numbers, but it is my only guess).