cancel
Showing results for 
Search instead for 
Did you mean: 

OpenCL

zenome
Journeyman III

Unexpected Cache Hit statistics

Hi All,

I was wondering if the cache hit statistics output by the AMD APP Profiler (v2.5) are reliable.

I was experimenting with a very simple kernel:

   int iGID = get_global_id(0);

   c[iGID] = a[iGID];

When I run the above kernel on AMD 7970, I get the CacheHit as low as 0.4%.

Now, since in this case the global memory accesses are coalesced, I was assuming that I should be getting a very high Cache Hit rate given the global memory accesses are cached.

When I run the following kernel:

   int iGID = get_global_id(0);

   c[iGID] = a[iGID] + a[iGID+1] + a[iGID+2] + a[iGID+3] +a[iGID+4]; // I am taking care of not overshooting the buffer size by last threads

I get CacheHit as 12%.

Again, I was expecting a very large CacheHit here.

I have played around these around with 4096 Threads and 256 WG Size. The array 'a' should easily fit inside the 16KB L1 cache per CU>

Are these numbers very low because:

The Profiler only samples a particular block, which if happens to be the first one, Cache Hits are low due to cold-misses  (I know, it really can not account for these numbers but that is my only guess) ?

0 Kudos
Reply
22 Replies
himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

Caches need to be warmed up first. If you are accessing a variable only once, then it will need to come from global memory for first time. This explains 0.4%. I think this is reasonable

As far as your second code goes, it could be due to the CU executing multiple workgroups.

If the caches are not fully associative, you can expect multiple workgroups to displace each other's data.

Not sure what is the cache architecture on 7970 is.

Do you have any idea?

zenome
Journeyman III

Re: Unexpected Cache Hit statistics

Well, GCN Whitepaper says it to be 4-way set associative with 64 B cache line size. Per CU L1 cache size is 16KB.

In the second case, I tried to run following kernel with total 1024 threads(total) and the cache hit now is 14%.

int iGID = get_global_id(0);

c[iGID] = a[iGID] + a[iGID+1] + a[iGID+2] + a[iGID+3] + a[iGID+4]+ a[iGID+5]+ a[iGID+6]+ a[iGID+7]+ a[iGID+8];

Now, 1024 threads (total) or 16 Wavefronts is really not a big number for 7970 (it has 32 CUs).

Even if all the threads are somehow scheduled on one CU (7970 allows 40 Wavefronts/CU to be active at a time), the array 'a' of size 4K very nicely fits in the 16K L1 cache.

0 Kudos
Reply
himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

Usually, the wavefronts are scheduled on CUs before they get stacked up on a particular CU.

So, given 16 wavefronts, each wavefront will posibly go to 1 CU. And 16 CUs will still stall.

Cant you make it at least 32 wavefronts?

Also, I just hope the profiler is smart enough to profile on a CU that is being used 😉

I am assuming the data type of A is "int'.

Also, we dont know how the compiler has scheduled the loads.

Can you try first loading A[gid] and then do a "barrier".

After that, read the rest, add up and then write C

0 Kudos
Reply
Raistmer
Adept II

Re: Unexpected Cache Hit statistics

Can't recall precisely (if it was ATi or NV profiler) but there was info that profiling counters work only on first CU in device.

Usually first CU will be loaded most indeed.

0 Kudos
Reply
himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

Usually first CU will be loaded most indeed.

All CUs get equal load. There is nothing special about CU 0.

But, it could be possible that if there is only one workgroup launched -- it will always go to CU 0. So, profliling CU0 probably is a safe bet because sometihng will be executing there for sure. This is just a guess.

Apart from that, there is nothing that suggests that CU0 will be the most loaded.

0 Kudos
Reply
himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

Can you try the "barrier" thing listed above and let us know if that helped?

Also, please increase the number of workgroups.

0 Kudos
Reply
Skysnake
Adept II

Re: Unexpected Cache Hit statistics

I hope you know, that the LLVM "compiles" your code, and what runs on the hardware is not always what you can see.

Just with such a simple Kernel, the compiler will make heavy use from registers and so on.

The LLVM is really powerful! Just see it as a GCC mit -O3. When you really want to know what is going on, you have to look at the Assembler-Code.

When you do useless stuff, the LLVM very often just skip this parts of Code

So it is really not easy to utalize the GPU with useless stuff... I have tried this years ago, and have really often noticed, that it is really difficult to add useless code, to utalize the GPU much better for Testings.

If you want to do something like this, usw branches, where you check your result!

0 Kudos
Reply
himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

Skysnake,

This is not really a useless code. The compiler cannot ignore a code snippet like below.

C[gid] = A[gid] + A[gid +1] + .... and so on..

This is not a un-reachable code (or) something that can be calculated at compile time.

LLVM or whatever compiler -- has to generate code for this.

0 Kudos
Reply
Skysnake
Adept II

Re: Unexpected Cache Hit statistics

When you never use this manipulations of Data, the LLVM can delete all this apoerations sometimes. You really have to "use" the results. when you write them into a write buffer in the global memory everything is fine.

I am not able to see if you do this or not with just such a few lines of code. I can you only say, that you have to be careful, because it is not so clear what the GPU really do like the most people think

0 Kudos
Reply