
OpenCL

zenome
Journeyman III

Re: Unexpected Cache Hit statistics

Hi,

Thanks a lot, Himanshu, for your suggestions and for following up on this.

Very sorry for the delay in replying.

I did some new tests:

Just to recall: I am running these on an AMD 7970, which has 32 CUs.

Case 1:

int iGID = get_global_id(0);

c[iGID] = a[iGID];
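For reference, a minimal complete kernel for this case might look as follows (the kernel name and buffer arguments are my assumption; only the two lines above are from the original post):

__kernel void case1(__global const float *a, __global float *c)
{
    int iGID = get_global_id(0);   // one element per work-item
    c[iGID] = a[iGID];             // streaming copy: one load, one store
}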

WorkgroupSize=256, TotalThreads=256*32, CacheHit%=4%  // multiplied by 32 for fair load-balancing of workgroups across the CUs

WorkgroupSize=64, TotalThreads=64*32, CacheHit%=13%  // 64 is the wavefront (WF) size

WorkgroupSize=16, TotalThreads=16*32, CacheHit%=37%  // 16 because a WF is divided into quarter-WFs during execution

WorkgroupSize=8, TotalThreads=8*32, CacheHit%=66%  // no deep logic behind 8, just to reduce cache contention

My observations:

* Since the GPU caches global memory loads, all the threads in a WF should miss before the first access to a line is served. I am not sure how the hit rate can be as high as 37% with 16 threads per workgroup, given that a quarter-WF is the lowest granularity of execution (with 4-byte floats, the 16 coalesced loads of a quarter-WF map to exactly one 64B cache line, so every line's first touch should be a cold miss).

* It also seems that, even within a WF, cache eviction happens between quarter-wavefronts; compare the 64- and 16-thread cases. But I am not convinced this is consistent with an LRU replacement policy.

Case 2:

int iGID = get_global_id(0);

float tmp = a[iGID];

barrier(CLK_GLOBAL_MEM_FENCE);

tmp += a[iGID+1] + a[iGID+2] + a[iGID+3];

c[iGID] = tmp;

WorkgroupSize=256, TotalThreads=256*32, CacheHit%=7%  // multiplied by 32 for fair load-balancing of workgroups across the CUs

WorkgroupSize=64, TotalThreads=64*32, CacheHit%=23%  // 64 is the wavefront (WF) size

WorkgroupSize=16, TotalThreads=16*32, CacheHit%=52%  // 16 because a WF is divided into quarter-WFs during execution

WorkgroupSize=8, TotalThreads=8*32, CacheHit%=70%  // no deep logic behind 8, just to reduce cache contention

* The behavior here is still questionable. The numbers do not increase much over Case 1, yet the loads a[iGID+1], a[iGID+2], a[iGID+3] should all have hit the cache, given that we are using only a tiny fraction of the large per-CU L1 cache.

Well, for all of the above I am assuming that multiple workgroups are not scheduled on the same CU. I can try forcing only one workgroup per CU by using a large amount of local memory in the kernels, as sketched below.
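A minimal sketch of that idea (my code, not from the post; the kernel name and the 32KB size are assumptions, and usable LDS per workgroup is device dependent):

__kernel void case1_limited_occupancy(__global const float *a, __global float *c)
{
    // Large dummy LDS allocation (8192 floats = 32KB): a __local buffer
    // close to CL_DEVICE_LOCAL_MEM_SIZE limits how many workgroups can
    // be resident on a CU at the same time.
    __local float lds_pad[8192];
    int iGID = get_global_id(0);
    int lid  = get_local_id(0);
    lds_pad[lid] = a[iGID];          // touch the LDS so it is not eliminated
    barrier(CLK_LOCAL_MEM_FENCE);
    c[iGID] = lds_pad[lid];
}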

zenome
Journeyman III

Re: Unexpected Cache Hit statistics

Thanks. But I still believe there is no unused code in these small programs, as all the operations feed into a write to global memory.

I verified it with the disassembled code. So, for the case of:

float tmp = a[iGID];

barrier(CLK_GLOBAL_MEM_FENCE);

tmp += a[iGID+1] + a[iGID+2] + a[iGID+3];

The disassembled code looks like:

t_buffer_load_format_x  v2, v1, s[8:11], 0 offen forma......

s_barrier

t_buffer_load_format_x  v3, v1, s[8:11], 0 offen offs...

t_buffer_load_format_x  v4, v1, s[8:11], 0 offen offs...

t_buffer_load_format_x  v1, v1, s[8:11], 0 offen offs...

...

t_buffer_store_format_x  v1, v0, s[0:3], 0 offen...

Skysnake
Adept II

Re: Unexpected Cache Hit statistics

OK, I have it -.-

Remember what get_global_id(0) does: for 256 threads you get [0-255] as the result. Because of the form of your index (+x, [0-8]), you use the array elements [0-263]. So you use every data element less than 8 times per workgroup (note: I assume you launch DIM0=256, DIM1=32, DIM2=0 threads).

So per workgroup you have to load 263 different elements, out of 256*8 = 2048 loads in total. To make it easy, say you reuse every element 8 times. Then you have a maximum cache hit rate of ~0.4% in every workgroup, if you never flush your cache!

The rest of the hits comes from the L2, where you have a cache hit ratio of at most 1/32, so ~3.2% (assuming the result value is not reused, because it is not clear whether the GPU stores the data in a register or not).

To raise the cache hit ratio, you have to use get_local_id(0).
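A minimal sketch of that suggestion (my code, not Skysnake's; the kernel name is made up): indexing the loads with get_local_id(0) makes every workgroup touch the same small window of a[], so the data actually gets reused:

__kernel void local_id_variant(__global const float *a, __global float *c)
{
    int iGID = get_global_id(0);
    int lid  = get_local_id(0);            // [0, workgroup size)
    float tmp = a[lid];                    // every workgroup reads the same window
    barrier(CLK_GLOBAL_MEM_FENCE);
    tmp += a[lid+1] + a[lid+2] + a[lid+3]; // reuses lines fetched by neighbors
    c[iGID] = tmp;                         // result still written per work-item
}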

zenome
Journeyman III

Re: Unexpected Cache Hit statistics

Just to add one more twist: the same microbenchmarks, when run on an NVIDIA Fermi:

Case 1:

c[tid] = a[tid];


0% L1 cache hits. Quite expected: cold cache misses.

Case 2:

float tmp = a[iGID];

barrier(CLK_GLOBAL_MEM_FENCE); // __syncthreads() in CUDA

tmp += a[iGID+1] + a[iGID+2] + a[iGID+3];

c[iGID] = tmp;

~73% to 88% L1 cache hits (increasing with the block size). Again, as expected.

himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

Hey,

Nice experiments. I think the L1 cache is shared among 4 CUs.

See below from GCN Architecture paper.

The 16KB scalar data L1 is 4-way associative with 64B lines and LRU replacement; it is also shared between a cluster of up to 4 Compute Units and backed by the L2 cache.

That probably explains why you are getting hits on AMD platform for Case 2 (assuming there is no dead code there).

Case 1 is puzzling.

The same paper also talks about CU specific L1 data cache.

By the way, what are the data types of c and a? I hope they are ints.

On NVIDIA, the L1 cache is tied to the CU. Each CU has its own L1 cache, which can also be configured as "local memory".

Hence you are seeing an absolute 0% cache hit rate for c[] = a[];

The texture cache on NVIDIA, by contrast, is shared by more CUs (multiprocessors) and is read-only in nature.

This is similar to AMD's L1 scalar caches, which I believe are also read-only (correct me if I am wrong here).

himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

I hope Case 2 meant "c[iGID] = tmp;"

Otherwise it will be optimized out as dead code.

himanshu_gautam
Grandmaster

Re: Unexpected Cache Hit statistics

[Attachment: GCNCacheArchitecture.png (GCN cache architecture diagram)]

Courtesy: GCN Arch whitepaper.

As you can see, the Vector Data cache is private to each CU, the Scalar Data cache is shared by a cluster of CUs, and there is also the instruction cache.

I just hope the cache-usage figure you are getting does not factor in the instruction cache 😞

Can you get more details of how exactly you are measuring the cache-hit ratio?

Is it specific to L1, or does it include L2 as well?

Also, I am not sure whether the hardware can prefetch data into cache lines; I don't see it mentioned in the whitepaper.

Another possibility: say a cache miss fetches 4 lines together... (I have seen similar behavior with Intel, where a miss brings in 2 cache lines instead of one.)
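Back-of-the-envelope arithmetic for that hypothesis (purely illustrative; the 4-lines-per-miss figure is the speculation above, not documented GCN behavior):

#include <stdio.h>

/* If each miss pulled in N consecutive 64B lines, a purely sequential
 * float stream would miss only once per N line requests. */
int main(void)
{
    int lines_per_miss = 4;                       /* hypothesized above */
    double hit_rate = 1.0 - 1.0 / lines_per_miss; /* 0.75, i.e. 75% */
    printf("expected line-level hit rate: %.0f%%\n", hit_rate * 100.0);
    return 0;
}

A ~75% ceiling would at least be in the ballpark of the 66-70% observed at the smallest workgroup sizes.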

zenome
Journeyman III

Re: Unexpected Cache Hit statistics

That's correct; my bad while typing the post. I have corrected it. (There is no dead code in the actual program.)

Also, the data types of a and c are float.

zenome
Journeyman III

Re: Unexpected Cache Hit statistics

I am measuring the cache hits with the AMD APP Profiler, which does not specify whether it is L1 or L2, though it does claim to measure the data cache and not the instruction cache. The counter description reads:

CacheHit: The percentage of fetches that hit the data cache. Value range: 0% (no hit) to 100% (optimal).

lbin
Staff

Re: Unexpected Cache Hit statistics

It's the L2 cache.
