
jai_cool · ‎09-10-2012

Hi there,

I would like to know how to write L1-cache-friendly code. Assume an L1 cache line is X bytes long. If my wavefront needs X bytes from device memory, and other wavefronts in the same block (or wavefronts from other blocks running in parallel on the same execution unit) need the same X bytes, I expect to see a high L1 cache hit rate.

Consider the following code:

<code>
__kernel void testKernel(__global float *A, __global float *B)
{
    // Every work-item reads the same address, B[0]: does this hit the L1 cache?
    float val = B[0];

    // Only work-group 0 writes, one element per work-item.
    if (get_group_id(0) == 0)
        A[get_local_id(0)] = val;
}
</code>

Is this L1-cache-friendly code? Should I expect a high L1 cache hit rate in the AMD OpenCL Profiler when I run it with hundreds of blocks, each with 256 or 512 threads?

jai_cool · ‎09-11-2012

Can anyone help me with this?

Thanks!

binying · ‎09-11-2012

What's your hardware?

amdkid · ‎09-11-2012

Hello jai_cool

I think you need to use the built-in function prefetch, but I don't know how to use it effectively. I am also interested in this issue.

jai_cool · ‎09-11-2012

My GPU is an AMD Radeon™ HD 6970. Please let me know how to get a high L1 cache hit rate. My understanding is that all work-groups (or, even better, all wavefronts within a work-group) should read the same address or the same cache line, as in my code snippet.

amdkid · ‎09-12-2012

As I already wrote, I don't know, but I am also interested in this issue; that's why I am in this thread.

notzed · ‎09-12-2012

Somewhere in chapter 4 of the APP programming guide is about all the public information available on this.

Other than that, "L1-cache-friendly code" is just code that keeps its memory accesses within a small address range, since L1 is so small.
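To make the "small range" point concrete, here is a minimal sketch in plain C (host-side, not OpenCL) of a toy direct-mapped cache. The function name `sweep_hit_rate`, the direct-mapped organisation, and the 8 KB / 64-byte figures in the usage note are illustrative assumptions, not documented properties of the HD 6970's L1.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINES 4096 /* upper bound on lines in the toy cache (assumption) */

/* Hypothetical helper: walks addresses 0, step, 2*step, ... below range_bytes,
 * `passes` times, through a toy direct-mapped cache and returns the hit rate.
 * A real GPU L1 is organised differently; this only models capacity. */
double sweep_hit_rate(size_t range_bytes, size_t passes, size_t step_bytes,
                      size_t line_bytes, size_t cache_bytes)
{
    assert(range_bytes > 0 && passes > 0 && step_bytes > 0 && line_bytes > 0);
    size_t num_lines = cache_bytes / line_bytes;
    assert(num_lines > 0 && num_lines <= MAX_LINES);

    size_t tags[MAX_LINES] = {0};          /* line number stored in each slot */
    unsigned char valid[MAX_LINES] = {0};  /* 1 once a slot has been filled */
    size_t hits = 0, accesses = 0;

    for (size_t p = 0; p < passes; ++p) {
        for (size_t addr = 0; addr < range_bytes; addr += step_bytes) {
            size_t line = addr / line_bytes;  /* cache line holding addr */
            size_t slot = line % num_lines;   /* direct-mapped placement */
            if (valid[slot] && tags[slot] == line) {
                ++hits;
            } else {                          /* miss: fill the slot */
                valid[slot] = 1;
                tags[slot] = line;
            }
            ++accesses;
        }
    }
    return (double)hits / (double)accesses;
}
```

With an assumed 8 KB cache and 64-byte lines, `sweep_hit_rate(4096, 4, 64, 64, 8192)` returns 0.75 (only the first of four passes misses), while `sweep_hit_rate(16384, 4, 64, 64, 8192)` returns 0.0 because the 16 KB range keeps evicting itself before any line is reused.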

jai_cool · ‎09-12-2012

So, are we all convinced that the code snippet I posted should result in a very high L1 cache hit rate?

mikewolf_gkd · ‎09-12-2012

I think your code is cache friendly. An L1 cache line is 64 bytes, so the reads/writes of 16 threads (16 × 4-byte floats = 64 bytes) should be coalesced.
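The arithmetic behind that estimate can be checked with a small host-side C sketch. It assumes the 64-byte line size quoted above, a buffer that starts on a line boundary, and 4-byte floats; `lines_touched` is a hypothetical helper name, and the model says nothing about how the hardware actually schedules the requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the distinct cache lines touched when n_items
 * work-items each read one elem_bytes-sized element, where work-item i reads
 * element index i * stride (stride 0 = every work-item reads element 0).
 * Assumes the buffer starts on a line boundary and that elem_bytes divides
 * line_bytes, so no element straddles two lines. */
size_t lines_touched(size_t n_items, size_t stride, size_t elem_bytes,
                     size_t line_bytes)
{
    assert(n_items > 0 && elem_bytes > 0 && line_bytes > 0);
    assert(line_bytes % elem_bytes == 0);

    size_t count = 1;      /* work-item 0 always touches one line */
    size_t prev_line = 0;  /* element 0 lives in line 0 */
    for (size_t i = 1; i < n_items; ++i) {
        size_t line = (i * stride * elem_bytes) / line_bytes;
        if (line != prev_line) {  /* addresses never decrease, so a change is a new line */
            ++count;
            prev_line = line;
        }
    }
    return count;
}
```

Under these assumptions, `lines_touched(16, 1, 4, 64)` is 1 (16 consecutive floats fill exactly one line), `lines_touched(64, 1, 4, 64)` is 4 for an assumed 64-work-item wavefront writing `A[get_local_id(0)]`, and `lines_touched(64, 0, 4, 64)` is 1 for the broadcast read of `B[0]`. A stride of 16 floats, `lines_touched(64, 16, 4, 64)`, touches 64 separate lines.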