I have a kernel that uses a global read-only uint array where each work-item reads 18 addresses sequentially, i.e. every work-item has a different set of 18 uints.
I also have a global read-only uint array with only 4 elements. These 4 uints are broadcast to all work-items.
I can't (and also don't want to) use LDS for either of these.
1. Is it possible to use L1 in both the cases?
2. I have set the kernel arguments like this, as per the OpenCL Programming Guide (May 2012, p. 5-13):
__kernel void mykernel(__global uint const * restrict key,   // 18 uints per work-item
                       __global uint const * restrict salt,  // 4 uints, broadcast
                       ...)                                  // other args
Is there anything else I need to do in order to cache the data in L1?
Can you change your salt to a uint4? Wouldn't a uint4 use memory more efficiently and also be guaranteed not to alias (so you wouldn't need to worry about a restrict pointer)?
uint4 const salt,  // passed by value; __global is not allowed on by-value kernel arguments
uint salt1 = salt.s0;
uint salt2 = salt.s1;
uint salt3 = salt.s2;
uint salt4 = salt.s3;
Yes I could do that. But would it use L1 cache?
AFAIK, L1 is always enabled on the 7970, and the "const restrict" qualifiers are no longer necessary. kbrafford's advice is also useful, because salt will then be loaded into scalar registers.
Would it also work better if the key field were packed as an array of uint4 instead? (Assuming a clever way is found to deal with the odd total number of uints.)
Yes, I can pack the key field as five uint4. It would waste some space, but I'm ready to do that if the data is cached in L1. The keys will be read sequentially many times (more than 1000) per work-item.
18 uints per work-item will not fit into the L1 cache (the vector L1 on GCN is 16 KB per CU). You should dispatch at least 256 work-items per CU, so the total footprint is 18 × 4 × 256 = 18 KB.
No, my implementation is limited by LDS, so a maximum of 16 work-items per CU will be dispatched.
An L1 access has hundreds of cycles of latency even on a cache hit. Normally this is hidden by interleaved wavefront execution; one wavefront per CU is a really bad idea.
There is no way you will get performance with 16 work-items per CU. A CU should have at least 2 wavefronts assigned to it (and even that only when the kernel is very compute-intensive). If LDS size is stopping you from launching more work-items, I would rather use global buffers, because most likely your kernel is memory-intensive.
float4/int4 accesses are good for non-GCN architectures; I guess scalar loads are good for GCN, though. I would suggest you give more details on the application you are working on. I doubt there will be many bank conflicts if you access 18 uints per work-item sequentially.
I believe it would work better. uint4/float4 is the best access pattern for the L1 cache; L1 was designed for 16-byte types, which are used by game shaders. Sometimes the hardware can coalesce multiple uint accesses into one uint4 access, and sometimes the compiler can do the same.
Read-only data under 64 KB in size should go into the __constant address space.
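As a minimal sketch of that suggestion (assuming the salt stays well under the 64 KB constant limit), the broadcast salt could be declared in the __constant address space, which routes it through the constant/scalar cache instead of the vector L1:

```c
__kernel void mykernel(__global uint const * restrict key,  // 18 uints per work-item
                       __constant uint4 *salt)              // broadcast, constant cache
{
    uint4 s = *salt;  // same value for all work-items
    // ... rest of the kernel
}
```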