Archives Discussions

sayantandatta · ‎06-08-2012

Hello everyone,

Problem1.

I have a kernel that uses a global read only uint array where each work item read 18 address sequentially. i.e. All work item has a different set of 18 uints.

Problem2.

I also have a global read only array of uint with only 4 element. The 4 uints are broadcast to all work-items.

Now I can't (and also don't want to) use LDS for it both of the problems.

Questions:

1. Is it possible to use L1 in both the cases?

2. I have set the kernel arguments like this as per the OpenCL Programming Guide May 2012 Pg 5-13.

__kernel void mykernel( __global uint const * restrict key, //18 uints per work-item

__global uint const * restrict salt , //4 uint for broadcast

. //other args

.

)

Is there anything else I need to do in order to cache the data in L1?

Thanks,

Sayantan

kbrafford · ‎06-09-2012

Can you change your salt to an int4? Wouldn't an int4 use memory more efficiently and also be guaranteed not to alias (so you don't need to worry about a restricted pointer)?

__global uint4 const salt,
 ...
  uint salt1 = salt.s0;
  uint salt2 = salt.s1;
  uint salt3 = salt.s2;
  uint salt4 = salt.s3;

--Keith

sayantandatta · ‎06-09-2012

Yes I could do that. But would it use L1 cache?

sh2 · ‎06-09-2012

AFAIK, L1 is always enabled on 7970. "const restrict" qualifier is not necessary anymore. Kbrafford's advice is also useful, because salt will be loaded into scalar registers.

kbrafford · ‎06-09-2012

Would it also work better if somehow the key field was packed as an array of int4 types instead? (assuming a clever way to deal with the odd number of total ints needed was found)

--Keith

sayantandatta · ‎06-09-2012

Yes I can pack the key field as five int4. Although it would waste some space but I'm ready to do that if they are cached into L1. They will be read sequentially numerous(more than 1000 times) times in a work item.

sh2 · ‎06-09-2012

18 int per work-item does not fit into L1 cache. You should dispatch at least 256 work-items per CU, so total memory will be 18*4*256 = 18kB.

sayantandatta · ‎06-09-2012

No my implementation is limited by LDS. So a maximum of 16 work-item per CU will be dispatched.

sh2 · ‎06-09-2012

L1 access has hundreds cycles latency even if you get cache hit. Normally this could be hidden by interleave wavefront execution. One wavefront per CU is a really bad idea.

gautam_himanshu · ‎06-11-2012

Hi Sayantan,

There is no way you will get performance with 16work-iterms per CU. A CU must have atleast 2 wavefronts assigned to it(that too when kernels are very compute intensive). If the LDS size is stopping you to launch that many work-items, i would rather go with using global buffers( because most likely, your kernel is memory intensive).

The float4/int4 access are good for non-GCN architecture, I guess scalar loads are good for GCN though. I would suggest you to give more details on what application you are working on. I doubt there will be lot of bank conflicts if you access 18uints per work-item sequentially.

sh2 · ‎06-09-2012

I believe it would work better. unit4/float4 is the best pattern for the L1 cache access. L1 was designed for 16 byte types - they are used by game shaders. Sometimes hardware could coalesce multiple unit accesses into uint4 access. Sometimes compiler could do the same.

MicahVillmow · ‎06-11-2012

Read-only data that is under 64KB in size should go into the constant address space.

Archives Discussions

How to use L1 cache on 7970?