Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Adept II

Fastest way to copy from global memory to local memory


I need to copy 32 uint from global memory to local memory.

Maybe I can try vload16, or something else ?


int16 i1_16 = vload16(offset + idx, in); int16 i2_16 = vload16(offset + idx + 64, in); ((int16*)localBuf)[0] = i1_16; ((int16*)localBuf)[1] = i1_16;

6 Replies
Adept II

In fact I need to load 32xint and process them, so I do this :


size_t idx = get_local_id(0); if (idx & 15 == 0) ((int16*)localBuf)[0] = vload16(0, in + idx); Notice that "in" is a "int*" This way I try to avoid bank conflict and to serialize too much memory access. But, strangly my algorithm is not improved !!!! And even, I got wrong results !!!

Journeyman III

If copying these 32 integers are the only data you need to get from gobal buffer to local one then the following code might be better:



size_t idx = get_local_id(0); if (idx < 32) localBuf[idx] = in[idx];


Each work-item will retreive one value then you will have bank conflict between :

banks 0 & 4 & 8 & 12 & 16 & 22 ..

banks 1 & 5 & 9 & 13 & 17 & 23 ..


Right ?


1. What bank conflicts? When reading from global memory or when writing to local one?

2. Try it.

3. Read the "AMD Accelerated Parallel Processing, OpenCL Programming Guide" document, as I already suggested you in one of your posts. Paragraph 4.6.2 Channel Conflicts. It says:

If every work-item in a work-group references consecutive memory addresses, the entire wavefront accesses one channel. Although this seems slow, it actually is a fast pattern because it is necessary to consider the memory access over the entire device, not just a single wavefront.


I don't know how it work with ATI... but it is how it works with NVidia cards 😞

So, I search for a fast and portable way to transfer the whole set of data in a minimum set of instructions !


Well the right way to copy n data items using m work items is something like:

float4 *data = (float4*)(data);

for( address = data+ get_local_id(0)*4; address < base + n; address += get_group_size(0)*4 ) {

  float4 a = *address;

  local[somewhere] = a;



You'll get conflicts that way but that hardly matters because you're global bandwidth limited. You could get rid of the *4s and cut it down to float reads and do it without conflicts if you like, but you'll make much worse use of the memory system so I doubt that would be faster. If you do a write to base + get_local_id(0) % wave_size then you won't get any bank conflicts. It's a vector read and a vector write, nice and efficient and the perfect way to use a banked memory system - that's what it's designed for on a vector architecture.

Once it's in local memory do what you like with it.


This kind of thing is really the same between nvidia and AMD architectures. They're both vector machiens, they work in largely the same way from a high level. It's just AMD has a wider vector and issues multiple vector instructions at the same time VLIW fashion.