
Global memory access and ALU use optimization

Question asked by v3n0w on Mar 14, 2012
Latest reply on Mar 16, 2012 by notzed

Hello guys.


On the global memory access:


I would like to discuss, in simple terms, global memory access optimization. I need to access the coordinates of a 3D float image. The OpenCL programming guide states:


"On the ATI Radeon HD 5870 GPU, the channel selection are bits 10:8 of the byte address. This means a linear burst switches channels every 256 bytes. Since the wavefront size is 64, channel conflicts are avoided if each work-item in a wave reads a different address from a 64-word region."
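
To make the arithmetic in that quote concrete, here is a minimal host-side sketch in plain C. It assumes only what the quote says (channel = bits 10:8 of the byte address); the helper names channel_of and word_addr are hypothetical and not part of any AMD API.

```c
#include <stdint.h>

/* Channel index taken from bits 10:8 of a byte address, as the quoted
   guide text describes for the Radeon HD 5870. Three bits give 8 values. */
static unsigned channel_of(uint64_t byte_addr)
{
    return (unsigned)((byte_addr >> 8) & 0x7u);
}

/* Byte address of the 32-bit word number word_index, counted from base.
   (Hypothetical helper for illustration only.) */
static uint64_t word_addr(uint64_t base, uint64_t word_index)
{
    return base + word_index * 4u;
}
```

With a base address of 0, words 0 to 63 cover bytes 0 to 255 and all map to the same channel value; word 64 starts at byte 256 and maps to the next one. After 8 such 256-byte blocks (2048 bytes) the value wraps around.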


Does this mean that I can do a 64-word parallel access to global memory? Considering 32-bit words and a single access per work-item, that is the same size as the wavefront. I was working with NVIDIA hardware before, where there is an instruction synchronization size (the warp) and a parallel access size (the half-warp), so maybe that is why I'm a little confused.


With a work-group of 64 work-items (1D), would this be OK then:


__kernel void test(__global float * data)
{
     float x = data[get_local_id(0)];       /* words 0..63 */
     float y = data[get_local_id(0)+64];    /* words 64..127 */
     float z = data[get_local_id(0)+128];   /* words 128..191 */
}


This way I do a 64-word parallel access for each coordinate, right? Meaning three parallel accesses of 256 bytes each? Is this safely the best global memory access pattern?
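
The addresses touched by the kernel above can be worked out on the host with a short plain C sketch. It assumes the bits 10:8 rule from the quoted guide and a known base byte address for data (in practice the buffer's alignment is up to the runtime, so base is an assumption here); the names channel_of, load_addr and wave_channel are hypothetical. It only reports which channel value each load maps to and makes no claim about whether the pattern is optimal.

```c
#include <stdint.h>

enum { WAVEFRONT_SIZE = 64, COMPONENT_STRIDE = 64 };

/* Channel index from bits 10:8 of the byte address (per the quoted guide). */
static unsigned channel_of(uint64_t byte_addr)
{
    return (unsigned)((byte_addr >> 8) & 0x7u);
}

/* Byte address read by work-item lid for component 0 (x), 1 (y) or 2 (z),
   mirroring data[get_local_id(0) + component * 64] with 4-byte floats. */
static uint64_t load_addr(uint64_t base, unsigned lid, unsigned component)
{
    return base + 4u * ((uint64_t)lid + (uint64_t)component * COMPONENT_STRIDE);
}

/* Returns 1 and stores the channel in *channel when all 64 work-items map
   to the same channel value for this component; returns 0 when the 64-word
   read straddles a 256-byte boundary. */
static int wave_channel(uint64_t base, unsigned component, unsigned *channel)
{
    unsigned first = channel_of(load_addr(base, 0, component));
    for (unsigned lid = 1; lid < WAVEFRONT_SIZE; ++lid) {
        if (channel_of(load_addr(base, lid, component)) != first)
            return 0;
    }
    *channel = first;
    return 1;
}
```

Under the assumption that base is a multiple of 256, each of the three loads covers exactly one 256-byte block: x maps to one channel value, y to the next, z to the one after. If base is not 256-byte aligned (for example base = 128), each 64-word read spans two blocks instead.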



ALU use optimization:


Profiling my program, I get low ALU use and low ALU packing.




What can I do to improve it? If I use float4 and float4 operations, will that improve my ALUPacking? I'm using the global memory access pattern that I just described and ensuring that flow control doesn't diverge within the work-group (64 work-items).


Thanks in advance!