Hello guys.
On global memory access:
I'd like to discuss, in simple terms, optimizing global memory access. I need to read the coordinates of a 3D float image. The OpenCL programming guide states:
"On the ATI Radeon HD 5870 GPU, the channel selection are bits 10:8 of the
byte address. This means a linear burst switches channels every 256 bytes.
Since the wavefront size is 64, channel conflicts are avoided if each work-item
in a wave reads a different address from a 64-word region."
Does this mean I can do a 64-word parallel access to global memory? And that, considering 32-bit words and a single access per work-item, this matches the wavefront size? As I was working with NVIDIA before, where there is an instruction synchronization size (warp) and a separate parallel access size (half-warp), maybe that is why I'm a little confused.
With a 1D work-group of size 64, would this be OK:
__kernel void test(__global float *data)
{
    // Each wavefront (64 work-items) reads 64 consecutive 32-bit words,
    // i.e. one 256-byte region per coordinate plane.
    float x = data[get_local_id(0)];         // x plane: words 0..63
    float y = data[get_local_id(0) + 64];    // y plane: words 64..127
    float z = data[get_local_id(0) + 128];   // z plane: words 128..191
}
This way each coordinate is fetched as a 64-word parallel access, i.e. three consecutive 256-byte bursts, right? Am I safely getting the best global memory access pattern this way?
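For comparison, here is a minimal sketch (my own illustration, not from the guide) of the interleaved array-of-structs layout I am trying to avoid. If I understand the quote correctly, consecutive work-items would read addresses 12 bytes apart, so the wavefront no longer falls inside one 64-word region:

__kernel void test_aos(__global float *data)
{
    // Interleaved x,y,z triplets: work-item i reads words 3i, 3i+1, 3i+2.
    // A 64-wide wavefront then spans 192 words (768 bytes) per load,
    // so the accesses are strided rather than one linear burst.
    float x = data[3 * get_local_id(0)];
    float y = data[3 * get_local_id(0) + 1];
    float z = data[3 * get_local_id(0) + 2];
}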
On ALU use optimization:
Profiling my program, I get low ALU usage and low ALU packing.
What can I do to improve this? If I use float4 and float4 operations, will that improve my ALUPacking? I'm using the global memory access pattern I just described and ensuring that flow control does not diverge within the work-group (64-sized work-group).
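For reference, a minimal sketch of what I mean by the float4 version (my own guess, assuming the buffer can be padded to a multiple of 4 words and reinterpreted as float4): each work-item loads one float4, so the wavefront still reads one contiguous burst, and the four-wide arithmetic should give the VLIW compiler independent operations to pack.

__kernel void test_vec(__global float4 *data4, __global float4 *out)
{
    // One float4 load = 16 bytes per work-item; a 64-wide wavefront
    // reads 1024 contiguous bytes in a single linear burst.
    float4 v = data4[get_local_id(0)];

    // Component-wise math gives the compiler four independent
    // operations to pack into one VLIW ALU bundle.
    out[get_local_id(0)] = v * v + (float4)(1.0f);
}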
Thanks in advance!