
v3n0w
Journeyman III

Global memory access and ALU use optimization

Hello guys.

On the global memory access:

I would like to discuss, in simple terms, the global memory access optimization. I need to access the coordinates of a 3D float image. The OpenCL programming guide states:

"On the ATI Radeon HD 5870 GPU, the channel selection are bits 10:8 of the

byte address. This means a linear burst switches channels every 256 bytes.

Since the wavefront size is 64, channel conflicts are avoided if each work-item

in a wave reads a different address from a 64-word region."

Does this mean I can do a 64-word parallel access to global memory? And that, considering 32-bit words and a single access per work-item, this matches the wavefront size... As I was working with NVIDIA before, where there is an instruction synchronization size (warp) and a parallel access size (half-warp), maybe that is why I'm a little confused.

With a 64-sized work-group (1D), will this be OK then:

__kernel void test(__global float * data)
{
     /* 64 work-items each read one 32-bit word per line below,
        i.e. one 256-byte burst per coordinate plane */
     float x = data[get_local_id(0)];         /* x plane: floats 0..63 */
     float y = data[get_local_id(0) + 64];    /* y plane: floats 64..127 */
     float z = data[get_local_id(0) + 128];   /* z plane: floats 128..191 */
}

This way I do a 64-word parallel access for each coordinate, right? Meaning three 256-byte parallel accesses, right? Am I safely getting the best global memory access this way?

ALU use optimization:

Profiling my program I get a low ALU use and a low ALU packing:

ALUBusy: 35
ALUPacking: 23

What can I do to improve them? If I use float4 data and float4 operations, will that improve my ALUPacking? I'm using the global memory access pattern I just described and ensuring the flow control doesn't diverge within the work-group (64-sized work-group).
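For concreteness, a minimal sketch of the float4 variant I have in mind (illustrative only: it keeps the same x/y/z plane layout, but each work-item covers 4 points, so a group of 64 points needs only 16 work-items and the offsets are in float4 units):

__kernel void test4(__global const float4 * data)
{
     /* one vector load = 4 consecutive floats per coordinate plane */
     float4 x = data[get_local_id(0)];        /* floats 0..63:    x plane */
     float4 y = data[get_local_id(0) + 16];   /* floats 64..127:  y plane */
     float4 z = data[get_local_id(0) + 32];   /* floats 128..191: z plane */
}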

Thanks in advance!

notzed
Challenger

(since nobody else has answered ...)

Hmm, that bit of the manual is pretty complicated.  From the example there, using float4 is faster than float1 for a simple copy, which will be the most memory-intensive case. Interestingly, though, the speedup is the same ratio as 4:3, so if you only need 3 floats, using 3 separate floats might work out the same.

From the compiler output I've seen, memory accesses are grouped into the same clause: so since you're reading 64x3 values in series across the work-group either way, they may amount to the same thing.  Try each and see, perhaps.  Storing x/y/z next to each other or in a float4 (float3 is aligned to float4 in RAM) will change the locality of reference, which might be important (but 'intuition' isn't always a useful guide on these devices).
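For example (a sketch only, name made up): with the packed layout each work-item does a single 16-byte vector read instead of three strided scalar reads:

__kernel void test_packed(__global const float4 * points)
{
     /* x, y, z packed per point; .w is unused padding */
     float4 p = points[get_local_id(0)];
}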

float4 would improve ALU packing, but only if you're doing more total element operations, i.e. doing 4 items at once.  But it will also increase register use and bandwidth per work-item, and reduce parallelism (either because there are 4x fewer work-items, or because resources are limited), so it might end up slower: it depends very much on the algorithm.  If you try this then it may make sense to store the x, y, z values as separate float4 'planes' (which could be stored similarly to above; see the sketch below).   Using vectors in this way is pretty much the same as unrolling a loop, e.g. doing the same thing 4 times on different values; it's just a cleaner way to express it.  And like loop unrolling it provides more non-dependent work to fill in idle slots.
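Something along these lines, say (just a sketch under those assumptions; the separate xs/ys/zs buffers and the squared-length computation are made up for illustration):

__kernel void test_planes(__global const float4 * xs,
                          __global const float4 * ys,
                          __global const float4 * zs,
                          __global float4 * out)
{
     int i = get_global_id(0);
     float4 x = xs[i];                   /* x values of 4 points */
     float4 y = ys[i];
     float4 z = zs[i];
     out[i] = x*x + y*y + z*z;           /* e.g. squared length, 4 points at once */
}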

A rule of thumb I use: when it makes sense, use int4/float4/etc. for memory accesses, as this is the type of workload a graphics card is optimised for (both for memory access and ALU load).   And access whatever element you have in a 1:1 relationship to the work-item (i.e. each next work-item accesses the next memory cell).  And try to keep the addressing simple: if the kernel isn't doing much real work, complex addressing calculations can be expensive.

Although the guide goes into excruciating detail about the memory banking (which afaict is 10 bits, not 8??), one also has to remember that for a given task you will have to fetch a given amount of memory, and you will normally have many processors vying for the memory at the same time anyway.  So you will always get bank conflicts across the device, but so long as you avoid the pathological cases (e.g. a large stride per work-item id), this memory access latency is what all those thousands of concurrent threads are designed to hide.