(since nobody else has answered ...)
Hmm, that bit of the manual is pretty complicated. From the example there, using float4 is faster than plain float for a simple copy, which is the most memory-intensive case. Interestingly, the ratio is about 4:3, so if you only need 3 floats, using 3 separate floats might work out the same (both variants are sketched below).
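Roughly what that comparison looks like in kernel form (the names are mine, not the manual's):

    // One float per work item: work item i copies element i.
    __kernel void copy_f1(__global const float *src, __global float *dst)
    {
        size_t i = get_global_id(0);
        dst[i] = src[i];
    }

    // One float4 per work item: each moves 16 bytes, so the NDRange is
    // a quarter of the size for the same total data.
    __kernel void copy_f4(__global const float4 *src, __global float4 *dst)
    {
        size_t i = get_global_id(0);
        dst[i] = src[i];
    }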
From the compiler output I've seen, memory accesses are grouped into the same clause: since you're reading 64x3 values in series across the workgroup either way, the two layouts may amount to the same thing. Try each and see, perhaps. Storing x/y/z next to each other or in a float4 (float3 is aligned to float4 in RAM) will change the locality of reference, which might be important (but 'intuition' isn't always a useful guide on these devices).
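To make the two layouts concrete (the length computation is a placeholder of mine):

    // Packed into float4 (xyz + a pad lane): one load per point.
    __kernel void len_vec(__global const float4 *pts, __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = length(pts[i].xyz);   // the swizzle drops the pad lane
    }

    // Separate x/y/z arrays: three loads per point, different locality.
    __kernel void len_soa(__global const float *xs, __global const float *ys,
                          __global const float *zs, __global float *out)
    {
        size_t i = get_global_id(0);
        out[i] = sqrt(xs[i]*xs[i] + ys[i]*ys[i] + zs[i]*zs[i]);
    }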
float4 would improve ALU packing, but only if you're doing more total element operations, i.e. doing 4 items at once. It will also increase register use and bandwidth per work item, and reduce parallelism (either because there are 4x fewer work items, or because resources are limited), so it might end up slower: it depends very much on the algorithm. If you try this it may make sense to store the x, y, z values as separate float4 'planes' (which could be stored similarly to above; sketched below). Using vectors this way is pretty much the same as unrolling a loop, i.e. doing the same thing 4 times on different values; it's just a cleaner way to express it. And like loop unrolling, it provides more non-dependent work to fill in idle slots.
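A sketch of the planes idea, with a trivial scale standing in for the real work:

    // x, y and z each live in their own float4 array, so one work item
    // handles 4 points per component: the vector form of a 4x unrolled loop.
    __kernel void scale_planes(__global float4 *xs, __global float4 *ys,
                               __global float4 *zs, float s)
    {
        size_t i = get_global_id(0);
        xs[i] *= s;   // each line is 4 independent multiplies
        ys[i] *= s;
        zs[i] *= s;
    }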
A rule of thumb I use: when it makes sense, use int4/float4/etc. for memory accesses, since that's the type of workload a graphics card is optimised for (both for memory access and ALU load). Access whatever element you have in a 1:1 relationship to the work item (i.e. each successive work item accesses the next memory cell). And try to keep the addressing simple: if the kernel isn't doing much real work, complex addressing calculations can be expensive (contrast sketched below).
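Purely illustrative, the transposed indexing being my own example:

    // 1:1 addressing: the address is just the work item id.
    __kernel void scale_simple(__global float *buf, float s)
    {
        size_t i = get_global_id(0);
        buf[i] *= s;
    }

    // Same 'real work', but every access now pays for an integer divide
    // and modulo; on a memory-bound kernel that can dominate.
    __kernel void scale_transposed(__global float *buf, float s, uint w, uint h)
    {
        size_t i = get_global_id(0);
        buf[(i % w) * h + (i / w)] *= s;
    }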
Although the guide goes into excruciating detail about the memory banking (which afaict is 10 bits, not 8??), one also has to remember that for a given task you have to move a given amount of memory, and you will normally have many processors vying for it at the same time anyway. So you will always get bank conflicts somewhere across the device, but so long as you avoid the pathological cases (e.g. a large stride per work item id, as below), this memory access latency is exactly what all those thousands of concurrent threads are designed to hide.
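For reference, the pathological pattern looks something like this (STRIDE is made up):

    #define STRIDE 1024   // large power-of-two stride: the worst case

    // Consecutive work items read addresses STRIDE floats apart, so the
    // whole wavefront piles onto a handful of banks/channels at once.
    __kernel void gather_bad(__global const float *src, __global float *dst)
    {
        size_t i = get_global_id(0);
        dst[i] = src[i * STRIDE];   // unit stride, src[i], would spread evenly
    }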