I have a workgroup size of 16x16 and I need to write a local array of 3*16*16 floats to global memory for every workgroup. Would it be faster to interpret the array as float4 and use only 192 of the 256 work-items to write one float4 each? How can one reason about this kind of thing without writing timing tests? Rules of thumb are greatly appreciated!
Edit: I am doing this on the GPU (HD 6850)
For questions like yours, you may find it very helpful to try out different ideas using the AMD APP KernelAnalyzer, at least to compare memory accesses using float4 versus float.
I would also consider using the built-in async_work_group_copy (there is a strided version as well, async_work_group_strided_copy). That way you don't need to worry too much about the details, and the compiler/runtime will (hopefully) choose the optimal path.
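As a rough illustration of the built-in copy, here is a minimal OpenCL C sketch for the 16x16 case from the question. The kernel name `write_tile`, the argument `out`, and the placeholder fill loop are hypothetical, and the output buffer is assumed to hold 768 floats per work-group. It is device code, so it is compiled by the OpenCL runtime rather than by a host C compiler.

```c
// OpenCL C device code. Hypothetical names; assumes a 16x16 work-group.
#define TILE_FLOATS (3 * 16 * 16)   // 768 floats per work-group

__kernel void write_tile(__global float *out)
{
    __local float tile[TILE_FLOATS];

    // Flattened index of this work-item inside the work-group (0..255).
    const size_t lid = get_local_id(1) * get_local_size(0) + get_local_id(0);
    // Flattened index of this work-group inside the 2D grid.
    const size_t group = get_group_id(1) * get_num_groups(0) + get_group_id(0);

    // Placeholder fill (stands in for the real computation):
    // each of the 256 work-items writes 3 of the 768 values.
    for (size_t i = lid; i < TILE_FLOATS; i += 256)
        tile[i] = (float)i;

    // The async copy does not synchronize the source data by itself,
    // so make all local writes visible before it starts.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Must be reached by every work-item in the group with identical arguments.
    event_t e = async_work_group_copy(out + group * TILE_FLOATS,
                                      tile, TILE_FLOATS, 0);
    wait_group_events(1, &e);
}
```

Whether this beats a hand-written float or float4 store loop depends on how the implementation lowers the copy, so it is still worth comparing the variants on the target device.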
Unfortunately, in this case there aren't very clear rules to go by, since many speed factors have both pros and cons. As a rough example, global memory accesses on your device are usually best done as float4, which takes one memory instruction compared to four with plain float. But then the local memory accesses suffer from bank conflicts: there are 32 banks, each 4 bytes wide, and the 64 work-items in a wavefront each access 4 consecutive floats, so the local memory access is serialized four times. This trade-off may still be worth it, but usually you have to benchmark it to see.
Thanks! I did not know about async_work_group_copy, I'll probably just use that rather than my own version of the same thing.