I have a workgroup size of 16x16, and each workgroup needs to write a local array of 3*16*16 (768) floats to global memory. Would it be faster to reinterpret the array as float4 and have only 192 of the 256 work-items write one float4 each? How can one reason about this kind of trade-off without writing timing tests? Rules of thumb are greatly appreciated!
Edit: I am doing this on a GPU (HD 6850).
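To make the two options concrete, here is a rough sketch of the index mapping each one implies. It is plain C that emulates the work-items with a loop on the host, not actual kernel code, and it says nothing about which variant is faster. The names `write_scalar`, `write_vec4`, `WG_ITEMS`, `N_FLOATS` and `N_VEC4` are made up for illustration. In a real kernel, `lid` would be the flattened local ID (for example `get_local_id(1) * 16 + get_local_id(0)`) and the inner loop of the second variant would be a single float4 store.

```c
#include <stddef.h>

/* Sizes taken from the question: 16x16 work-items, 3*16*16 floats per group. */
enum {
    WG_ITEMS = 16 * 16,        /* 256 work-items per workgroup      */
    N_FLOATS = 3 * WG_ITEMS,   /* 768 floats to write per workgroup */
    N_VEC4   = N_FLOATS / 4    /* 192 float4 values                 */
};

/* Variant A (hypothetical): all 256 work-items write 3 scalar floats each.
 * In pass k, work-item lid writes element k*256 + lid, so neighbouring
 * work-items touch neighbouring addresses. */
static void write_scalar(const float *local_buf, float *global_out)
{
    for (int lid = 0; lid < WG_ITEMS; ++lid) {   /* emulates the work-items */
        for (int k = 0; k < 3; ++k) {
            size_t i = (size_t)k * WG_ITEMS + (size_t)lid;
            global_out[i] = local_buf[i];
        }
    }
}

/* Variant B (hypothetical): only the first 192 work-items are active, and
 * each writes one float4, i.e. 4 consecutive floats starting at 4*lid.
 * Returns the number of work-items that actually wrote something. */
static int write_vec4(const float *local_buf, float *global_out)
{
    int active = 0;
    for (int lid = 0; lid < WG_ITEMS; ++lid) {   /* emulates the work-items */
        if (lid >= N_VEC4)
            continue;                            /* the remaining 64 stay idle */
        for (int c = 0; c < 4; ++c) {            /* one float4 store in a kernel */
            size_t i = (size_t)4 * (size_t)lid + (size_t)c;
            global_out[i] = local_buf[i];
        }
        ++active;
    }
    return active;
}
```

Both variants cover all 768 floats exactly once. The difference is only in how many work-items issue stores and how wide each store is, which is the trade-off I am asking about.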