In the "ATI Stream SDK v2.01 Performance and Optimization" document, I noticed the following suggestion:
Vectorize. AMD GPU hardware is fundamentally a five-wide VLIW unit. Vectorization can
lead to substantially greater efficiency. The ALUPacking counter ...
Does that mean that using swizzle operators wherever possible helps to improve speed? I know OpenCL does not support float3, so I use float4 in most cases. If I declare float4 var; and do the calculation through var.xyz = ..., will I gain any speed?
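
To make the question concrete, here is a minimal sketch of the two variants I am comparing. The kernel names (add_full, add_xyz) and the buffer layout are made up for illustration and are not taken from the SDK document. I am also assuming the compiler accepts three-component swizzles on a float4.

```c
// Variant 1: operate on the whole float4, including the unused w lane.
// (Hypothetical kernel name, for illustration only.)
__kernel void add_full(__global const float4 *a,
                       __global const float4 *b,
                       __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}

// Variant 2: touch only x, y and z through a swizzle; w stays at 0.
// (Hypothetical kernel name; assumes .xyz swizzles are accepted.)
__kernel void add_xyz(__global const float4 *a,
                      __global const float4 *b,
                      __global float4 *out)
{
    size_t i = get_global_id(0);
    float4 var = (float4)(0.0f);
    var.xyz = a[i].xyz + b[i].xyz;
    out[i] = var;
}
```

What I would like to know is whether the second form packs better into the five-wide VLIW unit than the first, or whether the compiler treats both the same way.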