The AMD optimization guide says:
"(On GCN) vectorization is no longer needed, nor desirable; in fact, it can affect performance. It is recommended not to combine work-items." |
The last line here is important. It effectively says that, because the ALUs and VGPRs are scalar on GCN, programmers no longer need to use explicit vectorization to combine the workload of multiple work-items. Explicit vectorization was preferred on the earlier VLIW architectures.
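As a hypothetical illustration (not taken from the guide), a plain one-element-per-work-item kernel like the sketch below is already a good fit for GCN; there is no need to manually pack the work of four work-items into a float4 the way one might have done on the older VLIW parts:

```c
// Hypothetical sketch: on GCN, a simple scalar kernel keeps the wavefront
// busy without any manual packing of work-items into vector operations.
__kernel void scale(__global const float *in,
                    __global float *out,
                    float factor)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * factor;   // no need to combine 4 work-items into a float4
}
```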
Now, coming to the float4 data-type question: if the application accesses or operates on all the components/fields at once, it makes sense to keep them together in a vector. For example, the "NBody" sample in the AMD APP SDK uses float4 to represent the position and velocity of a particle in the X, Y, and Z directions.
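A simplified sketch along those lines (this is not the actual NBody sample code; the kernel name and the unused .w component are assumptions for illustration):

```c
// Simplified sketch (not the actual AMD APP SDK NBody code): position and
// velocity are kept in float4, so the x, y, z components (plus a spare .w,
// unused here) are loaded and updated together.
__kernel void integrate(__global float4 *pos,
                        __global const float4 *vel,
                        float dt)
{
    size_t gid = get_global_id(0);
    // One vector operation updates all components of the particle at once.
    pos[gid] += vel[gid] * dt;
}
```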
There is an additional benefit to using float4. Modern CPUs contain vector units (SSE, AVX) that can be used efficiently when vector data types are used, and four-wide vector types (int4, float4, etc.) are preferred on AMD CPUs. So the same kernel may also run well on AMD CPUs.
However, I would like to mention one important point here. There are certain cases where float4 or larger types can exhibit lower performance. For example, use of float4 can lead to local memory (LDS) bank conflicts on AMD GPUs, as in the sketch below. Hence, performance analysis is very important to discover bottlenecks and find ways to optimize the application's performance.
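For instance, here is the kind of access pattern that can run into this (an illustrative sketch only; the exact behaviour depends on the GPU's LDS bank layout and access size, and the work-group size of 64 is an assumption):

```c
// Illustrative sketch: with 32 LDS banks of 4 bytes each, a float4 per
// work-item means a 16-byte stride, so several work-items in a wavefront
// can land on the same bank and the accesses get serialized.
__kernel void lds_float4(__global const float4 *in, __global float4 *out)
{
    __local float4 tile[64];        // assumes a work-group size of 64
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];            // strided (16-byte) LDS writes
    barrier(CLK_LOCAL_MEM_FENCE);

    // Possible mitigations: pad the array, or store the data as separate
    // float arrays (structure of arrays) so consecutive work-items hit
    // consecutive banks.
    out[gid] = tile[(lid + 1) % 64];
}
```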
Regards,