First of all, hi to everyone! I am a physicist and I am currently developing a Monte Carlo code for particle transport using OpenCL.
My question is whether it is worthwhile to use vector data types (like float4) on AMD CPUs and GPUs. After reading some documentation I have become somewhat confused about this issue. For example, on my iMac (CPU: Intel(R) Core(TM) i5-6500, GPU: AMD Radeon R9 M380), querying the CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT attribute returns 4 for the CPU and 1 for the GPU.
The result for the GPU is what prompts my question: what, then, is the advantage of using vector data types on GPUs? I have read that GPUs nowadays use mostly scalar processors and therefore do not benefit from vector data types in arithmetic computation. Is that also true for loading/storing data from/to global memory?
Currently my code stores the particle attributes (position, velocity, energy, etc.) as float4 data types. Should I keep doing that, or should I just use plain floats? Thanks for your help!
The AMD optimization guide says:
"(On GCN) vectorization is no longer needed, nor desirable; in fact, it can affect performance. It is recommended not to combine work-items."
The last sentence is the important one here. It effectively says that, because the ALUs and VGPRs are scalar on GCN, programmers no longer need to use explicit vectorization to combine the workload of multiple work-items. Explicit vectorization was preferred on the earlier VLIW architectures.
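To make "combining the workload of multiple work-items" concrete, here is a rough sketch in plain C rather than OpenCL C. It emulates a 1-D kernel launch with an ordinary loop, where `gid` stands in for `get_global_id(0)`. All function names are hypothetical and the scaling operation is only a placeholder workload. It shows the two styles side by side and says nothing about their relative speed.

```c
#include <assert.h>
#include <stddef.h>

/* Body of a hypothetical kernel in which each work-item handles ONE element.
   gid plays the role of get_global_id(0). */
static void scale_item(size_t gid, const float *in, float *out, float a) {
    out[gid] = a * in[gid];
}

/* Manually vectorized variant: each work-item handles FOUR consecutive
   elements, i.e. the workload of four work-items combined (float4 style). */
static void scale_item_x4(size_t gid, const float *in, float *out, float a) {
    for (size_t c = 0; c < 4; ++c)
        out[4 * gid + c] = a * in[4 * gid + c];
}

/* Serial stand-in for launching the one-element kernel over n work-items. */
void run_scalar(const float *in, float *out, size_t n, float a) {
    for (size_t gid = 0; gid < n; ++gid)
        scale_item(gid, in, out, a);
}

/* Serial stand-in for launching the combined kernel over n / 4 work-items. */
void run_x4(const float *in, float *out, size_t n, float a) {
    assert(n % 4 == 0); /* sketch only: no handling of a leftover tail */
    for (size_t gid = 0; gid < n / 4; ++gid)
        scale_item_x4(gid, in, out, a);
}
```

Both variants produce the same output. The second simply launches a quarter as many work-items, each doing four times the work, which is the pattern the guide advises against on GCN.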
Now, coming to the float4 data type question: if the application accesses or operates on all the components/fields at once, it makes sense to pack them together in a vector. For example, the "NBody" sample in the AMD APP SDK uses float4 to represent the position and velocity of a particle in the X, Y, and Z directions.
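As a rough illustration of that layout, here is a host-side sketch in plain C. The `vec4f` struct is a stand-in for OpenCL's float4, and `particle` and `advance` are hypothetical names, not taken from the NBody sample. What goes into the fourth component is an assumption (here, energy).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for OpenCL's float4: four floats stored contiguously (16 bytes). */
typedef struct { float x, y, z, w; } vec4f;

/* Hypothetical particle record whose fields are used together every step. */
typedef struct {
    vec4f pos; /* x, y, z position; w is assumed to hold the energy */
    vec4f vel; /* x, y, z velocity; w unused in this sketch */
} particle;

/* Advance n particles by one time step dt.
   All three position components of a particle are read and written together,
   which is the access pattern that makes a packed 4-wide type a natural fit. */
void advance(particle *p, size_t n, float dt) {
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        p[i].pos.x += p[i].vel.x * dt;
        p[i].pos.y += p[i].vel.y * dt;
        p[i].pos.z += p[i].vel.z * dt;
        /* pos.w (energy, by assumption) is left unchanged by the move */
    }
}
```

In an OpenCL kernel, the three component updates could be written as a single vector expression on the float4 values. If a kernel only ever touched one field, for example the energy, a separate array of plain floats might be the better fit.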
There is an additional benefit of using float4. Modern CPUs contain vector units (SSE, AVX) that can be used efficiently when the kernel uses vector data types. Four-wide vector types (int4, float4, etc.) are preferred on AMD CPUs, so the same kernel might also run well on AMD CPUs.
However, I would like to mention one important point here. There are certain cases where float4 or larger types can exhibit lower performance. For example, the use of float4 can lead to local memory (LDS) bank conflicts on AMD GPUs. Hence, performance analysis is very important to discover bottlenecks and find ways to optimize the application’s