Hi,
As I understand your intent, a four-component vector type would be ideal for your case. For best performance, consider two points before choosing the size of the vector type: 1) the maximum number of LDS bytes that each stream core can request per cycle (the actual number depends on the GPU architecture), and 2) the number of memory bank requests the LDS can serve per cycle. For example, the following section of the APP Programming Guide says:
6. OpenCL Performance and Optimization for GCN Devices->6.2 Local Memory (LDS) Optimization
A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the AMD Radeon HD 7XXX GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the AMD Radeon HD 7XXX GPU and delivers half the performance of the float access pattern.
Now, suppose you store each color channel as a float or an int. A four-channel pixel is then the same size as a float4 or int4, so reading it from LDS will show the same problem, and performance will accordingly be half that of the float access pattern.
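To make the size difference concrete, here is a minimal host-side C sketch. It assumes your color channels fit in 8 bits each, which your question does not confirm, so treat it as an illustration only. The type names PixelF32 and PixelU8 and the helpers pack_rgba8 and unpack_channel are hypothetical, not part of any AMD or OpenCL API. PixelF32 has the same size as an OpenCL float4 (16 bytes), while PixelU8 has the same size as a uchar4 (4 bytes), i.e. the same per-work-item footprint as a single float.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pixel layouts, for illustration only. */
typedef struct { float   r, g, b, a; } PixelF32; /* 16 bytes, like OpenCL float4 */
typedef struct { uint8_t r, g, b, a; } PixelU8;  /*  4 bytes, like OpenCL uchar4 */

/* Pack four 8-bit channels into one 32-bit word (r in the lowest byte). */
static uint32_t pack_rgba8(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)r
         | ((uint32_t)g << 8)
         | ((uint32_t)b << 16)
         | ((uint32_t)a << 24);
}

/* Extract one channel (0 = r, 1 = g, 2 = b, 3 = a) from a packed word. */
static uint8_t unpack_channel(uint32_t packed, unsigned channel)
{
    assert(channel < 4);
    return (uint8_t)((packed >> (8u * channel)) & 0xFFu);
}
```

With 8-bit channels, each work-item would read 4 bytes from LDS instead of 16. Whether that actually removes the penalty on your device is something to confirm with a profiler; the sketch only shows the per-pixel sizes and the packing.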
Regards,