Given what you want to do, a four-component vector type would be ideal for your case. To get the best performance, consider two things before choosing the size of the vector type: 1) the maximum number of LDS bytes that each stream core can request per cycle (the actual number depends on the GPU architecture), and 2) the number of memory bank requests the LDS can serve per cycle. For example, the following section of the AMD APP Programming Guide says:
6. OpenCL Performance and Optimization for GCN Devices->6.2 Local Memory (LDS) Optimization
A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the AMD Radeon HD 7XXX GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the AMD Radeon HD 7XXX GPU and delivers half the performance of the float access pattern.
Now, suppose you store each color channel as a float or an int. Each pixel then becomes a float4 or int4 value, so reading it from LDS runs into the same problem described above, and performance will be half accordingly.
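To put rough numbers on this, here is a small host-side C sketch that compares how many LDS bytes a wavefront requests for different pixel layouts. The struct names and helper functions are made up for illustration; the only assumptions are that the stand-in structs have the same sizes as OpenCL's uchar4 (4 bytes) and float4 (16 bytes), and that a GCN wavefront has 64 work-items. It only counts bytes and does not model how the LDS actually schedules bank accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side stand-ins for the OpenCL C vector types.
   Assumption: same sizes as on the device (uchar4 = 4 bytes, float4 = 16 bytes). */
typedef struct { uint8_t x, y, z, w; } rgba_uchar4;
typedef struct { float   x, y, z, w; } rgba_float4;

enum { WAVEFRONT_SIZE = 64 };  /* work-items per wavefront on GCN */

/* Bytes one wavefront requests from LDS when every work-item reads one element. */
static size_t lds_bytes_per_wavefront(size_t element_size)
{
    return element_size * WAVEFRONT_SIZE;
}

/* How many times more LDS traffic one element generates than a single float.
   Only meaningful for element sizes that are a multiple of sizeof(float). */
static size_t traffic_relative_to_float(size_t element_size)
{
    assert(element_size % sizeof(float) == 0);
    return element_size / sizeof(float);
}
```

Calling lds_bytes_per_wavefront(sizeof(rgba_uchar4)) gives the same byte count as a plain float read per work-item, while sizeof(rgba_float4) gives four times as much. So if your channels fit in 8 bits, a packed four-component type keeps each pixel to a single 4-byte LDS access, whereas float or int channels put you in the float4/int4 case from the guide.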