According to section 6.2, "Local Memory (LDS) Optimization", in chapter 6, "OpenCL Performance and Optimization for GCN Devices", of the AMD APP Programming Guide:
A simple sequential address pattern, where each work-item reads a float2 value from LDS, generates a conflict-free access pattern on the AMD Radeon HD 7XXX GPU. Note that a sequential access pattern, where each work-item reads a float4 value from LDS, uses only half the banks on each cycle on the AMD Radeon HD 7XXX GPU and delivers half the performance of the float access pattern.
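My guess is that, in kernel1, vload4 has the same effect as reading a float4 value from LDS, so by the guide's description it would use only half the banks on each cycle and deliver about half the performance of the float access pattern.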
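To make the comparison between vload4 and a float4 read concrete, here is a minimal host-side sketch in plain C (not OpenCL C). It emulates only the addressing rule that the OpenCL specification gives for vload4(offset, p), namely that it reads four consecutive floats starting at p + offset * 4, and compares that with indexing the same buffer as an array of float4. The names float4_t, emu_vload4, emu_float4_read and same_footprint are hypothetical helpers made up for this illustration. The sketch says nothing about LDS banks or timing; it only checks that each work-item would touch the same 16 bytes either way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for OpenCL's float4; not the real device type. */
typedef struct { float x, y, z, w; } float4_t;

/* Emulates the addressing rule of OpenCL's vload4(offset, p):
   reads 4 consecutive floats starting at p + offset * 4. */
static float4_t emu_vload4(size_t offset, const float *p)
{
    float4_t v;
    assert(p != NULL);
    memcpy(&v, p + offset * 4, sizeof v);
    return v;
}

/* Emulates a direct float4 read, i.e. ((float4 *)p)[offset],
   by stepping through the buffer in 16-byte elements. */
static float4_t emu_float4_read(size_t offset, const float *p)
{
    float4_t v;
    assert(p != NULL);
    memcpy(&v, (const unsigned char *)p + offset * sizeof(float4_t), sizeof v);
    return v;
}

/* Returns 1 if both access styles fetch identical values for every
   work-item id in [0, work_items), and 0 otherwise.
   The buffer must hold at least work_items * 4 floats. */
static int same_footprint(const float *lds, size_t work_items)
{
    for (size_t i = 0; i < work_items; ++i) {
        float4_t a = emu_vload4(i, lds);
        float4_t b = emu_float4_read(i, lds);
        if (memcmp(&a, &b, sizeof a) != 0)
            return 0;
    }
    return 1;
}
```

Filling a 256-float buffer with distinct values and calling same_footprint(buf, 64) returns 1, which is consistent with the guess that a sequential vload4 produces the same per-work-item address pattern as a sequential float4 read. Whether the hardware then serves both at the same rate is something only a measurement on the device can confirm.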