For questions like yours, you may find it very helpful to try out different ideas in the AMD APP KernelAnalyzer, at least to compare memory accesses using float4 versus float.
I would also consider using the built-in async_work_group_copy (there is a strided version as well, async_work_group_strided_copy). That way you don't need to worry too much about the details, and the compiler/runtime will, hopefully, choose the optimal path.
Unfortunately, in this case there aren't very clear rules to go by, since many speed factors have both pros and cons. As a rough example, global memory accesses on your device are usually best done with float4, which takes one memory instruction as compared to four with plain float, but then local memory accesses suffer from bank conflicts (with 32 banks that are 4 bytes wide and 64 work-items in a wavefront each accessing 4 consecutive floats, the local memory access is serialized four times). This trade-off may still be worth it, but usually you have to benchmark it to see.
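To make the bank-conflict arithmetic concrete, here is a minimal host-side sketch in plain C (not OpenCL kernel code). It assumes a deliberately simplified model: work-item i touches the 4-byte word at index i * stride + offset, banks are assigned round-robin by word index, and a float4 access is treated as four separate component accesses with a stride of 4 words. The function name max_bank_load and the MAX_BANKS bound are made up for illustration, and real hardware may schedule these accesses differently, so treat it as an estimate rather than a measurement.

```c
#include <assert.h>

#define MAX_BANKS 64  /* upper bound for this sketch only, not a hardware limit */

/* Simplified model (an assumption, not a hardware spec): work-item i touches
 * the 4-byte word at index i * stride_words + offset, and consecutive words
 * map to consecutive banks. Returns the largest number of work-items that
 * land on the same bank, i.e. a rough count of serialized passes. */
int max_bank_load(int work_items, int stride_words, int offset, int num_banks)
{
    int load[MAX_BANKS] = {0};
    int worst = 0;

    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    assert(stride_words >= 0 && offset >= 0);

    for (int i = 0; i < work_items; ++i) {
        int bank = (i * stride_words + offset) % num_banks;
        if (++load[bank] > worst)
            worst = load[bank];
    }
    return worst;
}
```

Under this model, max_bank_load(64, 1, 0, 32) returns 2 for plain float, while max_bank_load(64, 4, c, 32) returns 8 for each float4 component c from 0 to 3. That is four times the scalar case, which is where the "serialized four times" estimate comes from. It says nothing about whether the saving on the global memory side outweighs it, so benchmarking is still needed.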