To fully use the GPU's potential, you need to keep all of the slots in the 5-wide VLIW filled, right? And to do that, if I use float4s as much as possible, the compiler will be able to easily fill the 4 single-precision floating-point slots, correct? What can I do to help the compiler do something useful with the fifth slot?
Well, if there is no data dependency, the compiler will fill the 5th slot with the first float of the next float4 operation. So if you have 5 independent vector operations, they will take 4 cycles, not 5, because the four components of the 5th operation can be packed into the 5th slots of the other 4.
It doesn't work "exactly" like this, but this is the general idea.
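To make the arithmetic concrete, here is a rough sketch in plain C. It assumes an idealised 5-wide bundle where every scalar operation is independent and any operation can go in any slot; the real compiler has more restrictions than that. The names VLIW_WIDTH and min_bundles are made up for illustration.

```c
#define VLIW_WIDTH 5  /* slots per bundle: 4 vector lanes plus the 5th slot (assumed interchangeable here) */

/* Idealised lower bound on the number of bundles (cycles) needed for
 * `scalar_ops` fully independent scalar operations: ceil(scalar_ops / 5). */
int min_bundles(int scalar_ops)
{
    return (scalar_ops + VLIW_WIDTH - 1) / VLIW_WIDTH;
}
```

Five independent float4 operations are 5 * 4 = 20 scalar operations, so min_bundles(20) gives 4, which matches the "4 cycles, not 5" figure above.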
There's a good chance the kernel is already using T, the fifth part of the VLIW bundle.
Use Stream Kernel Analyzer to find out.
If the kernel is memory-bound then the key is to make the minimum number of trips to memory and fetch in the biggest possible lumps. So float4 fetches from memory are better than float fetches.
With local memory reads and writes, float2s or larger are preferable.
When the kernel is ALU-bound, manual loop unrolling is the way to go, if you have any loops to unroll. There's a huge variety of approaches to loop unrolling.
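As one example of what manual unrolling can look like, here is a minimal sketch in plain C rather than an actual kernel. The function names are hypothetical, and unrolling by 4 with separate accumulators is just one of the many possible approaches. The point is that the four multiply-adds in the unrolled body do not depend on each other, which gives the compiler independent work to pack into a bundle.

```c
#include <stddef.h>

/* Rolled version: every iteration depends on the previous value of sum. */
float dot_rolled(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4 with four independent accumulators. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main body: four independent multiply-adds per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Remainder: handle the last n % 4 elements. */
    for (; i < n; ++i)
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the unrolled version adds the terms in a different order, so for general floating-point data the two results can differ slightly in the last bits.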
If you have no loops to unroll then it's possible to create them by making one work item process multiple elements from the domain of execution. Even a simple loop over four such elements processed by a single work item could be enough.
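A rough sketch of the "one work item, four elements" idea, again in plain C with the launch emulated by an ordinary loop on the host. Everything here (work_item, run_all, ELEMS_PER_ITEM, the scale-by-k operation) is hypothetical and only meant to show the indexing; in a real kernel the gid would come from the runtime and the inner loop is what you would then unroll or turn into a float4 operation.

```c
#include <stddef.h>

#define ELEMS_PER_ITEM 4  /* elements handled by each work item (assumed value) */

/* Body of one hypothetical work item: scales ELEMS_PER_ITEM consecutive
 * elements instead of just one. The bounds check covers the last item
 * when n is not a multiple of ELEMS_PER_ITEM. */
static void work_item(size_t gid, const float *in, float *out, size_t n, float k)
{
    size_t base = gid * ELEMS_PER_ITEM;
    for (size_t j = 0; j < ELEMS_PER_ITEM; ++j) {
        size_t idx = base + j;
        if (idx < n)
            out[idx] = in[idx] * k;
    }
}

/* Host-side stand-in for launching the work items: the domain of
 * execution shrinks from n items to ceil(n / ELEMS_PER_ITEM) items. */
void run_all(const float *in, float *out, size_t n, float k)
{
    size_t items = (n + ELEMS_PER_ITEM - 1) / ELEMS_PER_ITEM;
    for (size_t gid = 0; gid < items; ++gid)
        work_item(gid, in, out, n, k);
}
```

Giving each work item four consecutive elements also lines up with the earlier advice about fetching float4s rather than single floats.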
All of these vectorisation techniques run the risk of being unfriendly to other architectures, e.g. NVIDIA prefers limited unrolling because of register allocation pressure. ATI will also suffer with too much unrolling, so don't go completely crazy.