The optimization chapter in the programming guide is pretty thorough. Do you feel that's not enough?
I agree that a few more performance samples would be helpful, though.
http://developer.amd.com/gpu/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
For example, section 4-31 of that guide talks about float4.