Unless I am mistaken, the hellocal.cpp in the SDK 1.2.1 version I have uses float (not float4) for the input/output buffers. It uses only one float4 element for the constant buffer (so it can carry up to four float constants).
Yes, you are right. But what I meant is that there are 256x256 threads, each with an output register o0 of four components. How come the output is 256x256 floats and not 256x256x4 floats? There must be something hidden, which probably says that only o0.x is effective. I just can't explain it.