Is 128 bits the optimal read size when reading vectors? The pixel format of the source image could be 8-bit, 16-bit, or even 32-bit. If I call read_image on a 16-bit source, it reads 16x4 = 64 bits back. Is this less efficient? If so, is there a way to read 8 items at a time to make it 128 bits?
Actually, reading 128 bits (a float4/int4 vector) is a way to make good use of global memory bandwidth when using buffer objects. With images, the interaction with global memory is more complicated: image reads are cached through the texture system (L2 and L1 caches, unlike buffer access, which uses only the L1 cache). Moreover, images are stored in a special layout (Z-order) on the GPU, so other access patterns can also improve performance.
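To make the Z-order idea more concrete, here is a minimal C sketch of a generic Morton (Z-order) index, where the bits of x and y are interleaved so that pixels close together in 2D end up close together in memory. The function names spread_bits16 and morton2d are made up for illustration, and the actual tiling used by the GPU is not documented, so treat this as the general concept only, not the real hardware layout.

```c
#include <stdint.h>

/* Insert a zero bit between each of the lower 16 bits of v. */
static uint32_t spread_bits16(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Generic Morton index: x bits go to even positions, y bits to odd ones.
   Valid for coordinates below 65536. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return spread_bits16(x) | (spread_bits16(y) << 1);
}
```

With this mapping the 2x2 block (0,0), (1,0), (0,1), (1,1) lands at indices 0, 1, 2, 3, whereas in a row-major layout the second row would be a full image width away.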
The AMD Performance Guide says that using float4 with images can deliver higher L1 cache bandwidth and therefore the best overall performance. But there is no information about the internal format of images in GPU memory (at least I cannot find any), so I cannot say how many bits are actually read from memory. Perhaps you can use normalized image formats such as CL_UNORM_INT8 or CL_UNORM_INT16 to access the image as float4 values with the best accuracy.
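For reference, this is roughly what a normalized format does to each channel when it is read as a float: the stored unsigned integer is divided by the largest representable value, giving a result in [0.0, 1.0]. The C sketch below only mirrors that arithmetic on the host; the helper names are hypothetical, and the device is allowed a small rounding error, so the results may not match bit for bit.

```c
#include <stdint.h>

/* Approximate channel conversion for CL_UNORM_INT8 when read as float. */
static float unorm8_to_float(uint8_t c)
{
    return (float)c / 255.0f;
}

/* Approximate channel conversion for CL_UNORM_INT16 when read as float. */
static float unorm16_to_float(uint16_t c)
{
    return (float)c / 65535.0f;
}
```

To get the original integer value back in the kernel you would multiply by 255.0f or 65535.0f again.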
There is no way to read 8 items at once if you're using images.
My advice is to use buffers and take full control of memory access; that lets you get the best performance with any image format and number of channels.
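As a rough illustration of the buffer approach, the C sketch below shows the addressing for fetching 8 consecutive 16-bit pixels (128 bits) from a linear, row-pitched buffer. In a kernel the copy would typically be a single vload8 on a ushort pointer. The name load8_u16, its parameters, and the assumption that the row pitch is given in pixels and is a multiple of 8 are all mine, not something from your code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PIXELS_PER_READ = 8 }; /* 8 x 16 bits = 128 bits */

/* Copy the gx-th group of 8 consecutive 16-bit pixels on row y into out.
   row_pitch is the row length in pixels (assumed to be a multiple of 8). */
static void load8_u16(const uint16_t *buf, size_t row_pitch,
                      size_t gx, size_t y, uint16_t out[PIXELS_PER_READ])
{
    const uint16_t *src = buf + y * row_pitch + gx * PIXELS_PER_READ;
    memcpy(out, src, PIXELS_PER_READ * sizeof(uint16_t));
}
```

Each work-item then handles 8 pixels, so the global size in x would be the image width divided by 8.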
When an image is passed from the host, the ImageFormat can be specified with a format such as CL_UNSIGNED_INT16, CL_HALF_FLOAT, etc. If it is read into an int4 or float4, I suppose it is effectively reading 64 bits at a time and widening them, since there seems to be no way to read into a short4 or half4. My question is: is this an efficient way of reading?
I cannot find any documentation on how images are cached. Can you explain what Z-order means here? Why do you think a normalized image format would provide better accuracy? The input values are already in range, so why normalize them?
For your last piece of advice, are you talking about using buffers instead of images? Since we are accessing a full HD image, 2D image access would be very convenient and more cache-efficient, I guess. Is 'CacheHit' a good performance counter for checking whether we read the image efficiently? Does the hit ratio include L2 cache hits too? Can you explain 'MemUnitBusy' and 'MemUnitStalled'?
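It seems to me that you want to read 128 bits of contiguous data at once. I think you can set the pixel format of the image to CL_UNSIGNED_INT32 (with a four-channel order such as CL_RGBA), then:
// read_imageui returns a uint4: four 32-bit words, each holding two 16-bit pixels
uint4 val = read_imageui(img, sampler, coord);
// upper 16 bits of each word
ushort4 v1 = convert_ushort4(val >> 16);
// lower 16 bits of each word
ushort4 v2 = convert_ushort4(val & 0xFFFFu);
This works only if the pixels are contiguous in memory.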
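One thing to watch with this trick is which pixel ends up in which half of each 32-bit word. The C sketch below models a word built from two consecutive 16-bit pixels, assuming a little-endian layout in which the first pixel sits in the low 16 bits; the helper names are made up for illustration. Under that assumption the low halves hold pixels 0, 2, 4, 6 and the high halves (val >> 16) hold pixels 1, 3, 5, 7, so the two vectors are interleaved rather than being the first four and the last four pixels.

```c
#include <stdint.h>

/* Model of two consecutive 16-bit pixels viewed as one 32-bit word,
   assuming little-endian layout: first pixel in the low 16 bits. */
static uint32_t pack2_u16(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}

/* Low 16 bits: the first pixel of the pair. */
static uint16_t low_half(uint32_t w)
{
    return (uint16_t)(w & 0xFFFFu);
}

/* High 16 bits: the second pixel of the pair. */
static uint16_t high_half(uint32_t w)
{
    return (uint16_t)(w >> 16);
}
```

If you need the pixels in their original order inside the kernel, you would have to re-interleave the two vectors after the split.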