Is it possible to get wide memory accesses without declaring the buffer as float4? Does anybody know?
I have a kernel similar to this:
__kernel void test(__global float* uav0)
{
    int r = get_global_id(0);
    // four consecutive scalar stores starting at index r
    uav0[r] = 0;
    uav0[r+1] = 0;
    uav0[r+2] = 0;
    uav0[r+3] = 0;
}
The assembly generated from this doesn't look optimal: each memory access gets its own tbuffer_store_format_x instruction. I could work around this by declaring the buffer as float4, but I don't want to do that. From what I understand, the program would be considerably faster if the memory were written with a single tbuffer_store_format_xyzw instruction, even if the index is not guaranteed to be aligned. Is there a way to enforce this? I am surprised that the compiler does not perform this optimization on its own.
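For reference, the closest thing I know of in the language itself is the vstore4 built-in, which writes a float4 through a plain float pointer and only requires element (4-byte) alignment. Below is a minimal sketch of the same stores expressed that way. The kernel name test_vstore is just a placeholder, and I have not verified whether the compiler actually lowers this to a single tbuffer_store_format_xyzw.

```c
// Sketch only: the four scalar stores expressed as one vector store.
// "test_vstore" is a hypothetical name, not part of my real code.
__kernel void test_vstore(__global float* uav0)
{
    int r = get_global_id(0);
    // vstore4(data, offset, p) writes 4 floats starting at p + offset * 4.
    // p only needs float alignment, not float4 alignment.
    vstore4((float4)(0.0f), 0, uav0 + r);
}
```

The buffer stays declared as __global float*, so the host side would not need to change. Whether this is any faster would have to be checked in the generated ISA.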
Thanks for any help