Hello,
Is it possible to get wide memory accesses without declaring the buffer as float4? Does anybody know?
I have a kernel similar to this:
__kernel void test(__global float* uav0)
{
int r = get_global_id(0);
uav0[r] = 0;
uav0[r+1] = 0;
uav0[r+2] = 0;
uav0[r+3] = 0;
}
The assembler code generated from this doesn't look optimal: each memory access gets its own tbuffer_store_format_x instruction. I could work around this by declaring the buffer as float4, but I don't want to do that. From what I understand, the program would be considerably faster if the memory were written with a single tbuffer_store_format_xyzw instruction, even if the index is not guaranteed to be aligned. Is there a way to enforce this? I am surprised that the compiler is not able to optimize this on its own.
Thanks for any help
Ingo Josopait
Would it help if every thread wrote different addresses instead of overlapping? I.e.
r = get_global_id(0)*4;
I have similar code for loading 4 single uints from global memory, and that is nicely packed by the compiler.
You can also use vstore to explicitly write a whole vector in one call, e.g.
vstore4( (float4)0, r, uav0);
(Though I'm not certain if you can use a constant for the first parameter ... - give it a try)
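One detail worth checking before relying on vstore4 is its offset argument. As far as the OpenCL C specification describes it, vstoren(data, offset, p) writes to p + offset * n, so the offset counts whole vectors rather than single floats. The following is a minimal host-side sketch in plain C that only emulates this addressing rule. The helper emulated_vstore4 and the two wrapper functions are hypothetical names made up for illustration and are not part of any OpenCL API. The kernel lines they stand for appear in the comments.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for OpenCL's vstore4 on a float buffer.
   It mirrors the addressing rule from the OpenCL C specification:
   vstore4(data, offset, p) writes four floats starting at p + offset * 4. */
static void emulated_vstore4(const float data[4], size_t offset, float *p)
{
    float *dst = p + offset * 4;
    for (int i = 0; i < 4; ++i)
        dst[i] = data[i];
}

/* Kernel form A:  vstore4((float4)(0.0f), gid, uav0);
   Work-item gid covers floats [4*gid, 4*gid + 3], so neighbouring
   work-items never overlap. */
static void store_by_vector_index(float *uav0, size_t gid)
{
    const float zero[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    emulated_vstore4(zero, gid, uav0);
}

/* Kernel form B:  vstore4((float4)(0.0f), 0, uav0 + r);
   Starts at an arbitrary float index r by folding r into the pointer
   and passing an offset of 0. */
static void store_at_element(float *uav0, size_t r)
{
    const float zero[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    emulated_vstore4(zero, 0, uav0 + r);
}
```

So if r is already a float index (for example get_global_id(0)*4), passing it directly as the offset would land at float 4*r, and form B is probably what is wanted. The specification also only asks for the computed address to be aligned to the element type (32 bits for float), which suggests that vstore4 is usable where a float4 pointer would not be. Whether the compiler then emits a single wide store is something to confirm in the generated ISA.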
Set the values in the buffer beforehand, since they are constants? ...
Use float2? ...
Thanks Bdot!
I meant non-overlapping memory writes. My mistake.
With
r = get_global_id(0)*4;
the compiler does actually optimize the code nicely. However, if the address is more complex than in this simple test case and the compiler cannot infer that the access is non-overlapping, then it won't pack the memory access anymore. So I think I will use vstore4 instead, which seems to do what I want.
Maybe storing/loading a float4 requires an aligned address.
Have you considered using async_work_group_copy()? I know it's for copying between local and global addresses, so maybe you need to create some local temp variable, but it's worth a shot to see if the compiler will optimize better with it.