Archives Discussions

josopait · ‎11-28-2012

Hello,

Is it possible to get wide memory accesses without declaring the buffer as float4? Does anybody know?

I have a kernel similar to this:

__kernel void test(__global float* uav0)
{
    int r = get_global_id(0);
    uav0[r] = 0;
    uav0[r+1] = 0;
    uav0[r+2] = 0;
    uav0[r+3] = 0;
}

The assembly code generated from this doesn't look optimal: it issues one tbuffer_store_format_x instruction per memory access. I could work around this by declaring the buffer as float4, but I don't want to do that. From what I understand, the program would be considerably faster if the memory were written with a single tbuffer_store_format_xyzw instruction, even if the index is not guaranteed to be aligned. Is there a way to enforce this? I am surprised that the compiler is not able to optimize this on its own.

Thanks for any help

Ingo Josopait

binying · ‎11-28-2012

Set the values beforehand in the buffer, since they are constants? ...

Use float2? ...

Bdot · ‎11-29-2012

Would it help if every thread wrote different addresses instead of overlapping? I.e.

r = get_global_id(0)*4;

I have similar code for loading 4 single uints from global memory, and that is nicely packed by the compiler.

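You can also use vstore4 to explicitly write four elements with one call, e.g.

vstore4((float4)0, 0, uav0 + r);

(The offset argument of vstore4 counts in units of four elements, so the element index r goes into the pointer instead. I'm not certain whether you can use a constant for the first parameter ... give it a try.)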
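
For readers following along, here is a minimal host-side sketch in plain C of how the OpenCL built-in vstore4(data, offset, p) addresses memory: it writes four consecutive elements starting at p + offset*4. The names vstore4_model and work_item_store are hypothetical stand-ins for illustration. This is not OpenCL code, and it says nothing about which instructions the compiler actually emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plain-C model of OpenCL's vstore4(data, offset, p):
   stores 4 consecutive floats starting at element p + offset * 4. */
static void vstore4_model(const float data[4], size_t offset, float *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < 4; ++i)
        p[offset * 4 + i] = data[i];
}

/* Models one work-item that owns elements [gid*4, gid*4 + 3]. */
static void work_item_store(size_t gid, float *buf, float value)
{
    const float v[4] = { value, value, value, value };
    size_t r = gid * 4;            /* element index, as in the thread */
    vstore4_model(v, 0, buf + r);  /* offset 0: the index lives in the pointer */
    /* Equivalent: vstore4_model(v, gid, buf). Passing r itself as the
       offset would instead target element r * 4. */
}
```

Calling work_item_store(1, buf, 0.0f) on a 16-element buffer touches only buf[4] through buf[7], which is the non-overlapping layout discussed above.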

josopait · ‎11-29-2012

Thanks Bdot!

I meant non-overlapping memory writes. My mistake.

With

r = get_global_id(0)*4;

the compiler does actually optimize the code nicely. However, if the address is more complex than in this simple test case and the compiler cannot infer that the access is non-overlapping, then it won't pack the memory access anymore. So I think I will use vstore4 instead, which seems to do what I want.

nou · ‎11-29-2012

Maybe storing/loading a float4 requires an aligned address.
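
To make the alignment point concrete: in OpenCL C, an access through a float4 pointer requires a 16-byte aligned address, whereas vstore4/vload4 on a float pointer only require the address to be aligned to the element type (4 bytes for float). The plain-C sketch below checks both conditions on the host. The helper names are hypothetical, and it assumes a platform where pointers convert to integers in the usual way and sizeof(float) is 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if ptr is a multiple of `alignment` bytes.
   Relies on the common (implementation-defined) pointer-to-integer mapping. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)ptr % alignment) == 0;
}

/* A float4-typed access needs alignment to 4 * sizeof(float) = 16 bytes. */
static int ok_for_float4_pointer(const float *p)
{
    return is_aligned(p, 4 * sizeof(float));
}

/* vstore4/vload4 on a float pointer only need float (element) alignment. */
static int ok_for_vstore4(const float *p)
{
    return is_aligned(p, sizeof(float));
}
```

With a 16-byte aligned buffer buf, the address buf + 1 passes ok_for_vstore4 but fails ok_for_float4_pointer, which is the unaligned case the original question asks about.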

settle · ‎11-29-2012

Have you considered using async_work_group_copy()? I know it's for copying between local and global addresses, so maybe you need to create some local temp variable, but it's worth a shot to see if the compiler will optimize better with it.

Wide memory access