
josopait
Journeyman III

Wide memory access

Hello,

is it possible to have wide memory access without declaring the buffer as float4? Does anybody know?

I have a kernel similar to this:

__kernel void test(__global float* uav0)
{
  int r = get_global_id(0);
  // four consecutive scalar stores starting at element r
  uav0[r] = 0;
  uav0[r+1] = 0;
  uav0[r+2] = 0;
  uav0[r+3] = 0;
}

The assembler code generated from this doesn't look very optimal. For each memory access, it has one tbuffer_store_format_x instruction. I could work around this by declaring the buffer float4, but I don't want to do that. From what I understand, the program would be considerably faster if the memory was written with one single tbuffer_store_format_xyzw instruction, even if the index is not guaranteed to be aligned. Is there a way to enforce this? I am surprised that the compiler is not able to optimize this on its own.

Thanks for any help

Ingo Josopait

binying
Challenger

Set the values in the buffer beforehand, since they are constants?...

Use float2? ...

Bdot
Adept III

Would it help if every thread wrote to different addresses instead of overlapping ones? For example:

r = get_global_id(0)*4;

I have similar code that loads 4 single uints from global memory, and the compiler packs that nicely.

You can also use vstore to explicitly write several elements at once, e.g.

vstore4( (float4)0, r, uav0);

(Though I'm not certain whether you can use a constant for the first parameter; give it a try.)

0 Likes

Thanks Bdot!

I meant non-overlapping memory writes. My mistake.

With

r = get_global_id(0)*4;

the compiler does actually optimize the code nicely. However, if the address is more complex than in this simple test case and the compiler cannot infer that the access is non-overlapping, then it won't pack the memory access anymore. So I think I will use vstore4 instead, which seems to do what I want.
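
One detail worth checking when switching to vstore4: the OpenCL built-in writes to p + offset * 4, so its offset counts whole 4-element vectors, not single floats. With r = get_global_id(0)*4 as the element index, the call would presumably be written as vstore4((float4)(0.0f), 0, uav0 + r), or equivalently with get_global_id(0) as the offset. The target address only needs float (4-byte) alignment. The following is a plain C model of that addressing, not kernel code; vstore4_model and work_item_model are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-side model of OpenCL's vstore4 for float data (hypothetical helper,
 * not part of any OpenCL header). Like the built-in, it writes four
 * consecutive floats starting at p + offset * 4. */
static void vstore4_model(const float data[4], size_t offset, float *p)
{
    assert(p != NULL);
    memcpy(p + offset * 4, data, 4 * sizeof(float));
}

/* Emulates one work-item of the non-overlapping kernel: work-item gid
 * zeroes elements 4*gid .. 4*gid+3. The element index r is folded into
 * the pointer so that the vstore4 offset can stay 0. */
static void work_item_model(size_t gid, float *uav0)
{
    const float zeros[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t r = gid * 4;                 /* element index, as in the thread */
    vstore4_model(zeros, 0, uav0 + r);  /* same address as offset = gid, p = uav0 */
}
```

Passing r itself as the offset would instead land at element 4*r, which is probably not what is intended when r is already an element index.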

0 Likes

Maybe storing/loading a float4 requires an aligned address.

settle
Challenger

Have you considered using async_work_group_copy()? I know it's for copying between local and global memory, so you may need to create a temporary local buffer, but it's worth a shot to see if the compiler optimizes better with it.
