Why does copying from global into local memory on an RV730 cause a barrier to be inserted after each write to the local store? The compiler seems smart enough to remove the explicit barrier in the kernel below entirely, but I would prefer to see no additional barriers, just the one I placed myself.
__kernel void main(__global float2 *dest,
                   __global const float *data)
{
    uint X = get_global_id(0);
    uint Y = get_global_id(1);
    __local float buffer[2048];   /* size assumed: 2*W */
    const uint W = 1024;
    const uint H = 480;
    uint row = Y * W;
    uint myID = get_local_id(0);
    buffer[myID] = data[row + myID];
    /*buffer[W+myID] = data[row+W+myID];*/
    barrier(CLK_LOCAL_MEM_FENCE); /* the explicit barrier (flag assumed) */
    dest[row + X] = (float2)(buffer[myID+1], buffer[myID+4]);
}
On the 58xx and 57xx I observed the expected output: no barrier is placed unless explicitly requested.
Any technical reason for this behaviour?
PS: I realized I can extend the barrier-less writes by using vload/vstore, but a barrier is still always inserted after 4 or 8 elements. At least this reduces the barrier count to 7/8 compared to individual floats.
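For anyone wanting to try the vload/vstore variant, here is a rough sketch of what I mean, not the exact kernel I ran. The kernel name copy_row_vec4, the TILE constant, and the launch geometry (1D, local work size of TILE/4, global size a multiple of it) are assumptions made up for illustration. Whether the RV730 compiler then emits fewer barriers has to be checked in the generated ISA.

```c
/* Sketch only: copy_row_vec4 and TILE are hypothetical names.
   Assumes a 1D launch with local work size TILE/4. */
#define TILE 256  /* floats staged in local memory per work-group (assumed) */

__kernel void copy_row_vec4(__global float *dest,
                            __global const float *data)
{
    __local float buffer[TILE];

    const uint lid  = get_local_id(0);          /* 0 .. TILE/4 - 1 */
    const uint base = get_group_id(0) * TILE;   /* first float of this group's tile */

    /* One 4-wide global load and one 4-wide local store per work-item,
       instead of four scalar writes to the local store. */
    float4 v = vload4(lid, data + base);        /* reads data[base + 4*lid .. + 3] */
    vstore4(v, lid, buffer);                    /* writes buffer[4*lid .. 4*lid + 3] */

    /* The one barrier that is actually required before work-items
       read values written by other work-items. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Read the four floats written by the neighbouring work-item. */
    const uint n = (lid + 1) % (TILE / 4);
    float4 w = vload4(n, buffer);
    vstore4(w, lid, dest + base);
}
```

Swapping float4/vload4/vstore4 for float8/vload8/vstore8 (with a local work size of TILE/8) gives the 8-element version.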