Hi,
why does copying from global into local memory on an RV730 cause a barrier after each write to the local store? The compiler seems smart enough to remove the following explicit barrier entirely, but I would prefer not to see any additional barriers, just the single one I placed myself.
E.g.:
__kernel void main(__global float2 *dest,
                   __global const float *data)
{
    uint X = get_global_id(0);
    uint Y = get_global_id(1);
    __local float buffer[4096];
    const uint W = 1024;
    const uint H = 480;
    uint row = Y * W;
    uint myID = get_local_id(0);
    buffer[myID] = data[row + myID];
    /* buffer[W + myID] = data[row + W + myID];
    barrier(CLK_LOCAL_MEM_FENCE); */
    dest[row + X] = (float2)(buffer[myID + 1], buffer[myID + 4]);
}
On the 58xx and 57xx I observed the expected output: no barrier is placed unless explicitly requested.
Any technical reason for this behaviour?
Joerg
PS: I realized I can extend the barrier-less writes using vload/vstore, but after 4 or 8 elements a barrier is always inserted. At least this reduces the barrier count by 7/8 compared to individual floats.
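For reference, this is roughly what the vectorized variant looks like. This is a hedged sketch of the original kernel rewritten with vload4/vstore4; whether the four LDS writes collapse into a single wait_ack is the RV730 behaviour being discussed here, not a documented guarantee:

```c
__kernel void copy_vec(__global float2 *dest,
                       __global const float *data)
{
    uint X = get_global_id(0);
    uint Y = get_global_id(1);
    __local float buffer[4096];
    const uint W = 1024;
    uint row = Y * W;
    uint myID = get_local_id(0);

    float4 v = vload4(myID, data + row);  /* one 4-wide global read */
    vstore4(v, myID, buffer);             /* one 4-wide LDS write */
    barrier(CLK_LOCAL_MEM_FENCE);         /* the single explicit barrier */

    dest[row + X] = (float2)(buffer[myID + 1], buffer[myID + 4]);
}
```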
Yes, it is a wait_ack, which I assumed behaves like a barrier. Please correct me if my understanding here is wrong.
Nevertheless the question remains: why do vloads/vstores result in fewer wait_acks than moving the same pattern explicitly into local memory?
Thanks, that is indeed the pattern which shows up using the vload/vstore instructions.
But why is an ack needed at all after the last write in such a sequence?
I would like to have just a single barrier after loading a bunch of data (m * float_n, with m larger than get_local_size()) into local memory, without waiting after every n elements copied in a loop.
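The intended pattern would be something like the following sketch. M, N, and the addressing via get_group_id are illustrative placeholders, not taken from the actual code; the point is the strided copy loop followed by exactly one barrier:

```c
#define M 4
#define N 1024

__kernel void stage(__global const float *data,
                    __global float *dest)
{
    __local float buffer[M * N];
    uint base = get_group_id(1) * M * N;
    uint lsz  = get_local_size(0);

    /* Each work-item copies every lsz-th element of the M*N block. */
    for (uint i = get_local_id(0); i < M * N; i += lsz)
        buffer[i] = data[base + i];   /* ideally no per-iteration wait_ack */

    barrier(CLK_LOCAL_MEM_FENCE);     /* the only synchronisation wanted */

    /* ...consume buffer... */
    dest[base + get_local_id(0)] = buffer[get_local_id(0)];
}
```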
Is this a technical restriction of my HW (RV730) ?
Well, there is a loop around the ACK... I am fetching 16k using 64 (or 128) threads, resulting in: loop { read, read, read, read, write, write, write, write, wait_ack }. Using vload16 allows even more reads/writes followed by only a single wait_ack in the loop body, but because the read pattern is not spread nicely among all participating threads, the overall performance is much lower.
Joerg