Hi,
why does copying from global into local memory on an RV730 cause a barrier after each write to the local store? The compiler seems smart enough to remove the following explicit barrier entirely, but I would prefer not to see any additional barriers, just the single one I placed myself.
E.g.:
__kernel void main(__global float2 *dest,
                   __global const float *data)
{
    uint X = get_global_id(0);
    uint Y = get_global_id(1);
    __local float buffer[4096];
    const uint W = 1024;
    const uint H = 480;
    uint row = Y * W;
    uint myID = get_local_id(0);
    buffer[myID] = data[row + myID];
    /* buffer[W + myID] = data[row + W + myID];
    barrier(CLK_LOCAL_MEM_FENCE); */
    dest[row + X] = (float2)(buffer[myID + 1], buffer[myID + 4]);
}
On the 58xx and 57xx I observed the expected output: no barrier is placed unless explicitly requested.
Any technical reason for this behaviour?
Joerg
PS: I realized I can extend the barrier-less writes using vload/vstore, but after 4 or 8 elements a barrier is always inserted. At least this reduces the barrier count by 7/8 compared to individual floats.
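For reference, this is roughly what the vectorized variant looks like. This is a hedged sketch of the original kernel rewritten with vload4/vstore4; whether the four LDS writes collapse into a single wait_ack is the RV730 behaviour being discussed here, not a documented guarantee:

```c
__kernel void copy_vec(__global float2 *dest,
                       __global const float *data)
{
    uint X = get_global_id(0);
    uint Y = get_global_id(1);
    __local float buffer[4096];
    const uint W = 1024;
    uint row = Y * W;
    uint myID = get_local_id(0);

    float4 v = vload4(myID, data + row);  /* one 4-wide global read */
    vstore4(v, myID, buffer);             /* one 4-wide LDS write */
    barrier(CLK_LOCAL_MEM_FENCE);         /* the single explicit barrier */

    dest[row + X] = (float2)(buffer[myID + 1], buffer[myID + 4]);
}
```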
Yes, it is a wait_ack, which I assumed behaves like a barrier. Please correct me if my understanding here is wrong.
Nevertheless the question remains: why do vloads/vstores result in fewer wait_acks than moving the same pattern explicitly into local memory?
Thanks, that is indeed the pattern which shows up using the vload/vstore instructions.
But why is an ack needed at all after the last write in such a sequence?
I would like to have just a single barrier after loading a bunch of data (m * float_n, with m larger than get_local_size()) into local memory, without waiting after every n elements copied in a loop.
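The intended pattern would be something like the following sketch. M, N, and the addressing via get_group_id are illustrative placeholders, not taken from the actual code; the point is the strided copy loop followed by exactly one barrier:

```c
#define M 4
#define N 1024

__kernel void stage(__global const float *data,
                    __global float *dest)
{
    __local float buffer[M * N];
    uint base = get_group_id(1) * M * N;
    uint lsz  = get_local_size(0);

    /* Each work-item copies every lsz-th element of the M*N block. */
    for (uint i = get_local_id(0); i < M * N; i += lsz)
        buffer[i] = data[base + i];   /* ideally no per-iteration wait_ack */

    barrier(CLK_LOCAL_MEM_FENCE);     /* the only synchronisation wanted */

    /* ...consume buffer... */
    dest[base + get_local_id(0)] = buffer[get_local_id(0)];
}
```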
Is this a technical restriction of my HW (RV730) ?
Well, there is a loop around the ACK... I am fetching 16k using 64 (or 128) threads, resulting in: loop { read, read, read, read, write, write, write, write, wait_ack }. Using vload16 allows even more reads/writes followed by only a single wait_ack in the loop body, but because the read pattern is not spread nicely among all participating threads, the overall performance is much lower.
Joerg