eklund.n

GPU vs. CPU execution, different barrier needs. Why?

Discussion created by eklund.n on Oct 15, 2010
Latest reply on Oct 18, 2010 by eklund.n

 

Hi.

I've come across a case where execution between GPU and CPU differ. I believe that it is the x86 compiler that isn't respecting the implicit work item barriers.

The work group has 2 __local uchar4, buff0[4] and buff1[4]. Each work item does work on 1 uchar4, buff0[local_id] or buff1[local_id]. At one time it uses values outside 'local_id', therefore the need of 2 local buffers, ie. buff1[local_id] = buff0[other_id].

When only using values at 'local_id' I just update the same local buffer, ie. buff0[local_id] = swizzle(buff0[local_id]). This works and the correct value is returned. However, if I change local buffer when only using values from 'local_id' i get the wrong value returned.

Look in the code below. I think it is the x86 compiler that gets ahead of it self. It works as planned on GPU. Is this a correct behavior?



uchar4 working(__local uchar4 *buff0, __local uchar4 *buff1, uchar4 in) { uint local_id = get_local_id(0); /* working code */ buff0[local_id] = buff0[local_id]*4; buff0[local_id] = swizzle(buff0[local_id], nbr_of_steps); return buff0[local_id]; } uchar4 NON_working(__local uchar4 *buff0, __local uchar4 *buff1, uchar4 in) { uint local_id = get_local_id(0); /* working code */ buff0[local_id] = buff0[local_id]*4; //here I change to buff1 just to test the compiler buff1[local_id] = swizzle(buff0[local_id], nbr_of_steps); return buff1[local_id]; //gives right result from 'local_id' = 0, wrong result from 'local_id' = 1 to 3 } __kernel foo(__global uchar4 *out, __global uchar *in) { global_id = get_global_id(0); __local uchar4 buff0[4], buff1[4]; out[global_id] = working(buff0, buff1, in[global_id]); }

Outcomes