cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

eklund_n
Journeyman III

GPU vs. CPU execution, different barrier needs. Why?

Hi.

I've come across a case where execution between GPU and CPU differ. I believe that it is the x86 compiler that isn't respecting the implicit work item barriers.

The work group has 2 __local uchar4, buff0[4] and buff1[4]. Each work item does work on 1 uchar4, buff0[local_id] or buff1[local_id]. At one time it uses values outside 'local_id', therefore the need of 2 local buffers, ie. buff1[local_id] = buff0[other_id].

When only using values at 'local_id' I just update the same local buffer, ie. buff0[local_id] = swizzle(buff0[local_id]). This works and the correct value is returned. However, if I change local buffer when only using values from 'local_id' i get the wrong value returned.

Look in the code below. I think it is the x86 compiler that gets ahead of it self. It works as planned on GPU. Is this a correct behavior?



uchar4 working(__local uchar4 *buff0, __local uchar4 *buff1, uchar4 in) { uint local_id = get_local_id(0); /* working code */ buff0[local_id] = buff0[local_id]*4; buff0[local_id] = swizzle(buff0[local_id], nbr_of_steps); return buff0[local_id]; } uchar4 NON_working(__local uchar4 *buff0, __local uchar4 *buff1, uchar4 in) { uint local_id = get_local_id(0); /* working code */ buff0[local_id] = buff0[local_id]*4; //here I change to buff1 just to test the compiler buff1[local_id] = swizzle(buff0[local_id], nbr_of_steps); return buff1[local_id]; //gives right result from 'local_id' = 0, wrong result from 'local_id' = 1 to 3 } __kernel foo(__global uchar4 *out, __global uchar *in) { global_id = get_global_id(0); __local uchar4 buff0[4], buff1[4]; out[global_id] = working(buff0, buff1, in[global_id]); }

0 Likes
12 Replies
nou
Exemplar

OpenCL doesn't define implicit synchronization in workgroup.

0 Likes

I know. That's why I wrote "work item barriers".

The thing is this: Since every work item updates 4 values (uchar4), I expect all 4 values to be updated before the work item uses them in the next line of code. No other work item affects these buffer elements.

Even though the CPU doesn't have execution possibility of a VLIW I expect the following code to work the same on CPU and GPU.

uchar4 foo, bar; foo = (uchar4)(0x34,0x29,0xd3,0x1b); //if CPU can't assign all these values at once, add implicit barrier before next line bar = foo.zwxy;//do not execut this before all vector components have been assigned return bar;

0 Likes

What is your workgroup size and your global work size and how many CPU cores do you use?

0 Likes

workgroup size = 64
global work size = 64*n (n=1,2,..)
4 CPU cores with HT (8 viewable in system monitor)

so in real app the local buffers are __local uchar4 buff0[64], buff1[64]. 

0 Likes

Now that we have established that you didn't mess up te buffer sizes

I think we should look at the "working code", especially the part where you use the "other index".

Does it mean you get the wrong result for local_id=1..WORK_GROUP_SIZE-1  ??

0 Likes

edit. change the switch cases and buff0.x to buff0[local_id].x

OK. Each work item writes only to local_id in buff0[64] or buff1[64]. Every group of 4 at a few times reads each others uchar4's. Therefore the pattern follows a period of 4, ie. local_id = 0,4,8,.. gives right result all other wrong result.

Added the code I actually use.

The strange thing is that it is not the swizzle code line (buff1[local_id] = swizzle()) that gives wrong result. It is the line above, but only if I the next line changes local buffer.

__global uchar *new_byte = {256 uchar values} uchar4 swizzle(uchar4 c, uint steps) { uchar4 s; switch(steps) { case 0: s = c.xyzw; break; case 1: s = c.yzwx; break; case 2: s = c.zwxy; break; case 3: s = c.wxyz; break; default: break; } return s; } uchar4 working(__global uchar *byte_exchange, __local uchar4 *buff0, __local uchar4 *buff1, uchar4 in) { uint local_id = get_local_id(0); uint nbr_of_steps = local_id%4; /* working code */ //here buff0[local_id] always has the right values buff0[local_id] = (uchar4)(new_byte[buff0[local_id].x],new_byte[buff0[local_id].y],new_byte[buff0[local_id].z],new_byte[buff0[local_id].w]); //still right values in buff0[local_id] buff0[local_id] = swizzle(buff0[local_id], nbr_of_steps); return buff0[local_id]; //returns correct value } uchar4 NON_working(__global uchar *new_byte, __local uchar4 *buff0, __local uchar4 *buff1, uchar4 in) { uint local_id = get_local_id(0); uint nbr_of_steps = local_id%4; /* working code */ //here buff0[local_id] always has the right values buff0[local_id] = (uchar4)(new_byte[buff0[local_id].x],new_byte[buff0[local_id].y],new_byte[buff0[local_id].z],new_byte[buff0[local_id].w]); //now the values in buff0[local_id] are wrong!! just by changing to buff1 below buff1[local_id] = swizzle(buff0[local_id], nbr_of_steps); //still calculates right, just on wrong input return buff1[local_id]; //returns wrong value }

0 Likes

I have not figured it out yet.

But there are two things that don't make sense:

 

new_byte[buff0.x]   while buff0 is a pointer

and all your case expressions are "0" as in zero. But I suspect that are just copy & paste errors.

 

the periodicity of the error suggests it is connected to "nbr_of_steps" which jumps to default for 1,2,3 and your default branch does nothing useful. That might be it. Simple programming mistake, nothing to do with synchronization.

 

 

0 Likes

the new_byte does byte substitution. it takes the vector components (buff0.x etc.) as index in the new_byte array, returns the value stored there. not very self explanatory, i know.

as you said, the case expressions were copy-pasted in this forum, but all the time right in the source files. therefore not the reason. as it is now is correct.

thank you for pondering over this.

0 Likes

buff0.x is not a vector component

buff0[local_id].x ??

I'll be back on monday.

 

0 Likes

oops. of course it should be buff0[local_id].x, changed in the code 4 posts up. missed it when typing it in. but it has been right in the source code all the time.

now that we have all the typos corrected, maybe we can find what's causing it..

0 Likes

hi eklund.n,

The execution can depend on what values are stored in new_byte array.And how you are concluding the correct and incorrect values.Also post the host code.

 

Can you provide a complete test case.You can also send it to streamdeveloper@amd.com.

0 Likes

Complete test case have been sent to streamdeveloper@amd.com. Thanks.

0 Likes