
    GPU vs. CPU execution, different barrier needs. Why?

    eklund.n

       

      Hi.

      I've come across a case where execution differs between GPU and CPU. I believe it is the x86 compiler that isn't respecting the implicit work-item barriers.

      The work group has two __local uchar4 buffers, buff0[4] and buff1[4]. Each work item works on one uchar4, buff0[local_id] or buff1[local_id]. At one point it uses values outside 'local_id', hence the need for two local buffers, i.e. buff1[local_id] = buff0[other_id].

      When using only values at 'local_id' I just update the same local buffer, i.e. buff0[local_id] = swizzle(buff0[local_id]). This works and the correct value is returned. However, if I switch local buffers while still using only values from 'local_id', I get the wrong value returned.

      Look at the code below. I think the x86 compiler gets ahead of itself. It works as planned on the GPU. Is this correct behavior?



      // swizzle() is defined in a later post; nbr_of_steps comes from local_id
      uchar4 working(__local uchar4 *buff0, __local uchar4 *buff1, uchar4 in)
      {
          uint local_id = get_local_id(0);
          uint nbr_of_steps = local_id % 4;
          /* working code */
          buff0[local_id] = buff0[local_id] * 4;
          buff0[local_id] = swizzle(buff0[local_id], nbr_of_steps);
          return buff0[local_id];
      }

      uchar4 NON_working(__local uchar4 *buff0, __local uchar4 *buff1, uchar4 in)
      {
          uint local_id = get_local_id(0);
          uint nbr_of_steps = local_id % 4;
          /* working code */
          buff0[local_id] = buff0[local_id] * 4;
          // here I change to buff1 just to test the compiler
          buff1[local_id] = swizzle(buff0[local_id], nbr_of_steps);
          return buff1[local_id]; // right result for 'local_id' = 0, wrong for 'local_id' = 1 to 3
      }

      __kernel void foo(__global uchar4 *out, __global uchar4 *in) // in must be uchar4* to match working()
      {
          uint global_id = get_global_id(0);
          __local uchar4 buff0[4], buff1[4];
          out[global_id] = working(buff0, buff1, in[global_id]);
      }

        • nou

          OpenCL doesn't define implicit synchronization within a work-group.
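
          If one work item reads another's __local data, you need an explicit barrier between the write and the read. A minimal sketch (kernel name and the size of 4 are made up for illustration; barrier() is the standard OpenCL built-in):

          __kernel void exchange(__global uchar4 *out)
          {
              uint lid = get_local_id(0);
              __local uchar4 buff[4];

              buff[lid] = (uchar4)((uchar)lid); // each work item writes its own slot

              barrier(CLK_LOCAL_MEM_FENCE);     // make all writes visible to the whole group

              uint other = (lid + 1) % 4;       // now it is safe to read a neighbour's slot
              out[get_global_id(0)] = buff[other];
          }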

            • eklund.n

              I know. That's why I wrote "work item barriers".

              The thing is this: since every work item updates 4 values (one uchar4), I expect all 4 values to be updated before that work item uses them in the next line of code. No other work item touches these buffer elements.

              Even though the CPU doesn't have the VLIW execution capability of the GPU, I expect the following code to work the same on CPU and GPU.

              uchar4 foo, bar;
              foo = (uchar4)(0x34, 0x29, 0xd3, 0x1b);
              // if the CPU can't assign all these values at once, add an implicit barrier before the next line
              bar = foo.zwxy; // do not execute this before all vector components have been assigned
              return bar;

                • redditisgreat

                  What are your workgroup size and global work size, and how many CPU cores do you use?

                    • eklund.n

                      workgroup size = 64
                      global work size = 64*n (n=1,2,..)
                      4 CPU cores with HT (8 viewable in system monitor)

                      So in the real app the local buffers are __local uchar4 buff0[64] and buff1[64].

                        • redditisgreat

                          Now that we have established that you didn't mess up the buffer sizes, I think we should look at the "working code", especially the part where you use the "other index".

                          Does that mean you get the wrong result for local_id = 1..WORK_GROUP_SIZE-1?

                            • eklund.n

                              Edit: changed the switch cases and buff0.x to buff0[local_id].x.

                              OK. Each work item writes only to index local_id in buff0[64] or buff1[64]. Every group of 4 work items occasionally reads each other's uchar4 values. The results therefore follow a period of 4: local_id = 0, 4, 8, ... give the right result, all others give the wrong result.

                              Added the code I actually use.

                              The strange thing is that it is not the swizzle line (buff1[local_id] = swizzle(...)) that produces the wrong result. It is the line above it, but only if the next line switches local buffers.

                              /* new_byte points to a lookup table of 256 uchar values in
                                 __global memory (table contents omitted here) */

                              uchar4 swizzle(uchar4 c, uint steps)
                              {
                                  uchar4 s;
                                  switch (steps) {
                                      case 0: s = c.xyzw; break;
                                      case 1: s = c.yzwx; break;
                                      case 2: s = c.zwxy; break;
                                      case 3: s = c.wxyz; break;
                                      default: s = c; break; // steps is always 0..3; this just keeps s initialized
                                  }
                                  return s;
                              }

                              uchar4 working(__global uchar *new_byte, __local uchar4 *buff0,
                                             __local uchar4 *buff1, uchar4 in)
                              {
                                  uint local_id = get_local_id(0);
                                  uint nbr_of_steps = local_id % 4;
                                  /* working code */
                                  // here buff0[local_id] always has the right values
                                  buff0[local_id] = (uchar4)(new_byte[buff0[local_id].x],
                                                             new_byte[buff0[local_id].y],
                                                             new_byte[buff0[local_id].z],
                                                             new_byte[buff0[local_id].w]);
                                  // still right values in buff0[local_id]
                                  buff0[local_id] = swizzle(buff0[local_id], nbr_of_steps);
                                  return buff0[local_id]; // returns the correct value
                              }

                              uchar4 NON_working(__global uchar *new_byte, __local uchar4 *buff0,
                                                 __local uchar4 *buff1, uchar4 in)
                              {
                                  uint local_id = get_local_id(0);
                                  uint nbr_of_steps = local_id % 4;
                                  /* working code */
                                  // here buff0[local_id] always has the right values
                                  buff0[local_id] = (uchar4)(new_byte[buff0[local_id].x],
                                                             new_byte[buff0[local_id].y],
                                                             new_byte[buff0[local_id].z],
                                                             new_byte[buff0[local_id].w]);
                                  // now the values in buff0[local_id] are wrong!! just by changing to buff1 below
                                  buff1[local_id] = swizzle(buff0[local_id], nbr_of_steps); // still calculates right, just on wrong input
                                  return buff1[local_id]; // returns the wrong value
                              }
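
                              For completeness, here is the workaround I would try (my own guess, not a confirmed fix): put an explicit fence between the table lookup and the buffer switch so the compiler can't reorder this work item's local accesses. The function name below is made up; mem_fence() is a standard OpenCL built-in.

                              uchar4 NON_working_fenced(__global uchar *new_byte, __local uchar4 *buff0,
                                                        __local uchar4 *buff1, uchar4 in)
                              {
                                  uint local_id = get_local_id(0);
                                  uint nbr_of_steps = local_id % 4;

                                  buff0[local_id] = (uchar4)(new_byte[buff0[local_id].x],
                                                             new_byte[buff0[local_id].y],
                                                             new_byte[buff0[local_id].z],
                                                             new_byte[buff0[local_id].w]);

                                  // guess: keep this work item's local store ordered before the read below
                                  mem_fence(CLK_LOCAL_MEM_FENCE);

                                  buff1[local_id] = swizzle(buff0[local_id], nbr_of_steps);
                                  return buff1[local_id];
                              }

                              For the parts of the "working code" where work items read each other's elements (buff1[local_id] = buff0[other_id]), a full barrier(CLK_LOCAL_MEM_FENCE) would be needed instead, since mem_fence only orders a single work item's own memory accesses.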