5 Replies Latest reply on Feb 14, 2012 4:27 PM by notzed

    Numerical computations indeterministic?

    pwvdendr

      Do all shader cores of a GPU handle numerical computations in exactly the same way? I've hit upon a strange phenomenon where I execute the same kernel several times on the same deterministic input, but the output varies by a slight fraction. This happens only on my GPU (HD5450), not when I executing on any of my (Intel, multicore) CPUs.

       

      I've been thinking for a while that it should be some concurrency issue, but the fact that it does not seem to occur on CPU at all (even multicore) and the fact that after staring at this tiny kernel for many days I still don't see it, makes me wonder if there is no different explaining after all.

       

      Asimple NetBeans project which triggers the instability on GPU can be found in this link: http://dl.dropbox.com/u/3060536/WeirdOpenCLbug.rar (the kernel is in ./src/bug.cl).

        • Re: Numerical computations indeterministic?
          notzed

          Yes - of course every thread gets identical ALU results.  There does seem to be some non-determinism wrt things like reading/writing the same local address though.

           

          Its likely to be something to do with the barriers and local memory since cpu's dont have either they emulate it using completely serial code.  And/or reading uninitialised local memory which is very likely to change between invocations on a gpu but possibly unchanging on a cpu.

           

          After about 30 minutes staring at this and getting lost in all those lookup tables, (i'm bored and avoiding housework) I think this is your problem:

           

          for (int step=1, div=1; step<lengte; step<<=1, div++) {

                      barrier(CLK_LOCAL_MEM_FENCE);

                      if (i<lengte-step && i==i<<div>>div) {

                          LP[i]+=LP[i+step];

                      }

                  }

           

          This will be nondeterministic on a gpu, as you're reading and writing the same addresses at the same time.  You need to go via an intermediary.

           

          Perhaps something like this:

          float lp = LP[i];

          for (int step=1, div=1; step<lengte; step<<=1, div++) {

                      barrier(CLK_LOCAL_MEM_FENCE);

                      if (i+step<lengte && i==i<<div>>div) {

                          lp += LP[i+step];

                      }

                       barrier(CLK_LOCAL_MEM_FENCE);

          // don't need to test range, since at worst this is nop.

                       LP[i] = lp;

                      }

                  }

           

          (ARGH why is cut and paste so borked in this forum editor)

           

          Infact, since that is the only place you access anything other than LP[i] you could probably kill LP entirely, use lp everywhere else, and then use tempfloat[] in this loop - local memory is precious.

           

          Another alternative is to double-buffer, i.e. always do two steps, the first writes from LP to tempfloat, the second from tempfloat to LP, which would halve the number of barriers required but require more LS.

           

          e.g. (not checked)

          float lp = LP[i];

          for (int step=1, div=1; step<lengte; step<<=1, div++) {

          // read LP write tmpfloat

                      barrier(CLK_LOCAL_MEM_FENCE);

                      if (i<lengte-step && i==i<<div>>div) {

                          lp+= LP[i+step];

                      }

          // need this for all threads since we need to copy over lp[i]

                      tmpfloat[i] = lp;

          // read tmpfloat write LP

                     step <<= 1;

                     div++;

                      barrier(CLK_LOCAL_MEM_FENCE);

          // note that we already have a valid lp[i] in lp

          // need to do an 'inner' loop test

                    if (step < lengte) {

                      if (i<lengte-step && i==i<<div>>div) {

                          lp += tmpfloat[i+step];

                      }

                      LP[i] = lp;

          // note that LP[i] is already in lp, for the next loop

                   }

                  }

                  }

           

          (sorry for the mess, but this editor is impossible, i'd much rather a completely simple HTML 1.0 'text area' widget)

           

          Other observations:

           

          You're using the global index to access local arrays, which is not a good idea even if you know the global and local worksizes are the same (incase you change your mind later).

           

          Although it seems just a convention, addressing by X rather than Y is also how the hardware does it, so it's probably a good idea to do that too, in the event you want to try solving 2d problems later.  If nothing else it makes the code easier for others to follow.  Your odd indentation and custom opencl wrapper is also a bit hard to follow too.

           

          Also, the following wont work, you need to use isnan()

          If (diff == NAN)

          per 6.3.e of the opencl1.1 manual, any comparision involving nan is false (that's got me plenty of times).

           

          error[] is only accessed by thread0, so you could just use plain variables.

           

          You don't need to check lp < 'lengte' or 'hoogte' either since you invoke the kernel with that size (unless in future you decided to change that for larger problems where the last workgroup might not be full.

          1 of 1 people found this helpful
            • Re: Numerical computations indeterministic?
              pwvdendr

              Sorry if my indentation is weird, I do Java programming in Netbeans and I tend to follow Netbeans' indentation rules.

              notzed wrote:

               

              I think this is your problem:

               

              for (int step=1, div=1; step<lengte; step<<=1, div++) {

                          barrier(CLK_LOCAL_MEM_FENCE);

                          if (i<lengte-step && i==i<<div>>div) {

                              LP[i]+=LP[i+step];

                          }

                      }

               

              This will be nondeterministic on a gpu, as you're reading and writing the same addresses at the same time.  You need to go via an intermediary.

               

              Hmm... I don't see how I would be reading and writing the same address there. The "i==i<<div>>div" is just a more efficient way of saying i%(2*step)==0, so the read/write indices never coincide. For example if LP has length 16, it would:

              • wait for local barrier
              • concurrently for i=0,2,4,6,8,10,12,14 do LP[i]+=LP[i+1] (since div=1 and step=1)
              • wait for local barrier
              • concurrently for i=0,4,8,12 do LP[i]+=LP[i+2] (since div=2 and step=2)
              • wait for local barier
              • concurrently for i=0,8 do LP[i]+=LP[i+4] (since div=3 and step=4)
              • wait for local barrier
              • for i=0 do LP[i]+=LP[i+8] (since div=4 and step=8)

              which does not seem to be concurrent anywhere.

               

              In the end, all that is done by that loop is counting the number of 1's in LP. At the end of that loop, LP[0] contains the sum of all original elements in the array.

               

              Also, the following wont work, you need to use isnan()

              If (diff == NAN)

              per 6.3.e of the opencl1.1 manual, any comparision involving nan is false (that's got me plenty of times).

               

              Thanks for the hint! Strange though, I'm running OpenCL 1.1 but it does work here for sure. But ok, if the better practice is to use isnan(), then I'll do that.

               

              As for the optimizations: this is indeed not optimized code. I just found that I could reproduce the problem in this small snippet of code, so I posted this. Most things you mention have their reasons (but they're cut away for simplicity of course). The NaN thing works here for me in OpenCL 1.1, but I'll change it to isnan() then. Thanks for the feedback!

                • Re: Numerical computations indeterministic?
                  notzed

                  Ahh, for some stupid reason I thought the<<div>> stuff was a paste-o.  Too early in the morning I guess.

                   

                  Shouldn't it be (i>>div) <<div if that's what you're trying to do?  i..e clear out the lower bits, not just do a dance up and down the bit indices.

                   

                  i.e. (i << div) >> div will always == i if highest_set_bit(i) < (32-div), wont it?

                   

                  If it is always true, the parallel prefix sum will still be correct on a cpu, and would work on an amd gpu if you had <=64 threads (which you should try to achieve if possible as the compiler would then remove the barriers entirely).