
pwvdendr
Adept II

Numerical computations indeterministic?

Do all shader cores of a GPU handle numerical computations in exactly the same way? I've hit upon a strange phenomenon where I execute the same kernel several times on the same deterministic input, but the output varies by a slight fraction. This happens only on my GPU (HD5450), not when executing on any of my (Intel, multicore) CPUs.

I've been thinking for a while that it should be some concurrency issue, but the fact that it does not seem to occur on the CPU at all (even multicore), and the fact that after staring at this tiny kernel for many days I still don't see it, makes me wonder if there is a different explanation after all.

A simple NetBeans project which triggers the instability on GPU can be found at this link: http://dl.dropbox.com/u/3060536/WeirdOpenCLbug.rar (the kernel is in ./src/bug.cl).

notzed
Challenger

Yes - of course every thread gets identical ALU results.  There does seem to be some non-determinism wrt things like reading/writing the same local address though.

It's likely to be something to do with the barriers and local memory, since CPUs don't have either; they emulate them using completely serial code. And/or reading uninitialised local memory, which is very likely to change between invocations on a GPU but possibly unchanging on a CPU.

After about 30 minutes staring at this and getting lost in all those lookup tables (I'm bored and avoiding housework), I think this is your problem:

for (int step=1, div=1; step<lengte; step<<=1, div++) {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (i < lengte-step && i == i<<div>>div) {
        LP[i] += LP[i+step];
    }
}

This will be nondeterministic on a gpu, as you're reading and writing the same addresses at the same time.  You need to go via an intermediary.

Perhaps something like this:

float lp = LP[i];

for (int step=1, div=1; step<lengte; step<<=1, div++) {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (i+step < lengte && i == i<<div>>div) {
        lp += LP[i+step];
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // don't need to test range, since at worst this is a nop
    LP[i] = lp;
}

(ARGH why is cut and paste so borked in this forum editor)

In fact, since that is the only place you access anything other than LP[i], you could probably kill LP entirely, use lp everywhere else, and then use tempfloat[] in this loop - local memory is precious.
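For illustration, a minimal sketch of that variant (mine, not tested against your kernel): keep each work-item's running value in a private lp and use tempfloat[] as the only shared buffer. I've written the selection test in its plain modulo form here:

float lp = LP[i];                  // or wherever the value originally came from
for (int step=1; step<lengte; step<<=1) {
    tempfloat[i] = lp;             // publish this work-item's partial sum
    barrier(CLK_LOCAL_MEM_FENCE);
    if (i+step < lengte && (i % (2*step)) == 0) {
        lp += tempfloat[i+step];   // read only; never the address being written
    }
    barrier(CLK_LOCAL_MEM_FENCE);  // don't republish before everyone has read
}
// lp of work-item 0 now holds the total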

Another alternative is to double-buffer, i.e. always do two steps: the first reads LP and writes tempfloat, the second reads tempfloat and writes LP. That would halve the number of barriers required but needs more LS.

e.g. (not checked)

float lp = LP[i];

for (int step=1, div=1; step<lengte; step<<=1, div++) {
    // read LP, write tempfloat
    barrier(CLK_LOCAL_MEM_FENCE);
    if (i < lengte-step && i == i<<div>>div) {
        lp += LP[i+step];
    }
    // need this for all threads since we need to copy over lp
    tempfloat[i] = lp;

    // read tempfloat, write LP
    step <<= 1;
    div++;
    barrier(CLK_LOCAL_MEM_FENCE);
    // note that we already have a valid lp in lp
    // need to do an 'inner' loop test
    if (step < lengte) {
        if (i < lengte-step && i == i<<div>>div) {
            lp += tempfloat[i+step];
        }
        LP[i] = lp;
        // note that LP[i] is already in lp for the next loop
    }
}

(sorry for the mess, but this editor is impossible; I'd much rather have a completely simple HTML 1.0 'text area' widget)

Other observations:

You're using the global index to access local arrays, which is not a good idea even if you know the global and local work sizes are the same (in case you change your mind later).
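I.e. something like this (a sketch; 'input' stands in for whatever global buffer you load from):

int gi = get_global_id(0);   // index into __global buffers
int li = get_local_id(0);    // index into __local arrays
LP[li] = input[gi];          // stays correct even if the work sizes later diverge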

Although it seems like just a convention, addressing by X rather than Y is also how the hardware does it, so it's probably a good idea to do that too, in the event you want to try solving 2D problems later. If nothing else it makes the code easier for others to follow. Your odd indentation and custom OpenCL wrapper are also a bit hard to follow.
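For a 2D problem that would look like this (sketch; 'data' and 'width' are placeholder names):

int x = get_global_id(0);        // X is the fastest-varying dimension
int y = get_global_id(1);
float v = data[y * width + x];   // row-major: consecutive x hits consecutive addresses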

Also, the following won't work, you need to use isnan():

if (diff == NAN)

per 6.3.e of the OpenCL 1.1 manual, any comparison involving NaN is false (that's got me plenty of times).
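The corrected test (isnan() is the standard built-in; diff != diff works too, for the same reason):

if (isnan(diff)) {
    // handle the NaN case
}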

error[] is only accessed by thread 0, so you could just use plain variables.
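E.g. (sketch):

float err = 0.0f;               // private variable instead of a __local error[]
if (get_local_id(0) == 0) {
    // ... accumulate into err, write it out to a global buffer once at the end
}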

You don't need the checks against 'lengte' or 'hoogte' either, since you invoke the kernel with exactly that size (unless in future you decide to change that for larger problems where the last workgroup might not be full).
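If you do go that way later, note that with barriers in the kernel you can't just return early from out-of-range work-items; guard the accesses instead (sketch):

bool inside = get_global_id(0) < lengte;
// every work-item, in range or not, must still reach each barrier
barrier(CLK_LOCAL_MEM_FENCE);
if (inside) {
    // only in-range work-items touch memory
}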

Sorry if my indentation is weird; I do Java programming in NetBeans and I tend to follow NetBeans' indentation rules.

notzed wrote:

I think this is your problem:

for (int step=1, div=1; step<lengte; step<<=1, div++) {
    barrier(CLK_LOCAL_MEM_FENCE);
    if (i < lengte-step && i == i<<div>>div) {
        LP[i] += LP[i+step];
    }
}

This will be nondeterministic on a gpu, as you're reading and writing the same addresses at the same time.  You need to go via an intermediary.

Hmm... I don't see how I would be reading and writing the same address there. The "i==i<<div>>div" is just a more efficient way of saying i%(2*step)==0, so the read/write indices never coincide. For example if LP has length 16, it would:

  • wait for local barrier
  • concurrently for i=0,2,4,6,8,10,12,14 do LP[i] += LP[i+1] (since div=1 and step=1)
  • wait for local barrier
  • concurrently for i=0,4,8,12 do LP[i] += LP[i+2] (since div=2 and step=2)
  • wait for local barrier
  • concurrently for i=0,8 do LP[i] += LP[i+4] (since div=3 and step=4)
  • wait for local barrier
  • for i=0 do LP[i] += LP[i+8] (since div=4 and step=8)

so no address seems to be read and written concurrently anywhere.

In the end, all that loop does is count the number of 1's in LP: when it finishes, LP[0] contains the sum of all the original elements of the array.

notzed wrote:

Also, the following won't work, you need to use isnan():

if (diff == NAN)

per 6.3.e of the OpenCL 1.1 manual, any comparison involving NaN is false (that's got me plenty of times).

Thanks for the hint! Strange though: I'm running OpenCL 1.1 and the NaN comparison does work here, for sure. But OK, if isnan() is the better practice, I'll use that.

As for the optimizations: this is indeed not optimized code. I just found that I could reproduce the problem in this small snippet, so I posted it as-is. Most of the things you mention have their reasons in the full code (but they're cut away here for simplicity, of course). Thanks for the feedback!


Ahh, for some stupid reason I thought the <<div>> stuff was a paste-o. Too early in the morning, I guess.

Shouldn't it be (i>>div)<<div if that's what you're trying to do? I.e. clear out the lower bits, not just do a dance up and down the bit indices.

i.e. (i<<div)>>div will always == i if highest_set_bit(i) < (32-div), won't it?

If that is always true, the parallel prefix sum will still be correct on a CPU, and would work on an AMD GPU if you had <=64 threads (which you should try to achieve if possible, as the compiler would then remove the barriers entirely).
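To spell out the arithmetic (the loop keeps the invariant step == 1<<(div-1), so 2*step == 1<<div): the intended test is "is i a multiple of 2*step", and the three forms below agree, while the original does not:

if (i == (i>>div)<<div)           // clears the low div bits; true iff they were already 0
if ((i & ((1<<div) - 1)) == 0)    // same thing, written as a mask
if (i % (2*step) == 0)            // same thing, since 2*step == 1<<div

// the buggy (i<<div)>>div instead drops the HIGH bits, so it equals i
// whenever highest_set_bit(i) < 32-div -- i.e. essentially always here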

Oh really, did I switch them again? You're absolutely right, that should have been i == i>>div<<div.

Entirely my mistake. Sorry for the silly error, and a huge thanks for pointing it out!

As for the wavefront: yes, it would be possible to convert it to a 64-thread setup. Should this happen on all AMD GPUs? On my HD5450 (2 CUs @ 40 cores/CU), if I only allow a smaller local work size of 64 and handle several of the current work items per new work item, the total work time seems to increase significantly, rather than decrease due to the disappearance of the barriers.
