
tugrul_512bit
Adept III

Erroneous GPU behavior for atomic_add and atomic_sub

While running some OpenCL kernels on R7-240 and RX-550, I see this behavior:

float w_rnd(__global unsigned int * intSeeds, int id)
{
    uint maxint = 0;
    maxint--;                                  // maxint = UINT_MAX
    uint rndint = whash(intSeeds[id]);
    uint tmp0 = atomic_add(&intSeeds[id], 0);
    atomic_sub(&intSeeds[id], tmp0);
    atomic_add(&intSeeds[id], rndint);
    //intSeeds[id] = rndint;
    return ((float)rndint) / (float)maxint;
}

Here, using the commented-out line instead of the 3 atomic functions works fine on the R7-240 but not on the RX-550, so I tried these 3 atomics to force a value update, but that produced wrong values for the last 75 percent of the buffer. I need the seed values to be updated because some kernels call the random number generator more than once; I thought atomics could guarantee this, but they cannot.

The R7-240 works with both the atomics version and the simple version, but the RX-550 works with neither.

The whash function is just a randomizer:

uint whash(uint seed)
{
    seed = (seed ^ 61) ^ (seed >> 16);
    seed *= 9;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2d;
    seed = seed ^ (seed >> 15);
    return seed;
}

A kernel uses the random number generator to pick a random value and then, depending on that value, positions a unit on the map using the generator two more times. This puts 75% of the units on a zigzag line instead of distributing them normally over the map. The R7-240 distributes all units uniformly, but the RX-550 distributes only 25% of them (into the upper-left quadrant) and leaves all remaining units on an inclined line. This is for the atomics version.

The non-atomic version distributes better on the RX-550, but this time the distribution depends on the first use of the random number generator: if the first value is between 0 and 0.33 the units are placed correctly, but any value from 0.34 up to 0.99 puts units in wrong places within a narrower area of the map instead of uniformly over the whole map. So it is as if the first use of the generator makes the second and third uses produce a narrower range of values, yet it works perfectly on the R7-240. I suspect the R7-240 is not caching these values as read-only within the kernel while the RX-550 is caching them and reading non-updated values, but I may be wrong. Since OpenCL uses a relaxed memory model, this is possible, right?

What could be the hardware dependency affecting this?

Also, if I generate random numbers 10 times more, it works on the RX-550 too. Does running the same function 10 times force a value update, or is this simply a randomness issue? I use only thread-id values as seeds to start each random number sequence independently, so both the R7-240 and the RX-550 should behave the same with the whash() function, given the same thread-id values and the same device buffers.

Here is a kernel using this generator (the commented-out for loop fixes the erroneous behavior; I also asked someone with a Vega card to try it, and the same thing happened as on the RX-550, and the for-loop fix worked there as well):

__kernel void initShips(__global float * shipX, __global float * shipY,
                        __global uchar * shipState, __global float * shipRotation,
                        __global uchar * shipTeam, __global unsigned int * randSeeds,
                        __global int * shipHp, __global int * shipShield)
{
    float mapWidth=@@mapWidth@@;
    float mapHeight=@@mapHeight@@;
    int i=get_global_id(0); // ship id
    uchar team = (int)((w_rnd(randSeeds,i)*2.9999f));
    /*
    for(int j=0;j<10;j++)
    {
        team = (int)((w_rnd(randSeeds,i)*2.9999f));
    }
    */
    shipTeam[i]=team;
    if(team==0)
    {
        shipX[i]=0.001f+w_rnd(randSeeds,i)*mapWidth*0.498f;
        shipY[i]=mapHeight*0.05f + w_rnd(randSeeds,i)*mapHeight*0.90f;
        shipRotation[i]=w_rnd(randSeeds,i)*30.0f-15.0f;
    }
    else if(team==1)
    {
        shipX[i]=mapWidth*0.501f+ w_rnd(randSeeds,i)*mapWidth*0.498f;
        shipY[i]=mapHeight*0.5f + w_rnd(randSeeds,i)*mapHeight*0.5f;
        shipRotation[i]=w_rnd(randSeeds,i)*30.0f+165.0f;
    }
    else if(team==2)
    {
        shipX[i]=mapWidth*0.501f+w_rnd(randSeeds,i)*mapWidth*0.498f;
        shipY[i]=w_rnd(randSeeds,i)*mapHeight*0.5f;
        shipRotation[i]=w_rnd(randSeeds,i)*30.0f+165.0f;
    }
}

19 Replies
dipak
Staff

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Thanks for reporting it.

First, please help me understand the issue. From the description above, it seems that w_rnd() is the main suspect, and there are primarily two issues observed on the RX-550:

  1. The non-atomic version (i.e. direct assignment) does not work
  2. The atomic version works for initial values, but then produces erroneous results

Please correct me or add anything important if I missed something. Also, please mention the OS and driver version. It would be really helpful if you could share the host code and a reference to verify the correct values.

One thing about the atomic version: did you check with atomic_xchg() instead of atomic_add and atomic_sub?
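For reference, a minimal sketch of what that could look like (assuming the same whash helper and the one-seed-per-work-item layout from your kernel; the name w_rnd_xchg is only illustrative):

// Illustrative only: the three atomics collapsed into a single atomic exchange.
// This still assumes each work-item exclusively owns intSeeds[id].
float w_rnd_xchg(__global unsigned int * intSeeds, int id)
{
    uint maxint = 0;
    maxint--;                               // maxint = UINT_MAX
    uint rndint = whash(intSeeds[id]);      // hash the current seed
    atomic_xchg(&intSeeds[id], rndint);     // publish the new seed in one atomic step
    return ((float)rndint) / (float)maxint;
}

Since each work-item owns its own seed, the exchange mainly makes the read-modify-write explicit rather than adding any real cross-thread guarantee.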

Btw, testing the w_rnd() function with a simple host code, both versions of the code appear to work fine on Carrizo as well as Hawaii.

Regards,

tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

1. Direct assignment does not work for 67% of the buffer (the ship x-y coordinate buffers): the first use (direct assignment) affects the later ones, so the distribution is corrupted and the values are not in the 0-1 range but more like 0.7 - 0.9, and all ships get very crowded in that area of the map if the first assignment results in team=1 or team=2 (this is at least not so bad, since there is still some randomness, though weirdly different for each team). By "first use" I mean the first call to the generator function inside a kernel that calls it multiple times.

2. The atomic version is similar (again fine only for team=0), but this time team-1 and team-2 ships get equal x-y values, e.g. {(23,23), (24.5,24.5), (8100,8100)}, so they form a line instead of a cloud. I guess the y values override the x values or vice versa, or they somehow get two consecutive equal values, which has a very low probability of being seen across 670000 consecutive ships. This is obviously much worse than case (1).

I just installed CodeXL 1.9 and see no errors related to any buffer or any kernel; it only complains about "device not found" because I query all device types (GPU, CPU, ACC) separately and keep only the successfully returned devices.

The editor here somehow bugged and stopped me from editing the article further, so I couldn't show the rest of the kernel here (I did include system specs, but they're gone), even for editing.

64-bit Windows 10 Fall Creators Update, all updates installed. Driver: 17.11.1. The project is compiled as 64-bit.

You can find the kernel file in this rar: KaloriferBenchmarkGPU/EpicWarCL017.rar at master · tugrul512bit/KaloriferBenchmarkGPU · GitHub

The software design of the project in this link is so horrible that I feel embarrassed to share the source code, but all the OpenCL kernels are in the .cl file. The R7-240 fills the map as intended, but neither the RX-550 nor Vega can do the same unless the mentioned dummy for loop is added.

R7-240 (both the normal and the atomics version produce this):

[screenshot: Ih2v9y.jpg]

RX-550 / Vega, normal version:

[screenshot: FPx8wq.jpg]

RX-550, atomics version: they start as thin lines, then collisions make them look like this. I think even the first team gets somewhat unexpected values, but at least they are randomized.

[screenshot: PqbxoL.jpg]

So just because their team (and indirectly their color) is different, their x,y positions end up like this, and using atomics turns the green and blue teams into two inclined lines instead of a dense cloud.

Best Regards,

dragontamer
Journeyman III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Hey tugrul_512bit,

I'm just a hobbyist, but I think you're making a key beginner's mistake in your "atomics" code. Atomics do NOT provide ordering of memory operations. Ordering is provided by mem_fence (see: mem_fence).

The three atomics you wrote:

  1. uint tmp0=atomic_add(&intSeeds[id],0); 
  2. atomic_sub(&intSeeds[id],tmp0); 
  3. atomic_add(&intSeeds[id],rndint); 

Without a memory fence, these three atomics can theoretically execute in any order (when viewed from the lens of another thread), and therefore are not going to be the protection you want. Again, atomics do NOT provide any guarantees about ordering! If you used a single atomic (such as atom_cmpxchg ), you'd have stronger guarantees.

With that said, I don't see how your code ever synchronizes, since it only accesses intSeeds[id] and "id" is a value based on get_global_id(0). So the problem ultimately remains a mystery to me. But I just felt like chiming in about the memory fence issue. To have properly synchronized code, you need to be using atomics AND fences correctly. That said, I don't see any reason why things would be different on the two cards.

EDIT:

>>>  Here is a kernel using this generator (the commented-out for loop fixes the erroneous behavior; I also asked someone with a Vega card to try it, and the same thing happened as on the RX-550, and the for-loop fix worked there as well):

Honestly, I've never seen a random number generator like "whash" before. Where did you get it from? And why aren't you using something more standard like an LCG RNG? (See: Linear congruential generator - Wikipedia.) Some random number generators require 'ramp up' time to get going. Since you are only using each RNG 4 or 5 times and then returning, the output on the RX-550 / Vega might actually be the correct behavior, i.e. it's just not random yet.
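For example, a minimal per-work-item LCG sketch using the widely published Numerical Recipes constants (the helper name lcg_rnd and this seeding scheme are just an illustration, not something from your kernel):

// Illustrative per-work-item LCG (a = 1664525, c = 1013904223).
// Each work-item advances only its own seed, so no atomics are needed.
float lcg_rnd(__global unsigned int * seeds, int id)
{
    uint s = seeds[id];
    s = s * 1664525u + 1013904223u;   // one LCG step
    seeds[id] = s;                    // write the updated seed back
    return (float)s / 4294967295.0f;  // map to [0, 1]
}

An LCG like this has well-known statistical weaknesses, but it makes it easy to check on both cards whether the problem is the generator itself or the memory traffic around it.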

I'll have to play around with this later to really see what's going on. Have you verified the whash generator on both the 240 and the 550?

tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

I tried this:

    uint maxint=0;
    maxint--;
    uint rndint=whash(intSeeds[id]);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    uint tmp0=atomic_add(&intSeeds[id],0);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    atomic_sub(&intSeeds[id],tmp0);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    atomic_add(&intSeeds[id],rndint);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    //intSeeds[id]=rndint;
    return ((float)rndint)/(float)maxint;

It produces the same wrong distribution (lines) as before on the RX-550, and again works on the R7-240.

Thank you for sharing this knowledge. What can I do for the simple-assignment version? The compiler shouldn't reorder assignments when there is a dependency, right?

Edit: Thomas Wang is the author of the algorithm. Original link: http://www.concentric.net/~Ttwang/tech/inthash.htm, taken from: http://web.archive.or... and Quick And Easy GPU Random Numbers In D3D11 – Nathan Reed's coding blog

Edit-2: The R7-240 GPU does what I intended: it produces the distribution (normal or Gaussian, it doesn't matter since this is a game) that I needed.

Edit-3: mem_fence() doesn't fix the simple-assignment version either.

Edit-4: If the compiler can freely reorder everything, what can I do about, for example, a Kahan adder that I use in a scientific calculation? Does it get reordered so that the Kahan addition is actually broken too? In G++, I used inline assembly blocks to counter G++'s reorderings. What is the key to an OpenCL version of Kahan addition, for example? Should I expect mem_fence to do the job between the Kahan addition instructions?

Edit-5: In this project, there is code like this:

int something = atomic_add(somethingElse,foo);
if(something)
{
    int somethingDifferent=atomic_add(things,bar);
    ...
}

Does this force a priority such that "something" is always calculated first?

Edit-6: What if there is an aggressive optimization similar to reordering here:

seed = (seed ^ 61) ^ (seed >> 16);
seed *= 9;
seed = seed ^ (seed >> 4);
seed *= 0x27d4eb2d;
seed = seed ^ (seed >> 15);
return seed;

How do I make sure the private register operations are done serially, with the exact constants given here? Fence operations work only on local and global memory. Do we always need assembler/asm blocks to keep the order intact (especially for scientific code)?

Assuming I somehow get the binaries of the compiled kernels (offline compiled), does the GPU change that code (reordering, optimizing) on the fly?

dragontamer
Journeyman III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Hey tugrul_512bit,

To be honest, I haven't studied the OpenCL 1.2 atomics at all, so I don't really know the "proper" way to write a CAS loop using the 1.2 version of the atomics. This stuff is also new to me, but I think it's good practice if I try to explain it to someone else. I hope you don't mind my own inexperience in this matter...

> The compiler shouldn't reorder assignments when there is a dependency, right?

It's more than the compiler: it's also the memory subsystem (cache and write-back buffers especially) and the processor (I don't think AMD's GPU executes out of order... but x86, for example, sometimes executes instructions out of order). So the job of the programmer is to say when reorderings are a potentially bad thing and when they're okay.

> If the compiler can freely reorder everything

The compilers / CPUs / caches / memory don't reorder "anything": they reorder in particular ways that make our code faster. Memory fences exist so that the programmer can say "Hey, don't do that." From a single-threaded perspective, the compiler / CPU / cache can reorder anything as long as the thread's results look the same later on. But for multithreaded code, the reorderings may mess with the logic elsewhere (i.e. another thread is looking at the memory and RELIES on the order in which the memory changes).

If a programmer places a memory fence, it tells the compiler / CPU / cache not to reorder anything across it (causing a drop in performance). Acquire fences and release fences allow the reorderings we probably don't care about (i.e. only if there's a data race would the programmer be able to notice a difference).

So ultimately... no, that's not how you are supposed to use fences. Roughly speaking, a memory fence correlates to a "mutex lock" and a "mutex unlock". It says: "Yo compiler / CPU / cache: please don't move the memory stores/loads through this fence. I'm writing a mutex here and it's really, really important that the following memory loads/stores are done in the correct order, because another thread is watching you."
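As a purely illustrative sketch of that pattern (made-up buffer names, and keep in mind OpenCL 1.2 only really defines visibility inside a work-group, so this shows the intent of the fence rather than a portable recipe):

// Illustrative only: work-item 0 writes a payload, then raises a ready flag.
// The fence asks the compiler/hardware not to let the flag store become
// visible before the payload store.
__kernel void publish(__global int * payload, volatile __global int * readyFlag)
{
    if (get_global_id(0) == 0)
    {
        payload[0] = 42;                      // 1) write the data
        mem_fence(CLK_GLOBAL_MEM_FENCE);      // 2) order the two stores
        atomic_xchg(readyFlag, 1);            // 3) then raise the flag atomically
    }
}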

Does that make sense? Under single-threaded conditions, memory fences and atomic operations don't really do anything.

-------------

> How do I make sure the private register operations are done serially, with the exact constants given here?

They will be done in the exact order from a single-threaded perspective. The main issue is when you're talking to SLOW global memory, the transfer from L1 cache -> Main Memory may make some of those operations look like they happened out of order.

So from a single thread, everything that you write in your program should look the same (unless there's a compiler bug). Memory fences and atomics exist for when you have multiple threads talking through the same memory location. Because at that point, memory is so slow and CPUs / GPUs are so fast that main memory isn't updated fast enough to keep up with all of the memory changes. Indeed, memory is coalesced, batched up, distributed, cut up, and put back together again.
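To tie this back to your Kahan question: compensated summation done entirely in private registers is held together by ordinary data dependencies, so no fences are needed. A rough sketch (the kernel name and buffer layout are made up):

// Hypothetical kernel: Kahan (compensated) summation over a per-work-item
// slice of 'data'. All intermediates live in private registers, so ordinary
// data dependencies preserve the order; no fences or atomics are required.
__kernel void kahanSum(__global const float * data,
                       __global float * partialSums,
                       int itemsPerWorkItem)
{
    int i = get_global_id(0);
    float sum = 0.0f;
    float c = 0.0f;                       // running compensation
    for (int j = 0; j < itemsPerWorkItem; j++)
    {
        float y = data[i * itemsPerWorkItem + j] - c;
        float t = sum + y;
        c = (t - sum) - y;                // recover the lost low-order bits
        sum = t;
    }
    partialSums[i] = sum;                 // single write at the end
}

What can actually break Kahan summation is the compiler re-associating the floating-point math away, e.g. if you build with -cl-fast-relaxed-math, so avoid relaxed-math options for that kind of kernel.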

-------------

In short: it doesn't seem like your code "depends on other threads", so I'm not sure if you need fences or atomics at all. Sorry for the wild goose chase...

-------------

Now, to fully answer your question... the proper way to write a CAS loop is described in this blog post: You Can Do Any Kind of Atomic Read-Modify-Write Operation

Unfortunately, that's using the C++11 compare-and-exchange operation (which only exists in OpenCL 2.0). I'm sure there's a way to do it correctly in OpenCL 1.2, but I'm not confident in my ability to write that for you...
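For what it's worth, a read-modify-write loop can also be written with the OpenCL 1.2 built-in atomic_cmpxchg; a rough, untested sketch (the function name is made up, and it's arguably unnecessary here since each work-item owns its own seed):

// Keep retrying until the seed we hashed is still the one stored,
// then swap in the new value (classic compare-and-swap loop).
uint update_seed_cas(__global unsigned int * intSeeds, int id)
{
    uint oldSeed, newSeed;
    do {
        oldSeed = atomic_add(&intSeeds[id], 0);   // atomic read of the current seed
        newSeed = whash(oldSeed);                 // compute the replacement
    } while (atomic_cmpxchg(&intSeeds[id], oldSeed, newSeed) != oldSeed);
    return newSeed;
}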

tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Thank you Dragontamer,

If I do

a[id]=5;

b[id]=a[id];

should I not always expect 5 in that element of b, especially given OpenCL's relaxed memory model? (edit: a and b are global)

It seems that having one kernel that only generates random numbers (using each seed only once per kernel launch) and a separate kernel that uses them is more likely to be stable across different GPUs.

Then I should generate a random number using private registers, pass it to the next generation in a chained input-return way so that all generations use the same private register, and write the resulting seed value to the global array only once, at the very end of the kernel.

Best regards,

dragontamer
Journeyman III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

The problem is that "between" lines 1 and 2, ANOTHER thread or work-item could have changed the value of a[id].

If you can guarantee me that NO ONE ELSE touches a[id], then b[id] == 5 and all is fine in the world. The problem is that there are thousands of cores working simultaneously, updating memory in complicated ways. Even the CPU could theoretically be changing the value of a[id], since global memory is theoretically accessible by the host!

That's what it means to have a parallel computer: other cores might be changing your memory "in between" your lines of code! So first, you have to promise me that no other core (including the CPU) is touching a[id].

If you can promise me that, then b[id] == 5.

-----------

EDIT:

> Then I should generate a random number using private registers, pass it to the next generation in a chained input-return way so that all generations use the same private register, and write the resulting seed value to the global array only once, at the very end of the kernel.

I think as long as you can ensure that each work-item only touches "its personal seed", you should be fine. The main issue is synchronization. It's fine to have a bunch of seeds in global memory as long as no other thread is messing with them: work-item #1 gets seed #1, work-item #2 gets seed #2... etc.

As long as you work in that scheme, you shouldn't have a synchronization issue.
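A sketch of that scheme might look like this (w_rnd_priv and initShipsPriv are just illustrative names; your real kernel would keep its positioning logic):

// Each work-item copies its seed into a private register, chains all
// generator calls through that register, and writes it back exactly once.
float w_rnd_priv(uint * seed)                 // 'seed' points to private memory
{
    *seed = whash(*seed);
    return (float)(*seed) / 4294967295.0f;
}

__kernel void initShipsPriv(__global unsigned int * randSeeds,
                            __global float * shipX)
{
    int i = get_global_id(0);
    uint seed = randSeeds[i];                 // one global read per work-item
    float r0 = w_rnd_priv(&seed);             // chained uses share the register
    float r1 = w_rnd_priv(&seed);
    shipX[i] = r0 * r1;                       // ...real positioning logic goes here
    randSeeds[i] = seed;                      // single write-back at the end
}

The important part is that the seed touches global memory exactly twice: one load at the start and one store at the end.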

tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Yes, I guarantee that no other work-item ever accesses that element. It's perfectly parallel, 1-to-1 data flow. (edit: and yes, each ship has its own personal random number generator; edit-2: I don't add a time factor to the seeds, so every new simulation must behave exactly the same)

dipak
Staff

Re: Erroneous GPU behavior for atomic_add and atomic_sub

My graphics driver stopped responding when I ran the above executable and clicked the detected device. I'm not sure if anything needs to be configured before running the application.

I think it would be better if we focus on the main suspected area of the code that you pointed out. I tried to create a simple test case that could be used as a repro for this issue. Please find the attachment.

The test case compares the result produced by the device with the CPU result and prints whether they match. It's very simple. Please try both the direct-assignment and the atomic version in the kernel.

As per your description above, the code should produce a match on the R7 240 but a mismatch on the RX-550.
