tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

I ran the attachment and it said "Matched" for all cases. Then I made simple additions to use it 3 times per work-item (as in the project, team -> x and y). It said "Not Matched" only for the RX 550.

Here are the changes I made:

I don't know why, but only changing the generator name lets me post the code here:

C++:

ELE_TYPE w_hash(ELE_TYPE seed)
{
    seed = (seed ^ 61) ^ (seed >> 16);
    seed *= 9;
    seed = seed ^ (seed >> 4);
    seed *= 0x27d4eb2d;
    seed = seed ^ (seed >> 15);
    return seed;
}

C++ - test part:

bool isMatched = true;
unsigned int maxint = 0; maxint--;   // maxint = UINT_MAX
for (int i = 0; i < NUM_ELEMENTS; ++i) {
    unsigned int testVal = w_hash(hostBuff[i]);
    if (testVal < maxint / 2)
    {
        testVal = w_hash(w_hash(w_hash(hostBuff[i])));
    }
    if (testVal != devBuff[i]) {
        isMatched = false;
        break;
    }
}

and in the kernel:

uint w_rnd_direct(__global unsigned int * intSeeds, int id)
{
    //uint maxint=0;
    //maxint--;
    uint rndint = w_hash(intSeeds[id]);         // hash the current seed
    intSeeds[id] = rndint;                      // store it back with a plain write
    return rndint;
}

uint w_rnd_atomic(__global unsigned int * intSeeds, int id)
{
    //uint maxint=0;
    //maxint--;
    uint rndint = w_hash(intSeeds[id]);         // hash the current seed
    uint tmp0 = atomic_add(&intSeeds[id], 0);   // read the current value atomically
    atomic_sub(&intSeeds[id], tmp0);            // subtract it to bring the slot to zero
    atomic_add(&intSeeds[id], rndint);          // add the new value
    return rndint;
}

void kernel testkernel(global ELE_TYPE *intSeeds)
{
    int id = get_global_id(0);
    uint maxint = 0;
    maxint--;

    /*
    uint testVal = w_rnd_direct(intSeeds, id);
    if (testVal < (maxint / 2)) {
        w_rnd_direct(intSeeds, id);
        w_rnd_direct(intSeeds, id);
    }
    */

    uint testVal = w_rnd_atomic(intSeeds, id);
    if (testVal < (maxint / 2)) {
        w_rnd_atomic(intSeeds, id);
        w_rnd_atomic(intSeeds, id);
    }
}

Strangely, the direct-assignment version works this time. I also tested in a project that uses only private registers and writes to global memory only once at the end; both cards produce the desired output.

In my opinion, the direct version only goes wrong when three random values are saved to global arrays (maybe so that the optimizer can't omit anything).

dipak
Staff

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Looks interesting. I don't have an RX 550 to verify myself. Could you please try the options below and share your findings?

  1. atomic_xchg instead of atomic_add and atomic_sub
  2. A global memfence after each w_rnd_atomic (see the sketch after this list)
  3. Compilation with -O0
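
A minimal sketch of option 2's fence placement, reusing the names from the testkernel above (illustrative only, not the exact code to run):

uint testVal = w_rnd_atomic(intSeeds, id);
mem_fence(CLK_GLOBAL_MEM_FENCE);     // fence after the first generator call
if (testVal < (maxint / 2)) {
    w_rnd_atomic(intSeeds, id);
    mem_fence(CLK_GLOBAL_MEM_FENCE); // fence after the second call
    w_rnd_atomic(intSeeds, id);
    mem_fence(CLK_GLOBAL_MEM_FENCE); // fence after the third call
}
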
tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

atomic_xchg failed, -O0 succeeded, and mem_fence(CLK_GLOBAL_MEM_FENCE) between all generator calls also succeeded. All tests were done independently. Adding fences inside the generator works too! (Again, all of this is for the RX 550; the R7 240 runs fine with all of these.)
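
A minimal sketch of the fence-inside-the-generator placement described here, reusing w_rnd_atomic from the earlier post (the exact placement is an assumption for illustration):

uint w_rnd_atomic_fenced(__global unsigned int * intSeeds, int id)
{
    uint rndint = w_hash(intSeeds[id]);
    mem_fence(CLK_GLOBAL_MEM_FENCE);            // keep the seed read ordered before the atomics
    uint tmp0 = atomic_add(&intSeeds[id], 0);
    atomic_sub(&intSeeds[id], tmp0);
    atomic_add(&intSeeds[id], rndint);
    mem_fence(CLK_GLOBAL_MEM_FENCE);            // keep the atomics ordered before the next generator call
    return rndint;
}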

I don't know if it helps, but in the project I didn't use "-cl-std=CL1.2". Is this always needed? Maybe that is the reason the Nvidia cards are unable to run my project?

dipak
Staff

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Thanks for sharing these findings. I suspected something like that. It means that using a memfence between the w_rnd_atomic calls keeps the order and visibility of the atomic operations as expected and prevents the compiler from doing any reordering. The compiler tool-chain used for the R7 240 is different, so it might have worked there. I'll still verify with the compiler team.

Use of the "-cl-std" flag is optional, so you can omit it. I copied the host code from one of my test projects where the flag was used to compile the kernel for multiple versions.

tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Thank you very much for helping. I'll start converting the global versions to register versions (instead of using atomics on globals) and write to global memory only once per kernel; this should work for all GPUs. I always thought fence functions were for other work-items to see a work-item's changes to global/local memory; I didn't think about reordering. I was going to compare this to Nvidia cards (to see if they act the same as the R7 240 or RX 550), but it doesn't work there and gives a CL_INVALID_COMMAND_QUEUE error (according to an Nvidia user), even though the queue/context/buffer/NDRange commands are all fine for CodeXL 1.9.
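
A minimal sketch of that register-only approach, assuming the same w_hash, ELE_TYPE, and intSeeds names as the earlier kernel (the kernel name is made up; this is an illustration, not the final project code):

void kernel testkernel_private(global ELE_TYPE *intSeeds)
{
    int id = get_global_id(0);

    // Keep the evolving seed in a private register; no global traffic in between.
    uint seed = intSeeds[id];
    seed = w_hash(seed);
    seed = w_hash(seed);
    seed = w_hash(seed);

    // Write back to global memory only once, at the end of the kernel.
    intSeeds[id] = seed;
}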

dipak
Staff

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Here is a nice thread, "barrier vs mem_fence?", where MicahVillmow explained mem_fence very clearly, as quoted below:

As for a fence operation:

A fence operation instructs the compiler not to reorder any memory instructions around the fence instruction. There is no synchronization done, so on a mem_fence instruction, there is no guarantee that any load/store from another work-item to either local or global memory is visible to the current work-item. The only guarantee of mem_fence is that loads/stores before the fence will be executed before loads/stores after the fence. Memory consistency in OpenCL is only guaranteed within a work-item, and that work-item is unique in its access of memory throughout the NDRange. The only exception is synchronization on the local address space by work-items in a work-group via the barrier instruction.
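
A tiny illustration of that guarantee (buf, a, and b are made-up names for the example):

// Inside a single work-item:
buf[id] = a;                      // store 1
mem_fence(CLK_GLOBAL_MEM_FENCE);  // store 1 is issued before store 2 for THIS work-item
buf[id + 1] = b;                  // store 2
// No guarantee that any other work-item sees either store yet;
// mem_fence only stops this work-item's loads/stores from being reordered across it.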

tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Thank you again. So, once the compiler team verifies it, it's my fault for not putting fences between all generator calls (simple direct assignment or atomic), right? I'm sorry.

dipak
Staff

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Nothing to be sorry about here. Sometimes it is a little bit tricky to understand the behavior. As said earlier, I'll still verify this with the compiler team and confirm with you.

dragontamer
Journeyman III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

I'm not sure if this is a compiler bug or not yet. I haven't really looked very deeply into your code.

But the idea of a memory fence is to give higher-level programs the ability to create mutexes... AND to let the compiler / CPU / cache make optimizations that are beneficial to the programmer. Communication between the programmer and the compiler+system is important.

Let's look at some very basic mutex code.

x=0; // At the start of the program

....

// Later in some thread
x++;
mutex.lock();
x++;
mutex.unlock();
x++;

In this case, let's assume the compiler and the CPU do nothing (for simplicity; the compiler definitely wants to move code around, but the cache is all I need for this theoretical situation). Let's look at this purely at the cache level. The cache wants to reorder your code, because "x" is being written to multiple times. So the cache wants to update x "all at once", because it's very expensive to talk to main memory!

Let's say "mutex" is in main memory, while "x" is in cache. Therefore, the cache will want to do the following:

x++;
x++;
x++;
mutex.lock();
mutex.unlock();

The cache WANTS to do this, but if it did, then your code would be incorrect. Clearly, the x++ needs at the very least to be on the "inside" of the locked region, the so-called "critical section". How do we communicate this idea of reorderings to the cache, while still giving the cache (or any other part of the system) the ability to make "sane" optimizations?

Well, any good "mutex.lock()" will put an "acquire fence", and any good "mutex.unlock()" will put a "release fence". Roughly speaking, an "acquire fence" means that the following section of code is acquiring a critical section, while the "release fence" means that the previous lines of code are "releasing" the critical section. Of course, a mutex does more than just fences, so there's still lots of other code to be done.

x++;
mutex.lock();   // This is an acquire fence internally
x++;
mutex.unlock(); // This is a release fence internally
x++;

Now, the middle x++ cannot move, because we have a properly specified memory fence.

But... the compiler / cache / CPU is STILL free to reorder the instructions to some degree, to improve the speed of our code. In this case, what will most likely be done is:

mutex.lock();
x++; // The first ++ is "moved down"
x++; // The original ++
x++; // The final ++ is "moved up"
mutex.unlock();

In this case, we were writing x++ "outside" of the critical section, the 1st and 3rd ones. Therefore, the first and 3rd x++ instructions are a data race. Other threads might see any value of x, from 1, 2, or 3, and it would still be valid. The compiler / CPU / cache systems SHOULD be able to "pick any data race" and choose the fastest one. In this case...

mutex.lock();
x += 3;
mutex.unlock();

This is clearly the fastest implementation that's still correct. The compiler merged all the ++ instructions, and is simply defining that "this thread wins the undefined data-race, so other threads will always see x=3".

Does that make sense? The formal definition is more formal. But the "basic idea" of this whole setup is to make it easy to write mutexes: one fence at the start of a critical section, and one fence at the end of a critical section. The "mutex logic" still needs to be implemented with atomics, however (!!!), because synchronization is more than just juggling the potential reorderings of memory. Furthermore, a fence technically still needs an atomic to be "relative" to in order to be properly defined. The fence defines which instructions can't move, while the atomic defines what the instructions can't move past.

x++;
y++;
z++;

atomic_work_item_fence(CLK_GLOBAL_MEM_FENCE, memory_order_acquire, memory_scope_device); // This fence starts a critical section below it

while(!atomic_compare_exchange_weak(mutex_variable, 0, 1)); // Simple CAS-based mutex. Keep trying to turn the 0-flag into a 1. Failure means someone else is in the critical section.

// This code is now protected by the acquire fence. None of the code written here can move "above" the "atomic_compare_exchange_weak" instruction.

Notice that x, y, and z are NOT protected by the acquire fence, because they're in a "data-race" area. If they were moved into the critical section, the code would still be correct. All of this post describes OpenCL 2.0 atomics and fences (as well as C++ atomics). Again, I don't know how to use OpenCL 1.2 atomics, but the same idea probably exists there.
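
A minimal sketch of the matching unlock side under the same OpenCL 2.0 assumptions, reusing mutex_variable from the snippet above (illustrative, not part of the original post):

// ... critical-section work happens here ...

atomic_work_item_fence(CLK_GLOBAL_MEM_FENCE, memory_order_release, memory_scope_device); // Nothing above this fence may move below the unlocking store.

atomic_store_explicit(mutex_variable, 0, memory_order_relaxed, memory_scope_device); // Reset the flag to 0 so another work-item can acquire the lock.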

tugrul_512bit
Adept III

Re: Erroneous GPU behavior for atomic_add and atomic_sub

Yes, this makes sense, especially for the random number generators here. The 3 calls can be interchanged since I don't care which one is which, as long as they randomize; similar to x += 3. I learn a new thing every day, but I hope I don't forget too much. Very enlightening.

So much responsibility for controlling compiler/cache/CPU reorderings! But it is fun.
