OpenCL atomic_add and atomic_inc not working correctly

atom_inc(system) does not appear to atomically increment the local uint system[0], whereas atom_xchg(system, system[0] + 1) does.

I also see this behaviour on clover running with LLVM 5.0.

pocl 0.14 (which I use on the Opteron CPUs) shows no difference between the two; it runs on LLVM 4.0.1.

Does this look like an LLVM error, or is it compiler related?

This piece of code:

if(!output[14]) output[14] = system[0] + 1;
if(!output[15]) output[15] = system[0];

prints the following in gdb (this is the same on clover and amdgpu-pro running on LLVM 5.0!):

Breakpoint 1, worker (device_obj=0x609490) at ./engine.c:397

397                 if(answer[3] == 255) {

(gdb) print answer

$1 = {0, 0, 0, 0, 255, 276, 340, 804850955, 40962, 0, 0, 0, 0, 0, 1, 64}

(gdb) print answer[14]

$2 = 1

(gdb) print answer[15]

$3 = 64

Output on pocl:

Breakpoint 1, worker (device_obj=0x609490) at ./engine.c:397

397                 if(answer[3] == 255) {

(gdb) print answer[14]

$1 = 1

(gdb) print answer[15]

$2 = 1


System details when running amdgpu-pro (see also clinfo.txt):

Linux 4.10.17

GCC 7.1.0

LLVM/Clang 5.0.0

amdgpu-pro 17.30

pocl 0.14

experiment3.tgz contains the source. I inserted a debug function which dumps the private variables of work-item (0,0,0) to the output buffer. Feel free to ask if you need it:

dump_global_output(const uchar* array, const uchar* array2, const int outputoffset, global uint* output)

It takes the first 4 bytes from array and array2, and dumps mark (0xff), address(1), address(2), content(1) and content(2) to output[offset].

As the output buffer is currently only 16 ints wide, and the program by itself needs output[0-3] to operate, I think you can only use 4 and 9 as output offsets.

dumps.tgz contains the clover dump files (LLVM, assembly) to show what happens when changing atom_xchg to atom_add or atom_inc. I do not know how to generate these on amdgpu-pro.

To run the program, you need a Linux system with pthreads and OpenCL installed.

To view the debug output, you need gdb. Use these commands:

gdb ./a.out

break 397

print answer

good luck!

Message was edited by: janpieter sollie

2 Replies

OK, it seems the error is mine here:

Atomic functions do not merge the work-items' operations into a single one; they serialize them, so every one of them is executed, one after another. I had assumed the former.

This would explain the difference between running on CPU and GPU. Can someone confirm?
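To illustrate the serialization, here is a minimal sketch in plain C rather than OpenCL. It assumes pthreads and C11 atomics as a stand-in for work-items and atom_inc; the names work_item and run_increments are made up for this illustration and are not taken from experiment3.tgz. Every thread increments the shared counter once and keeps the value that the increment returned.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Shared counter, standing in for the local system[0] of the kernel. */
static atomic_uint counter;

/* Each thread plays one work-item: it increments the counter once and
 * stores the value the counter held just before its own increment. */
static void *work_item(void *arg)
{
    unsigned *old_value = arg;
    *old_value = atomic_fetch_add(&counter, 1u);
    return NULL;
}

/* Runs n "work-items" and returns the final counter value.
 * old_values[i] receives what work-item i got back from its increment.
 * Returns 0 if the threads could not all be created. */
unsigned run_increments(unsigned n, unsigned *old_values)
{
    assert(old_values != NULL);

    pthread_t *threads = malloc(n * sizeof *threads);
    if (threads == NULL)
        return 0;

    atomic_store(&counter, 0u);

    unsigned started = 0;
    for (; started < n; ++started) {
        if (pthread_create(&threads[started], NULL, work_item,
                           &old_values[started]) != 0)
            break;
    }
    for (unsigned i = 0; i < started; ++i)
        pthread_join(threads[i], NULL);

    free(threads);
    return started == n ? atomic_load(&counter) : 0;
}
```

With 64 threads the final value is 64, not 1, and every thread gets back a different old value between 0 and 63. Which thread gets which value is not defined. Using the returned value, instead of reading the counter again afterwards, is what keeps the result independent of the execution order.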




After a quick look, it seems that the outcome of the atomic operations depends on their execution order (in particular, on the execution order of the work-items). You should not assume any such order unless it is explicitly defined. The actual execution order of work-items (and of atomic instructions) may vary depending on the implementation and the hardware device.

In the AMD implementation, work-items are executed very differently on CPU and GPU devices. The CPU implementation runs work-items from the same work-group back-to-back on the same physical CPU core. That may be the reason for the difference observed between CPU and GPU. However, you should not see this difference when atomics are used correctly.
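The effect of the execution order can be shown with a small sequential model in plain C. This is only a sketch under assumptions: it supposes that each work-item increments the counter once and then reads it back, and that the work-group has 64 work-items. The function names are hypothetical, and the two schedules are simplifications of what a CPU or GPU implementation really does.

```c
#include <assert.h>

/* Model of a CPU-style schedule: work-item 0 does its increment and its
 * read-back before any other work-item of the group runs.
 * Returns the value work-item 0 reads back. */
unsigned read_back_to_back(unsigned work_items)
{
    assert(work_items > 0);

    unsigned counter = 0;
    unsigned seen_by_item0 = 0;
    for (unsigned id = 0; id < work_items; ++id) {
        counter += 1;                /* this work-item's atomic increment */
        if (id == 0)
            seen_by_item0 = counter; /* its read-back follows immediately */
    }
    return seen_by_item0;
}

/* Model of a GPU-style schedule: every work-item performs its increment
 * before any work-item reaches the read-back.
 * Returns the value work-item 0 reads back. */
unsigned read_lockstep(unsigned work_items)
{
    assert(work_items > 0);

    unsigned counter = 0;
    for (unsigned id = 0; id < work_items; ++id)
        counter += 1;                /* all increments happen first */
    return counter;                  /* work-item 0 reads after all of them */
}
```

For 64 work-items, read_back_to_back gives 1 and read_lockstep gives 64. Under these assumptions, that matches the answer[15] values seen on pocl and on the GPU. Both schedules are valid, so a kernel that reads the counter back after incrementing it depends on an order that OpenCL does not guarantee.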