Archives Discussions

Bdot
Adept III

atomic_inc not reliable?

1024 calls to atomic_inc result in only ~800 increments

Hi,

How do I use atomic_inc correctly? I have a test program in which either some of the atomic_inc's are skipped, or the function is not as atomic as it should be. Can someone point out the error?

When the attached kernel is started with fewer than 10 threads, the resulting buffer contains

1 1 1 1 0 0 0 ... 0

With more threads it begins to increase; for 10-12 threads it looks like

5 2 2 2 3 3 3 4 4 4 5 5 5 5 5 5 0 0 0 ... 0

With an increasing number of threads, it appears to execute on average only about 80% of the atomic_inc's. There is no difference when using atom_inc instead.

My expectation would be that the first number always contains the total number of threads (since atomic_inc is run once by each thread). The following numbers can be out of order, that's no problem. But in the example above, thread 5 seems to have written its results twice ...

I'm running a HD5770 with Catalyst 11.6 on SuSE11.4/64-bit.

Thanks

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

#define ATOMIC_INC(x) atomic_inc(&x)
//#define ATOMIC_INC(x) atom_inc(&x)
//#define ATOMIC_INC(x) ((x)++)

__kernel void test_k(__global uint *res)
{
    __local uint i, f;

    f = get_global_id(0);
    if (f == 0)
    {
        for (i = 0; i < 32; i++)
            res[i] = 0;
    }
    f++;                        // let the reported results start with 1

    i = ATOMIC_INC(res[0]);
    if (i < 10)                 /* limit to 10 results */
    {
        res[i*3 + 1] = f;
        res[i*3 + 2] = f;
        res[i*3 + 3] = f;
    }
}

8 Replies
maximmoroz
Journeyman III

Try putting barrier(CLK_GLOBAL_MEM_FENCE); after f++;
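For illustration, roughly where that barrier would go in the kernel above (a sketch of the placement only, not code from this thread; note that barrier() only synchronizes the work-items within one work-group):

    /* initialization of res[] by work-item 0, as in the kernel above */

    f++;                            // let the reported results start with 1
    barrier(CLK_GLOBAL_MEM_FENCE);  // no work-item in the work-group proceeds
                                    // until the initialization above is done

    i = ATOMIC_INC(res[0]);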

Meteorhead
Challenger

Either I do not really see the point of this example, or you're getting something very wrong. First of all, you're missing a barrier. The first thread initializes the result data, but the rest of the threads rush through and write their output before the result buffer is initialized. (As far as I can see it does not cause any mistakes here, but it is a mistake that will cause trouble later.)

Second, I do not see how you can expect the first element to hold the number of threads... All threads assign a value to the same __local variable, which is then used inside the first thread's initializer loop. Using a __local variable as a for-loop counter is a) SLOW, b) NONSENSE.

This kernel is bleeding from many wounds, and I suggest a heavy rethink of when to use __private and __local variables. Generally, ONLY use __local when you want a value shared across threads. Do not reuse __local variables for things that should really be stored in registers (__private).


The variable f is in the local address space, but it is incremented by all the threads in the work-group, which itself would require atomics. Also, you should keep the global id of each thread in its private address space. So the simple solution is to make f a private variable.
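Putting the two suggestions together, a sketch of the kernel with i and f held in private registers (illustrative only, not code posted in this thread; the barrier still only synchronizes within a single work-group):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void test_k(__global uint *res)
{
    // no address-space qualifier => __private, i.e. one copy per work-item
    uint f = get_global_id(0);
    uint i;

    if (f == 0)
    {
        for (i = 0; i < 32; i++)
            res[i] = 0;             // first work-item clears the result buffer
    }
    f++;                            // let the reported results start with 1
    barrier(CLK_GLOBAL_MEM_FENCE);  // wait until res[] is initialized

    i = atomic_inc(&res[0]);        // reserve the next free result slot
    if (i < 10)                     // limit to 10 results
    {
        res[i*3 + 1] = f;
        res[i*3 + 2] = f;
        res[i*3 + 3] = f;
    }
}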


Sorry, __local was meant to be __private (for both i and f). I changed that.

I also added the barrier(CLK_GLOBAL_MEM_FENCE) as suggested after f++.

The problem remains.

The kernel is just some test that I wrote after seeing a problem in my application which correctly uses __private vars. There, I also use a host-initialized in-out-buffer.

 


Bdot,

I don't think using barrier(CLK_GLOBAL_MEM_FENCE) is of any use, as f is now a register and every thread only modifies its own copy.

Can you explain how you are checking the atomic behaviour? I think the assignments res[i*3 + 1] = f; ... also have the atomicity issue, as again every thread will try to write to the same location. Correct usage should be like the attached, where you can see the same value returned by all threads.

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

#define ATOMIC_INC(x) atomic_inc(&x)
//#define ATOMIC_INC(x) atom_inc(&x)
//#define ATOMIC_INC(x) ((x)++)

__kernel void test_k(__global uint *res)
{
    __local uint i, f;

    f = get_global_id(0);
    if (f == 0)
    {
        for (i = 0; i < 32; i++)
            res[i] = 0;
    }
    f++;                        // let the reported results start with 1

    i = ATOMIC_INC(res[0]);
    printf("value at res[0] %d", res[0]);
}


The intended behaviour is that each thread gets its own unique index i. In my application only a very small fraction of the threads ever reaches this point - most of the threads do not have any result to report. I want to save the bandwidth of copying a million int's back to the CPU when I usually have less than 3 results.

The atomic_inc is supposed to return the old value of res[0], which is then used by each thread as an index into the result buffer. The total number of threads that ever reached the atomic_inc should be in res[0] after the kernel has finished. Then I copy the buffer back (clEnqueueReadBuffer) and examine the results on the CPU.

The res[i*3 + 1] = f; ... in this test kernel is just a placeholder so I can see which thread wrote those bytes. The real application will of course write the real result.
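As a host-side illustration of that flow (C with the OpenCL API; the 10-result limit matches the test kernel above, while the function name, queue and buffer handles are made up for the example):

#include <CL/cl.h>
#include <stdio.h>

#define MAX_RESULTS 10                    /* matches the limit in the test kernel */

/* After the kernel has finished: read back the counter and the reported slots. */
void read_results(cl_command_queue queue, cl_mem res_buf)
{
    cl_uint res[1 + 3 * MAX_RESULTS];

    /* blocking read of the small result buffer only */
    clEnqueueReadBuffer(queue, res_buf, CL_TRUE, 0, sizeof(res), res,
                        0, NULL, NULL);

    cl_uint reported = res[0] < MAX_RESULTS ? res[0] : MAX_RESULTS;
    printf("threads that reached atomic_inc: %u\n", res[0]);
    for (cl_uint i = 0; i < reported; i++)
        printf("result %u written by thread %u\n", i, res[i*3 + 1]);
}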

 

 


Hi,

I had a similar problem; I found that the "if" that checks whether the result needs to be reported gives a huge performance penalty in the compute kernel, so I split the kernel in two: the first one computes all the results and stores them in a large enough buffer (in device memory). Then a second kernel takes the results one by one, checks their usefulness and writes the reportable ones into a second buffer whose offset is controlled with an atomic_inc.
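A rough sketch of that two-pass idea (the kernel names, the MAX_REPORTS limit and the "reportable" test are invented here for illustration):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

#define MAX_REPORTS 16

/* Pass 1: every work-item stores its raw result, with no divergent "if" */
__kernel void compute_all(__global uint *raw)
{
    uint gid = get_global_id(0);
    raw[gid] = gid % 1000u;              // stand-in for the real computation
}

/* Pass 2: compact the few reportable results into a small buffer.
   out[0] is the atomically maintained count, the results follow it. */
__kernel void compact_results(__global const uint *raw, __global uint *out)
{
    uint gid   = get_global_id(0);
    uint value = raw[gid];

    if (value == 0)                      // stand-in for "is this result reportable?"
    {
        uint slot = atomic_inc(&out[0]); // reserve the next free output slot
        if (slot < MAX_REPORTS)
            out[1 + slot] = gid;         // record which work-item produced it
    }
}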

This thread could be useful for you: http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=135097&forumid=9

By the way: it's better to initialize the buffer with zeros on the CPU and transfer it to the GPU than to check whether get_global_id(0) is zero and initialize the buffer on the GPU (at least for me).
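A host-side sketch of that (C with the OpenCL API; the buffer size matches the test kernel, the function name is made up):

#include <CL/cl.h>
#include <string.h>

#define RESULT_UINTS (1 + 3 * 10)        /* res[0] counter plus 10 three-word results */

/* Create the result buffer already zeroed on the host,
   instead of clearing it inside the kernel. */
cl_mem create_zeroed_result_buffer(cl_context context, cl_int *err)
{
    cl_uint zeros[RESULT_UINTS];
    memset(zeros, 0, sizeof(zeros));

    /* CL_MEM_COPY_HOST_PTR copies the zeroed host array into the new buffer */
    return clCreateBuffer(context,
                          CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                          sizeof(zeros), zeros, err);
}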

best regards,

Alfonso


I just installed the 11.7 drivers, and now all the problems have vanished. Both atom_inc and atomic_inc now work as expected on both the CPU and the GPU.

 

Thanks a lot for your help!

I'll try the split kernel as well, although I assume it will not speed up normal operation, as my application most of the time has no results to report, sometimes one, and very rarely more than one.
