8 Replies Latest reply on Jul 29, 2011 10:35 AM by Bdot

    atomic_inc not reliable?

    Bdot
      1024 times atomic_inc results in ~800 increases

      Hi,

      How do I use atomic_inc correctly? I have a test program that either skips some of the atomic_inc's, or the function is not as atomic as it should be. Can someone point out the error?

      When the attached kernel is started with less than 10 threads, the resulting buffer contains

      1 1 1 1 0 0 0 ... 0

      With even more threads, it begins to increase, for 10-12 threads it looks like

      5 2 2 2 3 3 3 4 4 4 5 5 5 5 5 5 0 0 0 ... 0

      With increasing number of threads it appears to execute an average of about 80% of the atomic_inc's. There is no difference when using atom_inc instead.

      My expectation would be that the first number always contains the total number of threads (due to the atomic_inc being run once by each thread). The following number can be out-of-order, that's no problem. But in the example above, thread 5 seems to have written its results twice ...

      I'm running a HD5770 with Catalyst 11.6 on SuSE11.4/64-bit.

      Thanks

      #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable #define ATOMIC_INC(x) atomic_inc(&x) //#define ATOMIC_INC(x) atom_inc(&x) //#define ATOMIC_INC(x) ((x)++) __kernel void test_k( __global uint *res) { __local uint i,f; f = get_global_id(0); if (f==0) { for(i=0;i<32;i++) res[i]=0; } f++; // let the reported results start with 1 i=ATOMIC_INC(res[0]); if(i<10) /* limit to 10 results */ { res[i*3 + 1]=f; res[i*3 + 2]=f; res[i*3 + 3]=f; } }

        • atomic_inc not reliable?
          maximmoroz

          Try putting barrier(CLK_GLOBAL_MEM_FENCE); after f++;

          • atomic_inc not reliable?
            Meteorhead

            Either I do not really see the point of this example, or you're getting something very wrong. First of all, you're missing a barrier. First thread initializes result data, but the the rest of the threads rush through and already make output before result is initialized. (It does not cause any mistakes as I can see, but it is a mistake that will cause trouble later.)

            Second, I do not see how you would expect to see the number of threads in the first element... All threads give value to the same __local variable, that is used inside the first threads data initializer loop. Using __local variable inside a for loop is a) SLOW, b) NONSENSE.

            This kernel is bleeding from many wounds and I suggest a heavy rethink of when to use __private and __local variables. Generally, ONLY use __local when you want a share across threads. Do not try to reuse __local variables for matters that should really be stored in registers (__private).

              • atomic_inc not reliable?
                himanshu.gautam

                variable f is a local address space variable, but it is incremented by all the threads in the workgroup, which itself requires atomics. Also you should get the global id of threads in their private address space. So Simple solution should be to make f a private member.

                  • atomic_inc not reliable?
                    Bdot

                    Sorry, __local was meant to be __private (for both i and f). I changed that.

                    I also added the barrier(CLK_GLOBAL_MEM_FENCE) as suggested after f++.

                    The problem remains.

                    The kernel is just some test that I wrote after seeing a problem in my application which correctly uses __private vars. There, I also use a host-initialized in-out-buffer.

                     

                      • atomic_inc not reliable?
                        himanshu.gautam

                        Bdot,

                        I dont think using barrier(CLK_GLOBAL_MEM_FENCE) is of any use as f is now a register and every thread only modifies its own copy.

                        Can you explaing how you are checking the atomic behaviour. I think the assignments res[i*3 + 1] = f; ....

                        also have the atomic issue as again every thread will try to write to the same location.Correct should be like as attached. You can see the same value returned by all threads

                        #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable #define ATOMIC_INC(x) atomic_inc(&x) //#define ATOMIC_INC(x) atom_inc(&x) //#define ATOMIC_INC(x) ((x)++) __kernel void test_k( __global uint *res) { __local uint i,f; f = get_global_id(0); if (f==0) { for(i=0;i<32;i++) res[i]=0; } f++; // let the reported results start with 1 i=ATOMIC_INC(res[0]); printf("value at res[0] %d", res[0]); }

                          • atomic_inc not reliable?
                            Bdot

                            The intended behaviour is that each thread gets its own unique index i. In my application only a very small fraction of the threads ever reaches this point - most of the threads do not have any result to report. I want to save the bandwidth of copying a million int's back to the CPU when I usually have less than 3 results.

                            The atomic_inc is supposed to return the old value of res[0], which is then used by each thread as an index into the result buffer. The total number of threads that ever reached the atomic_int should be in res[0] after the kernel finished. Then I copy the buffer (clEnqueueReadBuffer) and examine the results on the CPU.

                            The res[i*3 + 1] = f; ... in this test kernel is just a placeholder so I can see which thread wrote those bytes. The real application will of course write the real result.

                             

                             

                              • atomic_inc not reliable?
                                afo

                                Hi,

                                I had a similar problem; I found that the "if" to check if the result needs to be reported gives a huge performance penalty in the computational thread; so I splitted the kernel in two: the first one computes all the results and stores them in a large enough buffer (in device memory). Then a second kernel takes the results one by one, check their usefullness and write the reportable ones in a second buffer where the offset is controlled with an atomic inc.

                                This thread could be usefull for you: http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=135097&forumid=9

                                By the way: it's better to initialize the buffer with zeros in CPU and transfer it to the GPU than check if get_global_id(0) is zero and initialize the buffer in the GPU (at least for me).

                                best regards,

                                Alfonso

                                  • atomic_inc not reliable?
                                    Bdot

                                    I just installed the 11.7 drivers, and now all problems vanished. Both atom_inc and atomic_inc now work as expected on both the CPU and the GPU.

                                     

                                    Thanks a lot for your help!

                                    I'll try the splitted kernel as well although I assume it will not speed up normal operation as my application most of the time has no results to report, sometimes 1 and very rarely more than one.