2 Replies Latest reply on Feb 3, 2012 10:36 AM by CaptGreg

    A question about atomic_add


      I have a simple test for the atomic_add function as follows:


      #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

      __kernel void Atomic_op(global int *input, volatile global int *output)
      {
          int val = input[get_global_id(0)];
          atomic_add(&output[0], val);
      }


      __kernel void ssum(global int *input, volatile global int *output, int len)
      {
          for (int i = 0; i < len; i++)
              output[0] += input[i];
      }


      The kernel sums the values of a large array. I ran kernel 1 on the GPU device with a global size of 1024 * 1024, and it took 48 ms (68 ms on the CPU device).

      For kernel 2, I enqueued a single task to do the same job with len = 1024 * 1024 on the CPU device; it took 4.1 ms.

      Then I wrote C++ code to do the same thing; it took 1.3 ms.

      I don't see any advantage to the atomic operation on my laptop (A6-3400). Does anybody know why? Is my GPU just too slow for this kind of work?

        • Re: A question about atomic_add

          It's just not a problem that makes sense to solve that way on the GPU. Atomics are not fast compared with non-atomic memory operations. Straight accumulation in a loop will always be faster (and your second loop may well be compiled to accumulate in a register rather than writing to memory directly). The GPU might be faster if its memory bus and latency hiding let it access memory more efficiently, but this is an inherently memory-bound problem, so I wouldn't assume that.


          The most efficient way to do it would be to have each core perform a loop that sums some section of the array (much as your second example does) and then uses an atomic just once at the end of its computation to write that out. When you do that on a SIMD unit you want each lane to sum, then to add the lanes together (atomic in LDS or a reduction tree) and then one lane does a global atomic.
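          A sketch of that pattern in OpenCL C, to make the advice concrete. The kernel name, the `scratch` local buffer, and the strided loop are my own illustration, not code from the original post; it also assumes a power-of-two work-group size and that the host zeroes `output[0]` before launch.

          ```c
          #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

          /* Sketch: each work-item privately sums a strided slice of input,
           * the work-group then reduces those partial sums in local memory,
           * and a single work-item per group does one global atomic_add.   */
          __kernel void group_sum(global const int *input,
                                  volatile global int *output,
                                  int len,
                                  local int *scratch)
          {
              int gid    = get_global_id(0);
              int stride = get_global_size(0);

              /* Private accumulation: no atomics, no contention. */
              int partial = 0;
              for (int i = gid; i < len; i += stride)
                  partial += input[i];

              /* Tree reduction within the work-group (power-of-two size assumed). */
              int lid = get_local_id(0);
              scratch[lid] = partial;
              barrier(CLK_LOCAL_MEM_FENCE);
              for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
                  if (lid < offset)
                      scratch[lid] += scratch[lid + offset];
                  barrier(CLK_LOCAL_MEM_FENCE);
              }

              /* One global atomic per work-group instead of one per element. */
              if (lid == 0)
                  atomic_add(&output[0], scratch[0]);
          }
          ```

          With, say, 256 work-items per group, this turns 1024*1024 contended global atomics into 4096 of them, one per group.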

          • Re: A question about atomic_add

            Your test program is working fine.


            You are essentially doing 1024*1024 adds.


            However, using atomics, one core proceeds while all the other cores wait for it to finish, because the simple test program does nothing but the atomic add. All the cores want to do the same atomic at the same time. With the overhead of sequencing one core at a time, 1024*1024 times, the test program runs very, very slowly.


            If your program is doing some real work, and the moments at which a core locks the other cores out of an atomic operation are spread out because all the cores are busy doing something else, atomic operations proceed with very little overhead.
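            The point above can be demonstrated on the host side with a minimal plain-C sketch (C11 threads and atomics, not OpenCL; the names and the 4-thread split are my own assumptions): each thread accumulates into a private variable and touches the shared atomic only once, instead of once per element.

            ```c
            #include <stdio.h>
            #include <stdatomic.h>
            #include <threads.h>

            #define N        (1024 * 1024)
            #define NTHREADS 4

            static int input[N];
            static atomic_int total = 0;

            struct chunk { int begin, end; };

            static int worker(void *arg)
            {
                struct chunk *c = arg;
                int partial = 0;                    /* private accumulator: no contention */
                for (int i = c->begin; i < c->end; i++)
                    partial += input[i];
                atomic_fetch_add(&total, partial);  /* one atomic per thread, not per element */
                return 0;
            }

            int main(void)
            {
                for (int i = 0; i < N; i++)
                    input[i] = 1;

                thrd_t threads[NTHREADS];
                struct chunk chunks[NTHREADS];
                int step = N / NTHREADS;
                for (int t = 0; t < NTHREADS; t++) {
                    chunks[t].begin = t * step;
                    chunks[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * step;
                    thrd_create(&threads[t], worker, &chunks[t]);
                }
                for (int t = 0; t < NTHREADS; t++)
                    thrd_join(threads[t], NULL);

                printf("%d\n", atomic_load(&total));
                return 0;
            }
            ```

            Doing `atomic_fetch_add(&total, input[i])` inside the loop instead would serialize the threads on every element, which is exactly what the original test kernel does.
            
            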
