
A question about atomic_add

Question asked by catmoslin on Feb 3, 2012
Latest reply on Feb 3, 2012 by CaptGreg

I have a simple test for the atomic_add function, as follows:


#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

// Kernel 1: every work-item atomically adds its own element to output[0].
__kernel void Atomic_op(global int *input, volatile global int *output)
{
    int val = input[get_global_id(0)];
    atomic_add(&output[0], val);
}

// Kernel 2: a single work-item sums the whole array serially.
__kernel void ssum(global int *input, volatile global int *output, int len)
{
    for (int i = 0; i < len; i++)
        output[0] += input[i];
}


Both kernels are used to sum the values of a large array. I ran kernel 1 on the GPU device with a global work size of 1024 * 1024, and it took 48 ms (68 ms on the CPU device).

For kernel 2, I enqueued a task to do the same job with len = 1024 * 1024 on the CPU device, and it took 4.1 ms.

Then I wrote C++ code to do the same sum, and it took 1.3 ms.
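
The C++ code itself is not included in this post. A minimal sketch of what such a serial baseline could look like, assuming a plain sum over a std::vector<int> timed with std::chrono, is shown here. The names serial_sum and time_serial_sum_ms are illustrative and not taken from the original program.

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Hypothetical helper: serial sum of the array, the host-side
// counterpart of kernel 2 (one thread walks the whole input).
int serial_sum(const std::vector<int>& input) {
    return std::accumulate(input.begin(), input.end(), 0);
}

// Hypothetical helper: runs serial_sum once, stores the sum in `result`,
// and returns the elapsed wall-clock time in milliseconds.
double time_serial_sum_ms(const std::vector<int>& input, int& result) {
    const auto start = std::chrono::steady_clock::now();
    result = serial_sum(input);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_serial_sum_ms on a vector of 1024 * 1024 elements gives a number comparable in kind to the 1.3 ms above, although the actual value depends on the compiler, optimization flags, and machine.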

I don't see any advantage in using atomic operations on my laptop (A6 3400). Does anybody know why? Is my GPU just too slow for this work?