I have a simple test for the atomic_add function, as follows:
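#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
// Kernel 1: every work-item atomically adds its own element to *output.
__kernel void Atomic_op(global int *input, volatile global int *output) {
    int val = input[get_global_id(0)];
    atomic_add(output, val);
}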
// Kernel 2: a single work-item sums the whole array serially.
__kernel void ssum(global int *input, volatile global int *output, int len) {
    for (int i = 0; i < len; i++)
        *output += input[i];
}
Both kernels sum the values of a large array. I ran kernel 1 on the GPU device with a global work size of 1024 * 1024, and it took 48 ms (68 ms on the CPU device).
For kernel 2, I enqueued a single task on the CPU device to do the same job with len = 1024 * 1024, and it took 4.1 ms.
Then I wrote plain C++ code to compute the same sum, and it took 1.3 ms.
I don't see any advantage in using atomic operations on my laptop (A6 3400). Does anybody know why? Is my GPU simply too slow for this kind of work?