This just isn't a problem that it makes sense to solve that way on the GPU. Atomics are not fast compared with non-atomic memory operations. Plain accumulation in a loop will always be faster (and your second loop may well be compiled to accumulate in a register rather than writing to memory directly). The GPU might still win if its memory bus and latency hiding let it access memory more efficiently, but this is an inherently memory-bound problem, so I wouldn't assume that.
The most efficient way to do it would be to have each core run a loop that sums some section of the array (much as your second example does) and then use an atomic just once, at the end of its computation, to write the result out. When you do that on a SIMD unit, you want each lane to sum its own elements, then add the lanes together (via an atomic in LDS or a reduction tree), and then have a single lane perform one global atomic.
Your test program is working fine.
You are essentially doing 1024*1024 adds.
However, with atomics only one core can proceed while all the other cores wait for it to finish, because the simple test program does nothing but the atomic add: every core wants to perform the same atomic at the same time. With the overhead of sequencing one core at a time, 1024*1024 times, the test program runs very, very slowly.
If your program is doing some real work, so that the moments at which each core locks the others out of the atomic operation are effectively randomized by all the cores being busy with something else, atomic operations proceed with very little overhead.