Originally posted by: zhuzxy
Hello,
When my kernel calls atom_add(), it takes about 16 ms; without the call, it takes about 1.6 ms. atom_add() is executed only about 1200 times in total, so this cost seems far too high. Are there any tricks to make atom_add() faster? Or is there a way to pack scattered data into a contiguous array, one element after another, without using atomic operations?
Global atomic operations are very expensive. Atomic counters are much faster, but they are not supported on integrated GPUs. See the AtomicCounters sample that ships with the SDK.
If possible, use atom_add on local memory within each work-group, then sum the per-group results on the CPU.
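As a rough illustration of the local-counter suggestion, here is a minimal sketch in plain C that emulates the pattern on the CPU; it is not OpenCL kernel code. It assumes the goal is to keep only the positive values of an input array. The names run_group and compact_positive, the "positive values" filter, and the per-group counts buffer are all hypothetical and chosen only for illustration. In a real kernel, local_count would presumably be a __local variable updated with a local atom_add, and each work-group would store its count in a buffer that the host reads back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Emulates one work-group handling items [start, end) of `in`.
 * Kept values are written to the group's own segment of `out`, beginning at
 * out[start], so no global atomic is needed. Returns how many were kept. */
static size_t run_group(const int *in, int *out, size_t start, size_t end)
{
    atomic_size_t local_count;            /* stands in for a __local counter */
    atomic_init(&local_count, 0);

    for (size_t i = start; i < end; ++i) {    /* one iteration = one work-item */
        if (in[i] > 0) {
            /* A local atom_add in a real kernel: reserve a slot in the group. */
            size_t slot = atomic_fetch_add(&local_count, 1);
            out[start + slot] = in[i];
        }
    }
    return atomic_load(&local_count);
}

/* Hypothetical driver. `counts` must hold one entry per group.
 * Returns the total number of values packed at the front of `out`. */
size_t compact_positive(const int *in, int *out, size_t *counts,
                        size_t n, size_t group_size)
{
    assert(group_size > 0);

    /* "Kernel launch": every group fills its own segment and its own count. */
    size_t g = 0;
    for (size_t start = 0; start < n; start += group_size, ++g) {
        size_t end = (n - start < group_size) ? n : start + group_size;
        counts[g] = run_group(in, out, start, end);
    }

    /* "Sum on CPU": accumulate the per-group counts and slide each segment
     * down so that the kept values become contiguous. */
    size_t total = 0;
    for (size_t k = 0; k < g; ++k) {
        memmove(out + total, out + k * group_size, counts[k] * sizeof *out);
        total += counts[k];
    }
    return total;
}
```

Calling compact_positive on {3, -1, 4, 0, -5, 9, 2, -6, 7, 1} with a group size of 4 returns 6 and leaves 3, 4, 9, 2, 7, 1 at the front of out. On a GPU, the order inside each group would depend on which work-item wins each local atomic, so only the packing should be relied on, not the order.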