When I use atom_add(), the kernel takes about 16 ms; without it, the kernel takes about 1.6 ms. atom_add() is executed about 1200 times in total, so I think it is far too expensive. Are there any tricks to make the atom_add() operation faster? Or is there a way to pack scattered data, one element after another, into a contiguous array without using atomic operations?
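
To make the pattern concrete, here is a minimal sketch of the kind of atomic-counter packing described above. It is plain C with C11 atomics standing in for the OpenCL kernel, not the actual kernel: the function name `compact_with_atomic`, the "keep non-zero elements" predicate, and the serial loop in place of the NDRange are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel body: each "work-item" i inspects its
 * element and, if it should be kept, reserves an output slot by atomically
 * incrementing a shared counter (the role atom_add() plays in the kernel). */
static size_t compact_with_atomic(const int *in, size_t n, int *out)
{
    atomic_uint counter = 0;                 /* shared output cursor */

    for (size_t i = 0; i < n; ++i) {         /* loop stands in for the NDRange */
        if (in[i] != 0) {                    /* assumed predicate: keep non-zero */
            /* Returns the old value, like atom_add(&counter, 1). */
            unsigned slot = atomic_fetch_add(&counter, 1u);
            out[slot] = in[i];
        }
    }
    return atomic_load(&counter);            /* number of elements written */
}
```

Every kept element goes through the same counter, so all of those increments contend on one memory location. On a real device the order of the packed elements is also not guaranteed, unlike in this serial sketch.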