Hi,
Yes, you should use atomics. As far as I know, there is a special atomic instruction that uses the faster GDS memory (atom_inc() vs. atomic_inc()), but I may be wrong about that. Test both of them anyway.
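To make the atomics suggestion concrete, here is a minimal sketch of a kernel that uses atomic_inc() to hand out unique output slots. The kernel name compact_positive and its arguments are made up for illustration. Everything inside the #ifndef is a hypothetical host-side shim (not part of OpenCL) so the same source can be sanity-checked as plain C. To compare against atom_inc(), swap the call and enable the cl_khr_global_int32_base_atomics extension, then time both on your hardware.

```c
/* Host-side shim (an assumption for illustration only): lets the kernel below
   compile as plain C. A real OpenCL compiler defines __OPENCL_VERSION__ and
   skips this whole section. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t fake_gid;  /* hypothetical stand-in for the work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
/* Serial stand-in, NOT atomic: returns the old value like the OpenCL built-in. */
static int atomic_inc(volatile int *p) { return (*p)++; }
#endif

/* Copies every positive element of `in` to `out`. Each work-item that has a
   positive value reserves its own slot by atomically incrementing `count`.
   `count` must be zeroed by the host before launch; the order of elements in
   `out` is not deterministic on a real device. */
__kernel void compact_positive(__global const int *in,
                               __global int *out,
                               volatile __global int *count)
{
    size_t i = get_global_id(0);
    if (in[i] > 0) {
        int slot = atomic_inc(count);  /* returns the value before the increment */
        out[slot] = in[i];
    }
}
```

Since atomic_inc() returns the old counter value, no two work-items can get the same slot, and after the kernel finishes count holds the number of elements written.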
GCN chips have hardware support for global synchronization. I've managed to sync 8 wavefronts/CU at a rate of 400 kHz, so it's really fast, wasting only a few hundred cycles.
It's called Global Wave Sync (GWS). Unfortunately, I couldn't find anything on Google about it in relation to OpenCL.