    Is this an effective way to achieve global synchronization on the GPU?

      I implemented a global barrier with atomic operations.

      My initial testing seems to indicate that it works.

      Is there a way to do the same without forcing the compiler into "complete path" memory mode?

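      global uint sema = 0;   /* arrival counter shared by all work-groups */

      /* One work-item per work-group signals arrival. */
      if( get_local_id(0) == 0 )
          atomic_inc( &sema );

      /* Spin until the counter is a multiple of num_groups. */
      while( sema % num_groups )
          if( get_local_id(0) == 0 )
              atomic_add( &sema, 0 );   /* no-op atomic, intended to force a fresh read */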
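
To make the intended behaviour concrete, here is a minimal host-side sketch of the same counter-based barrier in plain C11, with POSIX threads standing in for work-groups. It is not the kernel itself. The names `run_barrier_demo`, `group_main`, `arrived`, `all_seen` and `MAX_GROUPS` are made up for illustration, and the sketch models a single use of the barrier only. It also assumes every simulated group eventually gets to run; on a GPU the equivalent assumption is that all work-groups are resident at the same time, otherwise the spin loop can never finish.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_GROUPS 64                 /* hypothetical upper bound for the demo */

static atomic_uint sema;              /* arrival counter, like the global `sema` */
static unsigned num_groups;           /* number of simulated work-groups */
static int arrived[MAX_GROUPS];       /* written by each group before the barrier */
static int all_seen[MAX_GROUPS];      /* 1 if a group saw every write after the barrier */

/* One thread plays the role of local id 0 of one work-group. */
static void *group_main(void *arg)
{
    unsigned id = *(unsigned *)arg;

    arrived[id] = 1;                  /* work done before the barrier */

    atomic_fetch_add(&sema, 1);       /* signal arrival (atomic_inc in the kernel) */
    while (atomic_load(&sema) % num_groups != 0)
        sched_yield();                /* spin until every group has arrived */

    /* After the barrier, every group's pre-barrier write should be visible. */
    int ok = 1;
    for (unsigned i = 0; i < num_groups; ++i)
        ok &= arrived[i];
    all_seen[id] = ok;
    return NULL;
}

/* Returns 1 if all groups passed the barrier and saw all pre-barrier writes. */
int run_barrier_demo(unsigned groups)
{
    pthread_t threads[MAX_GROUPS];
    unsigned ids[MAX_GROUPS];

    if (groups == 0 || groups > MAX_GROUPS)
        return 0;

    num_groups = groups;
    atomic_store(&sema, 0);
    for (unsigned i = 0; i < groups; ++i) {
        arrived[i] = 0;
        all_seen[i] = 0;
    }

    for (unsigned i = 0; i < groups; ++i) {
        ids[i] = i;
        /* A missing participant would leave the others spinning forever. */
        if (pthread_create(&threads[i], NULL, group_main, &ids[i]) != 0)
            abort();
    }
    for (unsigned i = 0; i < groups; ++i)
        pthread_join(threads[i], NULL);

    int ok = 1;
    for (unsigned i = 0; i < groups; ++i)
        ok &= all_seen[i];
    return ok;
}
```

Calling `run_barrier_demo(4)` runs four simulated groups through the barrier once. The sketch uses sequentially consistent atomics for every read of the counter, which is the host-side analogue of polling through `atomic_add( &sema, 0 )`. It does not model reuse of the barrier, where a fast group could re-enter and increment the counter before slower groups have observed the previous multiple of `num_groups`.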