Good try, I must say.
But how work-items are executed inside a compute unit is entirely implementation-dependent. Even if you run just 256 threads, i.e. one work-group, you cannot tell whether all 4 wavefronts will execute round-robin, or whether one wavefront will stay stuck in the while loop while the others keep waiting for it.
As of now, global synchronization can only be implemented by splitting the work across separate kernels.
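To illustrate the idea, here is a minimal sketch in plain Python (a toy model, not OpenCL code). It assumes a device that can hold only `num_slots` work-groups at a time and treats the end of each launch as the synchronization point. The names `launch`, `phase1`, `phase2` and `run_two_phase` are made up for this example. In a real program the two phases would be two kernels enqueued one after the other on an in-order command queue.

```python
def launch(kernel, num_groups, num_slots, data):
    """Run one 'kernel' to completion, at most num_slots work-groups at a time."""
    for start in range(0, num_groups, num_slots):
        stop = min(start + num_slots, num_groups)
        for gid in range(start, stop):
            kernel(gid, data)
    # Returning from this function models the kernel boundary:
    # every work-group of this launch has finished.


def phase1(gid, data):
    # Before the sync point: each work-group writes its own partial result.
    data["partial"][gid] = gid + 1


def phase2(gid, data):
    # After the sync point: each work-group may read every partial result.
    data["out"][gid] = sum(data["partial"])


def run_two_phase(num_groups, num_slots):
    data = {"partial": [0] * num_groups, "out": [0] * num_groups}
    launch(phase1, num_groups, num_slots, data)  # first kernel
    launch(phase2, num_groups, num_slots, data)  # second kernel sees all of phase1
    return data["out"]
```

Because no work-group ever waits on another inside a launch, this works for any number of work-groups, including more than the device can hold at once.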
I hope it is clear.
This will only work if you run at most enough work-groups to fill the chip once. Say the device has N SIMDs and each work-group takes up all the resources on a SIMD. A launch of N work-groups should execute this code correctly. If N+1 work-groups are launched, the first N work-groups will loop waiting for scratch to hit nThreads. The last work-group will never get scheduled, because the previous work-groups have not finished and there are no resources left, so scratch never reaches nThreads and the kernel hangs.
I hope this explains why your solution is not fully generic.
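The scenario above can be reproduced with a small toy scheduler in Python. This is only a sketch under simplifying assumptions: each work-group occupies exactly one slot (one SIMD), a resident work-group bumps the counter once and then spins, and the counter tracks work-groups rather than individual threads. `run_spin_barrier` and its parameters are hypothetical names for this example.

```python
def run_spin_barrier(num_groups, num_slots, max_steps=1000):
    """Return True if every work-group gets past the spin barrier,
    False if the simulation stalls (the kernel would hang)."""
    pending = num_groups   # work-groups not yet scheduled
    resident = 0           # work-groups currently holding a slot
    arrived = 0            # stands in for the 'scratch' counter
    finished = 0

    for _ in range(max_steps):
        # Schedule waiting work-groups while free slots remain.
        while pending > 0 and resident < num_slots:
            pending -= 1
            resident += 1
            arrived += 1   # the group increments the counter, then spins

        # Spinning groups can only leave once everyone has arrived.
        if arrived == num_groups:
            finished += resident
            resident = 0

        if finished == num_groups:
            return True

    # Slots are full of spinning groups and nothing can be scheduled.
    return False
```

With 4 slots, `run_spin_barrier(4, 4)` completes, while `run_spin_barrier(5, 4)` stalls: the fifth work-group never gets a slot, so the counter stays at 4.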
Yeah, that makes perfect sense. I had never thought of it that way. I guess I had assumed (wrongly) that all work-groups were executed concurrently.