Even if someone knew the answer, I would strongly advise against using this kind of synchronisation, as it is heavily device dependent. Wavefront execution should be treated as unordered, and you should only assume that one calculation happens before or after another at the work-group level, and only when that ordering is backed by sync commands.
Let me tell you why. The upcoming Radeon 7000 series will have four 16-wide SIMD units in each Compute Unit, so four wavefronts will execute at the same time. You have no way of knowing at any given moment whether these four wavefronts come from the same work-group or from different work-groups issued to the same CU.
Going below API-supported syncing will make your application unportable from OS to OS, vendor to vendor, or even device to device. Something that works on your device might not work the same way on another device from the same vendor, not to mention CPUs, which handle things in yet another way entirely.
I know that on NV cards there is a strict rule as to how warps execute, and I have seen code that utilizes this implicit synchronization, but I HIGHLY advise against writing code that relies on the wavefront scheduling scheme. When writing parallel algorithms, assume that work-item execution is as chaotic as possible within the explicit boundaries imposed by your sync commands.
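To make the "explicit boundaries" idea concrete, here is a rough CPU-side analogy in Python, not real device code. It assumes one thread per work-item, a plain list standing in for local memory, and threading.Barrier standing in for a work-group barrier such as barrier(CLK_LOCAL_MEM_FENCE). The function name workgroup_sum and the power-of-two group size are my own assumptions for the sketch.

```python
import threading


def workgroup_sum(values):
    """Sum `values` with one thread per element, mimicking a work-group reduction.

    Hypothetical helper for illustration; assumes len(values) is a power of two.
    """
    n = len(values)                  # work-group size
    scratch = list(values)           # stands in for local (shared) memory
    barrier = threading.Barrier(n)   # stands in for a work-group barrier

    def work_item(lid):
        stride = n // 2
        while stride > 0:
            if lid < stride:
                # Reads the upper half, writes the lower half: no overlap
                # between work-items within a single step.
                scratch[lid] += scratch[lid + stride]
            # Every work-item waits here, so the next step only reads
            # partial sums that are guaranteed to be complete.
            barrier.wait()
            stride //= 2

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return scratch[0]


if __name__ == "__main__":
    print(workgroup_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

Note that nothing in the sketch depends on which threads happen to run together or in what order between two barriers. That is the property you want from a real kernel too: correctness comes from the barrier alone, so it does not matter how the hardware groups work-items into wavefronts or warps.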