If I have a two-dimensional workgroup (say 4x64), how are the work-items scheduled to execute? Do the work-items of the first row execute first, or the first 16 columns?
Even if someone knew the answer, I would strongly advise against relying on this kind of implicit synchronisation, as it is heavily device-dependent. Wavefront execution order should be treated as arbitrary. The only safe assumption that some calculations have completed before others is at workgroup level, backed by sync commands.
Let me tell you why. The upcoming Radeon 7000 series will have four 16-wide SIMD units in each Compute Unit, so 4 wavefronts will execute at the same time. At the moment you have no way of knowing whether these 4 wavefronts come from the same workgroup or belong to different workgroups issued to the same CU.
Going below API-supported syncing will render your application unportable, whether from OS to OS, vendor to vendor, or even device to device. Something that works on your device might not work the same way on another device from the same vendor, not to mention CPUs, which handle things differently yet again.
I know that on NV cards there is a strict rule as to how warps execute, and I have seen code that utilizes this implicit synchronization, but I HIGHLY advise against writing code that relies on the wavefront scheduling scheme. When writing parallel algorithms, assume that work-item execution is as unordered as possible within the explicit boundaries imposed by your sync commands.
I would tend to agree with you there, with the caveat that getting peak performance out of the hardware requires writing wave-centric code, because barriers have too much overhead. The same reasoning applies on NVIDIA chips. Of course, if you are going to risk doing that, you have to accept that the code is non-portable.
In answer to the OP, I'm not entirely sure that "row" and "column" have a strict definition here. If we are thinking row-major, then get_local_id(0) is the column address: dimension 0 is packed most tightly into a wave, then dimension 1, then dimension 2. This is important information irrespective of synchronisation, because it determines how to generate efficient addressing. When you generate loads you want (get_local_id(0)*4 bytes + base) as your address, and that's true in a 2D workgroup too.
I'm not sure that there is anything in the OpenCL specification that requires that, though, so optimisation for data locality is as device-specific as optimising for synchronisation. The difference being that it is correctness-portable if not performance-portable.
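To make the packing described above concrete, here is a minimal sketch in plain C (not a kernel) of how a 2D local ID would map to a wavefront if dimension 0 varies fastest. The helper names `linear_local_id`, `wave_index` and `load_offset_bytes` are made up for illustration, and both the 64-wide wave and the dimension-0-fastest ordering are assumptions about the device rather than anything this thread establishes as guaranteed.

```c
#include <stddef.h>

/* Assumed wavefront width: 64 work-items, as on the AMD GPUs discussed here.
   NVIDIA warps are 32 wide; adjust accordingly. */
#define WAVE_SIZE 64

/* Hypothetical helper: flatten a 2D local ID with dimension 0 varying fastest.
   lid0/lid1 stand in for get_local_id(0)/get_local_id(1), and local_size0
   for get_local_size(0). The ordering itself is an assumption. */
int linear_local_id(int lid0, int lid1, int local_size0)
{
    return lid1 * local_size0 + lid0;
}

/* Hypothetical helper: the wavefront a work-item would land in, under the
   same assumption. */
int wave_index(int lid0, int lid1, int local_size0)
{
    return linear_local_id(lid0, lid1, local_size0) / WAVE_SIZE;
}

/* Byte offset for a 4-byte load indexed by dimension 0, i.e. the
   (get_local_id(0)*4 bytes + base) pattern: neighbouring work-items in
   dimension 0 touch neighbouring 4-byte words. */
size_t load_offset_bytes(int lid0, size_t base)
{
    return base + (size_t)lid0 * 4;
}
```

Under these assumptions, a workgroup with 64 work-items in dimension 0 and 4 in dimension 1 puts each "row" in its own wave: `wave_index(63, 0, 64)` is 0 and `wave_index(0, 1, 64)` is 1. If the 4 is in dimension 0 instead, a wave covers 16 consecutive values of dimension 1: `wave_index(3, 15, 4)` is 0 and `wave_index(0, 16, 4)` is 1. Either way, this is only useful for laying out memory accesses, not as a substitute for barriers.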