OpenCL's execution model is concurrent.
Apart from other obvious implications, this means there are no ordering guarantees between work items. It seems quite obvious ...
Atomics guarantee a certain local ordering, but not a total ordering. Global ordering, across all work items, can only be enforced by splitting the work into separate kernel invocations.
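To make the point about separate kernel invocations concrete, here is a rough sketch in plain Python rather than OpenCL. It only simulates the idea: `launch`, `one_launch` and `two_launches` are made-up names, and a "kernel invocation" is modelled as running every work item once in an arbitrary sequential order. Real hardware may also run work items truly simultaneously, which this does not model.

```python
import random

def launch(kernel, n, rng):
    """Simulate one kernel invocation: every work item runs once, in arbitrary order."""
    ids = list(range(n))
    rng.shuffle(ids)  # the "hardware" picks whatever order it likes
    for gid in ids:
        kernel(gid)

def one_launch(n, seed):
    """Write and read in the same invocation: the neighbour may not be written yet."""
    rng = random.Random(seed)
    squares = [None] * n
    result = [None] * n

    def fused(gid):
        squares[gid] = gid * gid
        result[gid] = squares[(gid + 1) % n]  # depends on another work item

    launch(fused, n, rng)
    return result

def two_launches(n, seed):
    """Split at the dependency: the second invocation starts after the first is done."""
    rng = random.Random(seed)
    squares = [None] * n
    result = [None] * n

    def write_squares(gid):
        squares[gid] = gid * gid

    def read_neighbour(gid):
        result[gid] = squares[(gid + 1) % n]

    launch(write_squares, n, rng)   # all writes complete here
    launch(read_neighbour, n, rng)  # so every read sees a written value
    return result
```

In this model, `one_launch` always leaves at least one `None` in the result for `n >= 2`, because whichever work item happens to run first reads a neighbour that has not run yet. `two_launches` gives the same answer for every seed, because the only thing it relies on is the boundary between the two invocations.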
In your example, some hypothetical hardware could execute ALL work items for all indices at the same time and fully conform to the specification. Another bit of hardware (e.g. a single-core CPU) could execute the work items one at a time and also fully conform to the specification. Or anything in between.
The first hardware could not possibly produce the ordering you're asking for; the second is not required to, but probably would.
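A small simulation of those two kinds of hardware, again in plain Python and only as an illustration. The `run` function and the slot-claiming "kernel" are hypothetical, not taken from your code; the counter increment stands in for something like an atomic increment in a real kernel. Each schedule is a sequential order, so the fully parallel case is approximated by a shuffled order.

```python
import random

def run(schedule, n):
    """Run n work items in the given order; each claims the next free output slot."""
    counter = 0
    slots = [None] * n
    for gid in schedule:
        slot = counter      # stands in for an atomic increment: slots are unique...
        counter += 1
        slots[slot] = gid   # ...but which work item gets which slot depends on the schedule
    return slots

n = 6

# "Single-core CPU" style: work items happen to run in index order.
in_order = run(range(n), n)

# Some other conforming schedule: an arbitrary permutation of the work items.
shuffled_ids = list(range(n))
random.Random(1).shuffle(shuffled_ids)
any_order = run(shuffled_ids, n)

print(in_order)           # matches index order only because of this schedule
print(any_order)          # same set of ids, in whatever order they ran
print(sorted(any_order))  # every work item still got exactly one slot
```

Both runs are equally valid under the model described above: the atomic-style counter makes sure no two work items share a slot, but it says nothing about which work item arrives first.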