When I call clEnqueueNDRangeKernel with globalThreads(A*a, B*b, C) and localThreads(a, b, 1), is it guaranteed that the third dimension of globalThreads advances only after all combinations of the first and second dimensions have executed? More specifically, will work items with get_global_id(2)=1 be scheduled only after every combination of get_global_id(0) and get_global_id(1) has executed with get_global_id(2)=0?
The reason I am asking is that I need an order in the execution: global_id=1 must not run before global_id=0, and so on.
OpenCL doesn't define any execution order of work items across work groups. I think the AMD OpenCL programming guide has some information about execution scheduling on the GPU, but be warned that relying on it is non-portable and easily broken.
IMHO, even if work items along the Z axis are executed in a predictable fashion, there will most likely be a moment when work items with Z=1 begin executing while some with Z=0 are still running.
OpenCL's execution model is concurrent.
Apart from other obvious implications, this means there are no ordering guarantees. It seems quite obvious ...
Atomics can be used to guarantee certain local ordering, but not total ordering. Global ordering can only be controlled by separate kernel invocations.
In your example, some hypothetical hardware could execute ALL work items for all indices at the same time and still conform to the specification. Another bit of hardware (e.g. a single-core CPU) could execute the work items one at a time and also conform to the specification. So could any combination in between.
The first hardware could not possibly conform to the ordering you're asking for, and the second need not but probably would.
Work items in OpenCL have no execution order guarantee. If you want some kind of ordering guarantee between threads in the same work group, you can use a barrier such as barrier( CLK_LOCAL_MEM_FENCE );
If you are trying to make all the threads in one slice of your problem finish before any thread in a different slice executes, then the easiest way to achieve this is one kernel execution per slice. You can't easily achieve this with a single kernel execution, but you can use atomic operations to impose some ordering on thread execution.
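As a rough sketch of the kernel-per-slice approach, the loop below issues one launch per Z slice. The names launch_per_slice, enqueue_fn, launch_log and record_launch are made up for illustration. The callback stands in for a real clEnqueueNDRangeKernel call with work_dim = 3, and the sketch assumes an in-order command queue (so each launch completes before the next begins) and OpenCL 1.1 or later (so a non-NULL global_work_offset is allowed and get_global_id(2) still returns the slice index).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one clEnqueueNDRangeKernel call with work_dim = 3.
   Returns 0 on success, mirroring CL_SUCCESS. */
typedef int (*enqueue_fn)(const size_t offset[3], const size_t global[3],
                          const size_t local[3], void *user);

/* Enqueue one launch per Z slice:
   global size (A*a, B*b, 1), local size (a, b, 1), global offset (0, 0, z). */
int launch_per_slice(size_t A, size_t a, size_t B, size_t b, size_t C,
                     enqueue_fn enqueue, void *user)
{
    assert(enqueue != NULL && a > 0 && b > 0);
    const size_t global[3] = { A * a, B * b, 1 };
    const size_t local[3]  = { a, b, 1 };

    for (size_t z = 0; z < C; ++z) {
        const size_t offset[3] = { 0, 0, z };
        int err = enqueue(offset, global, local, user);
        if (err != 0)
            return err;   /* stop at the first failed launch */
    }
    return 0;
}

/* Example callback that only records which slice each launch covers. */
typedef struct {
    size_t z[8];
    size_t count;
} launch_log;

int record_launch(const size_t offset[3], const size_t global[3],
                  const size_t local[3], void *user)
{
    launch_log *rec = user;
    (void)global;
    (void)local;
    if (rec->count >= 8)
        return -1;        /* log is full */
    rec->z[rec->count++] = offset[2];
    return 0;
}
```

In real host code the callback body would forward offset, global and local to clEnqueueNDRangeKernel as global_work_offset, global_work_size and local_work_size. On an out-of-order queue you would have to chain the launches with events instead.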
If you are convinced you need this pattern from one kernel, you can create some volatile global buffers, initialized to 0, that hold one pair of counters per slice. The 0th work item of each work group atomically increments the first counter before the work group executes and atomically increments the second counter after it completes. A work group operating on a new slice that depends on the previous slice must wait until the completion counter for the previous slice has reached the number of work groups in that slice before it executes, and you'd need to synchronize the other local threads in the work group so that they wait until the 0th thread has seen that the previous slice is complete.
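To see why the counter scheme is fragile, here is a small host-side C model of it, not kernel code. It keeps only the completion counter per slice, and the names slice_gate, slice_may_start and run_model are hypothetical. The model assumes a device that keeps at most `slots` work groups resident at once and never preempts a waiting group; OpenCL neither requires nor forbids that behaviour.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLICES 8
#define MAX_GROUPS 32

/* Host-side stand-in for the global counter buffer:
   one "finished" count per slice. */
typedef struct {
    int finished[MAX_SLICES];
    int groups_per_slice;
} slice_gate;

/* The check work item 0 would spin on: slice z may run once every
   work group of slice z-1 has bumped that slice's counter. */
int slice_may_start(const slice_gate *g, size_t z)
{
    return z == 0 || g->finished[z - 1] == g->groups_per_slice;
}

/* arrival[i] is the slice of the i-th work group the device picks up.
   A resident group that may not start keeps its slot, as a spinning
   group would. Returns the number of groups that completed; a result
   smaller than n means the remaining groups wait forever. */
size_t run_model(slice_gate *g, const size_t *arrival, size_t n, size_t slots)
{
    size_t resident[MAX_GROUPS];
    size_t n_res = 0, next = 0, completed = 0;
    assert(slots > 0 && slots <= MAX_GROUPS);

    for (;;) {
        /* Fill free slots in arrival order. */
        while (n_res < slots && next < n) {
            assert(arrival[next] < MAX_SLICES);
            resident[n_res++] = arrival[next++];
        }
        if (n_res == 0)
            return completed;            /* every group ran */

        /* Look for a resident group whose slice is allowed to run. */
        size_t i = 0;
        while (i < n_res && !slice_may_start(g, resident[i]))
            ++i;
        if (i == n_res)
            return completed;            /* all resident groups are stuck */

        g->finished[resident[i]] += 1;   /* an atomic increment in a kernel */
        resident[i] = resident[--n_res]; /* free the slot */
        ++completed;
    }
}
```

With groups_per_slice = 2 and two slots, the arrival order {0, 0, 1, 1} completes all four groups, while {1, 1, 0, 0} completes none, because both slots are held by slice-1 groups waiting on slice-0 groups that never become resident. Nothing in OpenCL says which arrival order you get.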