I hope you can help me with the following hardware-dependent scheduling problem. The number of active wavefronts of a kernel is limited by various hardware resources, such as the number of scalar and vector registers or the amount of local memory.
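To make clear what I mean by these limits, here is a minimal sketch of how I picture the calculation. The function name and all budget values (registers per SIMD, local memory per compute unit, SIMDs per compute unit, hardware cap) are placeholders I assumed for illustration, not figures taken from any AMD documentation, and the real hardware probably allocates in coarser granules.

```python
def max_active_wavefronts(vgprs_per_item, sgprs_per_wave, lds_bytes_per_group,
                          waves_per_group, vgpr_budget=256, sgpr_budget=512,
                          lds_budget=65536, simds_per_cu=4, hw_cap=10):
    """Estimate active wavefronts per SIMD (all budgets are assumed values).

    Each resource gives its own upper bound; the smallest bound wins.
    Register counts are expected to be at least 1.
    """
    # Bound from vector registers: how many wavefronts fit into the budget.
    by_vgpr = vgpr_budget // vgprs_per_item
    # Bound from scalar registers.
    by_sgpr = sgpr_budget // sgprs_per_wave
    if lds_bytes_per_group > 0:
        # Local memory is shared per compute unit: count the work-groups
        # that fit, convert to wavefronts, and spread them over the SIMDs.
        groups_per_cu = lds_budget // lds_bytes_per_group
        by_lds = (groups_per_cu * waves_per_group) // simds_per_cu
    else:
        # A kernel without local memory is not limited by it.
        by_lds = hw_cap
    return min(hw_cap, by_vgpr, by_sgpr, by_lds)


if __name__ == "__main__":
    # Hypothetical kernel: 64 VGPRs, 48 SGPRs, 16 KiB local memory per
    # work-group, 4 wavefronts per work-group.
    print(max_active_wavefronts(64, 48, 16384, 4))
```

With these assumed numbers, the vector registers and the local memory are the limiting resources, while the scalar registers are not.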
Also, a set of threads that are currently idle resides on the GPU and also reserves hardware units. (Which ones?) This is necessary so that they can be scheduled in quickly to hide memory latencies. The threads that do not fit onto the GPU, even in idle mode, get dispatched to the GPU when others have finished their work.
So here is what I do not understand: how do the threads get synchronized at a global barrier, when the threads on the GPU can't continue and the threads that are not yet dispatched to the GPU can't start? Is the consequence that the global barrier is only consistent for the threads resident on the GPU, or does the driver do some tricks to schedule everything around it? I think the second solution would be very slow, if it is really done.
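The situation I am worried about can be sketched as a toy model. It assumes that a work-group keeps its hardware slot until it has finished and that the driver does no swapping; both are my assumptions, and the function name and the slot count are hypothetical.

```python
def global_barrier_completes(total_groups, resident_slots):
    """Toy model of a barrier that every work-group must reach.

    Assumption: a group occupies its slot until it finishes, and it can
    only finish after the barrier has been released (no swapping out).
    """
    pending = total_groups   # groups not yet dispatched to the GPU
    waiting = 0              # groups on the GPU, stalled at the barrier
    # Dispatch groups while free slots remain; each one runs up to the
    # barrier and then waits there without releasing its slot.
    while pending > 0 and waiting < resident_slots:
        pending -= 1
        waiting += 1
    # The barrier can only be released once no group is left outside.
    return pending == 0


if __name__ == "__main__":
    print(global_barrier_completes(8, 8))   # every group fits onto the GPU
    print(global_barrier_completes(9, 8))   # one group never gets a slot
```

In this model the barrier only completes when all work-groups fit onto the GPU at the same time, which is exactly the case I am unsure about.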
Please correct my understanding of OpenCL work scheduling on AMD GPUs. I hope I described the problem clearly.
With best regards,