How can I force a workgroup to be swapped out for another?

I am working on partially ordering work group execution order using atomics and need a way to instruct one work group to wait on another workgroup. To achieve this, the waiting work group busy waits on an atomic value which the other work group will set. However, if the two work groups are assigned to the same SIMD queue then the waiting work group may block the work group it is waiting on thereby causing a deadlock. Thus, in the busy loop, the work group needs a way to say "swap me out and let other work groups take my place", i.e. yield to other workgroups.

Is there a way to do this on AMD hardware? I know this is not OpenCL spec stuff so I'm just playing around at the moment.
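For readers following along, the busy-wait described in this question can be sketched on the CPU. This is only an analogy, not GPU code: two host threads stand in for the two work groups, the name `run_ordered_pair` and the payload value are made up for the example, and `std::this_thread::yield()` stands in for the "swap me out" request, which standard OpenCL C does not provide.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: one thread waits on an atomic flag that the other sets.
// Returns the value the waiting thread observed once it was released.
int run_ordered_pair() {
    std::atomic<int> ready{0};  // the atomic the waiter spins on
    int payload = 0;            // data published by the signalling thread
    int observed = -1;

    std::thread waiter([&] {
        // Busy-wait, but give up the time slice on every failed check so
        // the signaller can run even if both threads share one core.
        while (ready.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        observed = payload;  // safe: ordered after the release store below
    });

    std::thread signaller([&] {
        payload = 42;
        ready.store(1, std::memory_order_release);
    });

    waiter.join();
    signaller.join();
    return observed;
}
```

On a CPU the operating system preempts the waiter sooner or later, so the yield only shortens the wait. The question is precisely about what happens when no such preemption or yield is available.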

5 Replies

This thing gives me the creeps. You're forcing a natively massively parallel environment into pipeline-like operation.

I encourage you to rethink your strategy. If there's a producer-consumer relationship, odds are you'll be better off keeping the data flowing within the same work-item (WI).

BUT if you want to go ahead this way, maybe you could branch on the atomic. This will make the designated workgroup burn execution time (hopefully you can make it do something useful in the meantime).

That said, rethink your strategy. Work groups should be free to go as they want.


I would love to let the work groups run free, but the problem cannot be expressed like that. I'm trying to think of a better way but no luck yet. For now, I'm having to do thousands of enqueues, which kills performance. Doing the calculation on the CPU is fast, but dragging the data off the GPU and then sending it back also kills performance.

MicahVillmow - any ideas to pervert the scheduling of work groups/waves?


Sounds like you're trying something so granular that the PCI communication kills performance. Forcing a work-group to be switched out is generally a very bad idea. For one, it is not portable, nor is it guaranteed that the solution will work on a future driver and/or hardware. I would also suggest a rethink.

To be a little more constructive: have you considered using OpenCL 2.0 and restricting yourself to hardware that is actually meant for this type of stuff? Handling O(1000) enqueues with dependencies encoded as atomics, and having GPU kernels react to changes in those atomics, sounds very much like work for Kaveri/Skylake or any architecture that implements platform-level atomics. Read the OpenCL 2.0 specs, especially the part about the various levels of SVM, where the highest level of SVM support is atomically-correct SVM. I suggest you rethink your workflow so that atomically-correct SVM is the minimal requirement for your application to actually work.

Without the actual hardware, you can always start developing with a CPU device, as the host and a CPU device can surely share atomically-correct SVM memory. Your kernels will be slower and, unless you explicitly use sub-devices, they will slow down the host process. Nonetheless, you can prove the correctness of your application.

Also, with OpenCL 2.0 you get features like dynamic parallelism (device-side enqueue), which might completely remove the need for atomic control of kernels.


From a GCN standpoint, two waves should not block one another even if they are scheduled on the same SIMD.

You can read more about it in Layla Mah's excellent presentation, "GCN Crash Course" (slides 26-27, though it does not explicitly address the question).

In general, each wavefront has its own program counter and can be executing a different instruction even if scheduled on the same SIMD. When scheduled on the same SIMD, two wavefronts can't issue the same type of instruction in the same cycle, but their execution does overlap.
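To make the point about per-wavefront program counters concrete, here is a deliberately simplified toy model in C++. It is not a description of how GCN hardware actually schedules waves; `ToyWave`, `Policy` and `simulate` are hypothetical names, and the model only contrasts a "run one wave to completion" policy with one that alternates issue slots between two resident waves.

```cpp
#include <array>

// Toy model: each wave carries its own program counter.
struct ToyWave {
    int pc = 0;         // 0 = at its first instruction, 1 = finished
    bool done = false;
};

enum class Policy { RunToCompletion, Interleaved };

// Wave 0 spins until `flag` is set; wave 1 sets `flag` and finishes.
// Returns the number of issue slots used until both waves are done,
// or -1 if the slot budget runs out first (the waiter never got released).
int simulate(Policy policy, int budget) {
    std::array<ToyWave, 2> waves{};
    int flag = 0;
    int last = 1;  // so the interleaved policy starts with wave 0

    for (int slot = 1; slot <= budget; ++slot) {
        int id;
        if (policy == Policy::RunToCompletion) {
            id = waves[0].done ? 1 : 0;        // never leaves an unfinished wave 0
        } else {
            id = 1 - last;                     // alternate between the two waves
            if (waves[id].done) id = 1 - id;   // skip a wave that already finished
        }
        last = id;

        ToyWave& w = waves[id];
        if (id == 0) {
            // Waiter: its pc stays on the spin loop until the flag is visible.
            if (flag != 0) { w.pc = 1; w.done = true; }
        } else {
            // Signaller: sets the flag and completes in one step.
            flag = 1;
            w.pc = 1;
            w.done = true;
        }
        if (waves[0].done && waves[1].done) return slot;
    }
    return -1;
}
```

In this model `simulate(Policy::Interleaved, 10)` finishes in 3 slots, while `simulate(Policy::RunToCompletion, 10)` exhausts its budget and returns -1, which is the deadlock the original question worries about. The reply above suggests that resident waves on a SIMD behave more like the interleaved case.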

BTW, if you want faster communication between wavefronts I suggest you use the 'cl_ext_atomic_counters_32' extension. It uses the GDS memory bank which is an order of magnitude faster than global memory.


Where's this kind of stuff documented? Such as the cl_ext_atomic_counters_32 extension being mapped to GDS? I have been waiting for this feature for ages. Is GWS implemented already, just not yet advertised?