Threadgroups are equivalent to workgroups in OpenCL. It's just our internal terminology for the concept: our hardware designs existed before OpenCL was standardized, and the term applies to other languages besides OpenCL.
To your first question: you can put a wavefront to sleep using the S_SLEEP instruction. You can have up to 16 wavefronts in a workgroup, and the hardware guarantees that they are all assigned to the same compute unit. When a wavefront runs S_SLEEP, the hardware will (temporarily) remove it from the list of "ready to run" wavefronts on that compute unit. This gives other wavefronts the opportunity to run.
Whenever a wavefront runs the S_WAKEUP instruction, the hardware will "ping" the other wavefronts in the workgroup (up to 15 of them) and add them back to the "ready to run" list if they were asleep. It can do this cheaply because all of those wavefronts must be on the same compute unit, so the wake-up logic stays local.
As for your second question: a wavefront may miss the ping because, for example, there is a data race between two wavefronts. Imagine you write a lock (sketched below) where you try to grab the lock and eventually run S_SLEEP if you do not get it, so that you don't hammer the hardware during your spin-loop. After releasing the lock, you run S_WAKEUP to let the other wavefronts in the workgroup try to grab the lock again without letting everything go idle.
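To make the scenario concrete, here is a minimal sketch of such a lock in OpenCL C. The kernel name, lock encoding, and spin threshold are all mine, and plain OpenCL C has no portable way to issue S_SLEEP/S_WAKEUP, so the comments only mark where those instructions would conceptually sit:

```c
/* Illustrative sketch only: names and constants are hypothetical, and
 * S_SLEEP/S_WAKEUP appear as comments because plain OpenCL C cannot
 * emit them directly. */
__kernel void locked_increment(volatile __global int *lock,
                               __global int *counter)
{
    /* Let one lane per (64-wide, on GCN) wavefront contend for the lock. */
    if (get_local_id(0) % 64 == 0) {
        int spins = 0;
        /* Try to flip the lock word from 0 (free) to 1 (held). */
        while (atomic_cmpxchg(lock, 0, 1) != 0) {
            if (++spins > 100) {
                /* An S_SLEEP here would park the wavefront instead of
                 * burning issue slots. Its eventual timeout is what
                 * rescues us from the lost-wakeup ordering below. */
                spins = 0;
            }
        }
        mem_fence(CLK_GLOBAL_MEM_FENCE);  /* order the critical section */
        (*counter)++;                     /* critical section */
        mem_fence(CLK_GLOBAL_MEM_FENCE);
        atomic_xchg(lock, 0);             /* release the lock */
        /* An S_WAKEUP here would re-ready any sleeping wavefronts in
         * this workgroup so they can retry immediately. */
    }
}
```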
One perfectly viable ordering that could happen is:
- Wavefront 0 acquires the lock.
- Wavefront 1 fails to acquire the lock and spin-loops for a while.
- Wavefront 1 fails to acquire the lock again, and then the hardware decides to schedule Wavefront 0 for the next few cycles (since the hardware scheduler regularly changes which wavefronts are running).
- Wavefront 0 completes its work in the critical section and releases the lock.
- Very quickly afterwards (e.g., maybe the next cycle), Wavefront 0 runs S_WAKEUP.
- A few cycles later, Wavefront 1 is scheduled by the hardware again and, because its last attempt to grab the lock failed and it has gone around the spin-loop a few times, runs S_SLEEP.
If S_SLEEP permanently put the wavefront to sleep until an S_WAKEUP was issued, this ordering would leave Wavefront 1 asleep forever: Wavefront 0's S_WAKEUP happened *before* Wavefront 1 went to sleep. However, in our GCN hardware, the S_SLEEP on Wavefront 1 will eventually time out and the wavefront will wake up on its own. This is what is being described in that line of the manual.
As for how to use S_SLEEP/S_WAKEUP in barriers, I'll note that the simplest way to perform a workgroup-wide barrier is the barrier / work_group_barrier operation (minimal example below). Inter-workgroup barriers are not necessarily guaranteed to work on GPUs, for various reasons that are formalized in this paper. You can try to build your own inter-workgroup barrier implementation, and I've done so in the past. But I work at AMD, am very familiar with our hardware limitations and how they compare to our software guarantees, and even then I mess things up a lot.
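For contrast, here is what the supported path looks like: a minimal, self-contained workgroup-wide barrier in OpenCL C (kernel and buffer names are mine, not from any manual):

```c
/* Minimal example of the supported workgroup-wide barrier: each
 * workgroup reverses its tile of the input through local memory. */
__kernel void reverse_each_workgroup(__global const float *in,
                                     __global float *out,
                                     __local float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t n   = get_local_size(0);

    tile[lid] = in[gid];

    /* Every wavefront in the workgroup waits here; the hardware manages
     * the sleeping and waking for you, with none of the races above.
     * (In OpenCL 2.0 this is spelled work_group_barrier.) */
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = tile[n - 1 - lid];
}
```

One usual caveat applies: every work-item in the workgroup must reach the barrier, so keep it out of divergent control flow.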