Threadgroups are equivalent to workgroups in OpenCL. It's just our internal terminology for the concept: our hardware designs existed before OpenCL was standardized, and the term applies to other languages besides OpenCL.
To your first question: you can put a wavefront to sleep using the S_SLEEP instruction. You can have up to 16 wavefronts in a workgroup, and the hardware guarantees that they are all assigned to the same compute unit. When a wavefront runs S_SLEEP, the hardware will (temporarily) remove it from the list of "ready to run" wavefronts on that compute unit. This gives other wavefronts the opportunity to run.
Whenever a wavefront runs the S_WAKEUP instruction, the hardware will "ping" the other wavefronts in the workgroup (up to 15 of them) and add them back to the "ready to run" list if they were asleep. It can do this cheaply because all of those wavefronts must be on the same compute unit, so the wake-up logic stays local.
As for your second question: a wavefront may miss the ping because, for example, there is a data race between two wavefronts. Imagine you write a lock (sketched below) where you try to grab the lock and eventually run S_SLEEP if you do not get it, so that you don't hammer the hardware during your spin-loop. After releasing the lock, you run S_WAKEUP to let the other wavefronts in the workgroup try to grab the lock again without letting everything go idle.
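To make the scenario concrete, here is a minimal sketch of such a lock in OpenCL C. The kernel name, lock encoding, and spin threshold are all mine, and plain OpenCL C has no portable way to issue S_SLEEP/S_WAKEUP, so the comments only mark where those instructions would conceptually sit:

```c
/* Illustrative sketch only: names and constants are hypothetical, and
 * S_SLEEP/S_WAKEUP appear as comments because plain OpenCL C cannot
 * emit them directly. */
__kernel void locked_increment(volatile __global int *lock,
                               __global int *counter)
{
    /* Let one lane per (64-wide, on GCN) wavefront contend for the lock. */
    if (get_local_id(0) % 64 == 0) {
        int spins = 0;
        /* Try to flip the lock word from 0 (free) to 1 (held). */
        while (atomic_cmpxchg(lock, 0, 1) != 0) {
            if (++spins > 100) {
                /* An S_SLEEP here would park the wavefront instead of
                 * burning issue slots. Its eventual timeout is what
                 * rescues us from the lost-wakeup ordering below. */
                spins = 0;
            }
        }
        mem_fence(CLK_GLOBAL_MEM_FENCE);  /* order the critical section */
        (*counter)++;                     /* critical section */
        mem_fence(CLK_GLOBAL_MEM_FENCE);
        atomic_xchg(lock, 0);             /* release the lock */
        /* An S_WAKEUP here would re-ready any sleeping wavefronts in
         * this workgroup so they can retry immediately. */
    }
}
```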
One perfectly viable ordering that could happen is:
- Wavefront 0 acquires the lock.
- Wavefront 1 fails to acquire the lock and spin-loops for a while.
- Wavefront 1 fails to acquire the lock again, and then the hardware decides to schedule Wavefront 0 for the next few cycles (since the hardware scheduler regularly changes which wavefronts are running).
- Wavefront 0 completes its work in the critical section and releases the lock.
- Very quickly afterwards (e.g., maybe the next cycle), Wavefront 0 runs S_WAKEUP.
- A few cycles later, Wavefront 1 is scheduled by the hardware again and, because its last attempt to grab the lock failed and it has gone around the spin-loop a few times, runs S_SLEEP.
If S_SLEEP permanently put the wavefront to sleep until an S_WAKEUP was issued, this ordering would leave Wavefront 1 asleep forever: Wavefront 0's S_WAKEUP happened *before* Wavefront 1 went to sleep. However, in our GCN hardware, the S_SLEEP on Wavefront 1 will eventually time out and the wavefront will wake up on its own. This is what is being described in that line of the manual.
As for how to use S_SLEEP/S_WAKEUP in barriers, I'll note that the simplest way to perform a workgroup-wide barrier is the barrier / work_group_barrier operation (minimal example below). Inter-workgroup barriers are not necessarily guaranteed to work on GPUs, for various reasons that are formalized in this paper. You can try to build your own inter-workgroup barrier implementation, and I've done so in the past. But I work at AMD, am very familiar with our hardware limitations and how they compare to our software guarantees, and even then I mess things up a lot.
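For contrast, here is what the supported path looks like: a minimal, self-contained workgroup-wide barrier in OpenCL C (kernel and buffer names are mine, not from any manual):

```c
/* Minimal example of the supported workgroup-wide barrier: each
 * workgroup reverses its tile of the input through local memory. */
__kernel void reverse_each_workgroup(__global const float *in,
                                     __global float *out,
                                     __local float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    size_t n   = get_local_size(0);

    tile[lid] = in[gid];

    /* Every wavefront in the workgroup waits here; the hardware manages
     * the sleeping and waking for you, with none of the races above.
     * (In OpenCL 2.0 this is spelled work_group_barrier.) */
    barrier(CLK_LOCAL_MEM_FENCE);

    out[gid] = tile[n - 1 - lid];
}
```

One usual caveat applies: every work-item in the workgroup must reach the barrier, so keep it out of divergent control flow.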