Hi,
"We have been running into brick walls when it comes to the hardware, mostly to do with async shader performance (they are not true hardware controlled)."
I don't know exactly what your problem is, but this may be what you're looking for: AMD GCN cards have a hardware global synchronization mechanism. Basically, you can have exactly 2*stream_count threads running on the GPU, and you can use global hardware barriers not just occasionally but even at a 100 kHz rate. I used it for a (sound) wave simulation that ran at 400 kHz and needed a global sync point after every iteration pass to do global gather/scatter work (summing up the samples and applying feedback). The same could be done with consecutive kernel launches, but definitely not in real time, because each kernel runs for too short a time relative to the launch overhead. The drawback is that this technique is not supported in the OpenCL language, only in asm, which is unofficial. If this is what you're looking for, check the gws_barrier instruction in the GCN ISA manual!
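To show the shape of the pattern, here is a rough CPU-side sketch in Python. It is not GCN asm and does not use gws_barrier: threading.Barrier stands in for the hardware global barrier, and the function names, the 0.5 feedback factor and the toy "pull every sample toward the global mean" update are assumptions made up for illustration. run_relaunch models the one-launch-per-iteration approach, while run_persistent keeps a fixed set of workers alive and synchronizes them globally inside the loop.

```python
import threading


def run_relaunch(num_workers, iterations):
    """Reference: one sequential pass per iteration, like one kernel launch each."""
    samples = [float(i + 1) for i in range(num_workers)]
    for _ in range(iterations):
        mean = sum(samples) / num_workers                   # global gather
        samples = [0.5 * s + 0.5 * mean for s in samples]   # feedback
    return samples


def run_persistent(num_workers, iterations):
    """Persistent workers that meet at a global barrier in every iteration."""
    samples = [float(i + 1) for i in range(num_workers)]
    published = [0.0] * num_workers
    # Stand-in for the hardware global barrier (assumption, CPU analogue only).
    barrier = threading.Barrier(num_workers)

    def worker(idx):
        for _ in range(iterations):
            published[idx] = samples[idx]
            barrier.wait()   # every worker has published its sample
            mean = sum(published) / num_workers              # global gather
            barrier.wait()   # every worker has finished reading 'published'
            samples[idx] = 0.5 * samples[idx] + 0.5 * mean   # feedback

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return samples


if __name__ == "__main__":
    print(run_persistent(4, 50))
```

The second barrier.wait() is needed so that no worker overwrites its published slot for the next iteration while another worker is still summing the current one. Both functions do the same arithmetic in the same order, so they should return identical results; the only difference is where the synchronization happens.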