I'm trying to get global barriers to work on an R9 290X (Hawaii), and they're not working for a reason I do not understand. I'm posting in the hope of getting some advice, because I'm out of ideas.
First, the first wave to arrive initializes the barrier, like so (I'm trying it out with only one wave at the moment):
/*000000000074: 7e0002c0 */ v_mov_b32 v0, 1
/*000000000078: d8660000 00000000*/ ds_gws_init v0 gds
/*000000000080: bf8c007f */ s_waitcnt lgkmcnt(0)
To detect the first wave, I'm using an atomic add, as described in this thread.
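In host-side C terms, the first-wave check amounts to the sketch below. This is only an illustration of the idea: `is_first_arrival` and the counter are made-up names, and the real kernel does the same thing with an atomic add in ISA rather than with C11 atomics.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper, for illustration only.
 * Every arriving wave increments the shared counter once.
 * Only the wave that reads back the old value 0 is "first"
 * and therefore the one that should initialize the barrier. */
static bool is_first_arrival(atomic_uint *counter) {
    return atomic_fetch_add(counter, 1u) == 0u;
}
```

The counter has to start at zero before the kernel launches, otherwise no wave will ever see itself as the first one.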
After that, all threads (intentionally) waste some cycles with either a bunch of memory writes or multiple s_nop instructions, since I remember seeing a post somewhere saying that the barrier must be initialized a few hundred cycles before use. I'm not sure whether that is necessary, but I put the delay in just in case.
Afterwards, all threads wait on the barrier, like so:
/*0000000001e8: 7e000280 */ v_mov_b32 v0, 1
/*0000000001ec: d8760000 00000000*/ ds_gws_barrier v0 gds
/*0000000001f4: bf8c007f */ s_waitcnt lgkmcnt(0)
and then just exit.
I've tried multiple values for the wave count, setting v0 (see the code above) to the number of waves, the number of threads, etc. In every case the code seems to get stuck on ds_gws_barrier until the driver resets the video card. I have verified that the instruction encoding generated by CLRadeonExtender is correct by comparing the machine code I'm getting to the code I should be getting according to the Southern Islands ISA doc (found here https://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf).
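For reference, this is roughly how the first dword of the two DS instructions can be checked by hand, written out as a small C sketch. The field layout (encoding in bits 31:26, opcode in bits 25:18, GDS flag in bit 17, offsets in bits 15:0) is my reading of the DS microcode format in the ISA doc, so treat it as an assumption; `ds_fields` and `decode_ds` are names I made up.

```c
#include <stdint.h>

/* First dword of a DS instruction, split into fields.
 * Layout is an assumption based on my reading of the ISA doc. */
typedef struct {
    unsigned encoding; /* bits 31:26, expected 0x36 (binary 110110) for DS */
    unsigned op;       /* bits 25:18, DS opcode number */
    unsigned gds;      /* bit 17, 1 when the "gds" modifier is set */
    unsigned offset;   /* bits 15:0, offset1:offset0 */
} ds_fields;

static ds_fields decode_ds(uint32_t dword0) {
    ds_fields f;
    f.encoding = (dword0 >> 26) & 0x3fu;
    f.op       = (dword0 >> 18) & 0xffu;
    f.gds      = (dword0 >> 17) & 0x1u;
    f.offset   = dword0 & 0xffffu;
    return f;
}
```

Feeding it 0xd8660000 and 0xd8760000 from the listings above gives opcodes 25 and 29 with the GDS bit set and a zero offset, which is what I understand ds_gws_init and ds_gws_barrier should be.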
As some other threads suggest, I'm setting m0 to 0x1000, but again, I've tried different values here too and the code still locks up.
I'm on Windows 10 with the latest available drivers from AMD, if that matters. What other info do I need to provide to get better answers, and what would you suggest I should try?
What am I doing wrong?