One of my GPU kernels relies heavily on atomic floating-point addition in local memory. My current implementation uses a loop of atom_cmpxchg() calls.
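For reference, the compare-and-swap pattern I mean looks roughly like the sketch below. It is a host-side C11 analogue, not my actual kernel: the function name atomic_add_f32_cas is made up for illustration, and atomic_compare_exchange_weak() stands in for atom_cmpxchg() on a __local unsigned int. The idea is the same in both cases: reinterpret the float as a 32-bit integer, compute the sum, and retry until the swap succeeds.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: atomically adds val to the float whose bit pattern
 * is stored in *addr, and returns the value that was there before. */
static float atomic_add_f32_cas(_Atomic uint32_t *addr, float val)
{
    uint32_t old_bits = atomic_load(addr);
    uint32_t new_bits;
    float old_f;

    do {
        /* Reinterpret the current bits as a float and compute the sum. */
        memcpy(&old_f, &old_bits, sizeof old_f);
        float new_f = old_f + val;
        memcpy(&new_bits, &new_f, sizeof new_bits);
        /* On failure, old_bits is refreshed with the current contents
         * of *addr, so the next iteration recomputes the sum. */
    } while (!atomic_compare_exchange_weak(addr, &old_bits, new_bits));

    return old_f;
}
```

Under contention every failed swap costs another round trip, which is why a single hardware instruction looked attractive as a replacement.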
I noticed that the GCN3 instruction set manual lists a DS_ADD_F32 instruction, but it gives very little detail about it. Is it the correct instruction to use for atomic floating-point addition in the local data share (LDS)? Are there any special requirements or caveats when using it? How does its performance compare to a loop of DS_CMPST_RTN_B32?
My initial test on an RX 480 shows that DS_ADD_F32 performs the atomic add correctly, but it is quite slow (it can be a few times slower than a loop of atom_cmpxchg() calls). I am not sure whether it would be faster on Fiji or Vega.