There is a (not necessarily negligible) cost attached to FP atomic units. Also, if you consider the typical use for atomics (building blocks for sync primitives), it is unclear where FP atomics would fit in (granted, there are some other neat uses). NVIDIA decided the cost of having FP atomic units was worth it given the benefits, so they can do general atomics on FP types (everybody can do exchange). Note that using cmpxchg one can actually implement most of everything, be it for integer or float types; it's just that it's kludgy and hardly perf optimal.
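To make the cmpxchg point concrete, here is a minimal sketch of a float atomic add built on a 32-bit integer compare-exchange. It is plain host-side C++ using std::atomic rather than any GPU intrinsic, and the names float_to_bits, bits_to_float and atomic_add_f32 are made up for illustration, not taken from any vendor API.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: reinterpret a float as its raw 32-bit pattern and back.
static inline std::uint32_t float_to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

static inline float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Emulated FP atomic add: the float lives in memory as raw bits, and the
// update is retried until the integer compare-exchange succeeds.
// Returns the value that was stored before the add.
float atomic_add_f32(std::atomic<std::uint32_t>& target, float value) {
    std::uint32_t expected = target.load(std::memory_order_relaxed);
    for (;;) {
        const float old_value = bits_to_float(expected);
        const std::uint32_t desired = float_to_bits(old_value + value);
        // On failure, compare_exchange_weak reloads `expected` with the
        // current contents, so the next iteration recomputes from fresh data.
        if (target.compare_exchange_weak(expected, desired,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
            return old_value;
        }
    }
}
```

The compare is done on the bit pattern, not on the float value, so a NaN in memory does not make the loop spin forever. The kludgy part is visible too: every contended update costs at least one extra round trip, which is exactly what a dedicated FP atomic unit avoids.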
On the GCN architecture we have:
ds_cmpst_f32 (compare+swap), ds_min_f32, ds_max_f32 for Local/Global Data Share.
There are variants of those that return the previous value: ds_cmpst_rtn_f32, ds_min_rtn_f32, ds_max_rtn_f32
Also all of those have _f64 versions.
For memory there are some float32 atomics:
buffer_atomic_fcmpswap, buffer_atomic_fmax, image_atomic_fmin
It's possible to return previous values, and the _x2 variants operate on 64-bit operands: buffer_atomic_fcmpswap_x2, buffer_atomic_fmax_x2