I've been working on a DX11 sort for a while. It's finally working, and working very well. I'm seeing 329 million pairs/sec for deinterleaved key/value pairs, 279 million pairs/sec for interleaved pairs, and 408 million uints/sec for keys alone. By comparison, this rather authoritative paper (http://mgarland.org/files/papers/gpusort-ipdps09.pdf) reports only 145 million pairs/sec on a GTX 280 using CUDPP. I have a more complete write-up here. I would be interested in comments.
I've seen better radix sorts for CUDA than the one you linked to, though. I wouldn't put much credibility in any "published" papers, since the publication process is so broken (I'm sure most of the academic community would disagree with me here, of course). Also, most of the CUDA libraries are not that efficient. For example, I've seen great speedups for FFT on CUDA compared to CUFFT.