I've been working on a DX11 sort for a while. It's finally working, and working very well. I'm seeing 329 million pairs/sec for deinterleaved key/value pairs, 279 million pairs/sec for interleaved pairs, and 408 million uints/sec for keys alone. By comparison, this rather authoritative paper (http://mgarland.org/files/papers/gpusort-ipdps09.pdf) reports only 145 million pairs/sec on a GTX 280 using CUDPP. I have a more complete write-up at the link below, and I'd be interested in comments.
http://forums.xna.com/forums/p/46766/279871.aspx#279871
.sean