Good job, I'm really impressed! Never thought it could be that fast.
Though I've seen better radix sorts for CUDA than the one you linked to. I wouldn't put much stock in any "published" papers, since the publication process is so broken (I'm sure most of the academic community would disagree with me here, of course). Also, most of the CUDA libraries are not that efficient. For example, I've seen FFT implementations on CUDA that are much faster than CUFFT.