BarnacleJunior

D3D11 cs_5_0 radix sort performance - how to improve?

Discussion created by BarnacleJunior on Dec 28, 2009
Latest reply on Dec 29, 2009 by eduardoschardong

I'm reposting this from an XNA forum post I made.  I'm trying to get a respectable radix sort, but am seeing numbers well below what I think the hardware is capable of.

 

I benchmarked my own cs_5_0 radix sort to death this weekend.  It is quite difficult to write that functhion, as performance is extremely variable with the choice of the number of threads per threadgroup, values per thread, optimal size of the sort digit in bits, etc.  The Garland paper http://mgarland.org/files/papers/gpusort-ipdps09.pdf reports a throughput of 140 million pairs/sec on GTX 280 with arrays of 4 million elements.  The best I was able to do on my HD5850, which is a much faster card (although I admit I don't understand how its integer performance compares) is 64 million pairs/sec.  Garland was able to get that same throughput even on an old 8800 Ultra.  What numbers have you MS guys gotten on this important function?  I really don't know how to bring the numbers up.  I'm unsure how to resolve local data store bank conflicts on Cypress (if that is the culprit).  I've tried perfect hashes (eg ((31 & i)<< 5) | (i>> 5)) on a 10bit index) but it doesn't do anything.  Are there any profiling tools to help me out, or a reference cs_5_0 sort I could get my hands on?  I can clean up my own prefix sum and radix sort benchmark code a bit and distribute it here if that would be helpful.

Anyway, here are the benchmarks I ran last night.  Each sort was run 200 times on 1<<22 sized arrays.  The elements are uint2s, which I've filled with random numbers, and I'm ping-ponging between sorting the .x and .y components.  The peak throughput is for 5 bit digits, 64 threads, and 8 values per thread.  I'm really at a loss why 16 values per thread isn't better in almost every case, as you can increase the amount of sequential work by doing that without increasing the number of barriers or anything else..

.sean


digitSize=4. numThreads=16. valuesPerThread=1. Sorted 10.194M words/sec.
digitSize=4. numThreads=16. valuesPerThread=2. Sorted 17.724M words/sec.
digitSize=4. numThreads=16. valuesPerThread=4. Sorted 27.289M words/sec.
digitSize=4. numThreads=16. valuesPerThread=8. Sorted 36.673M words/sec.
digitSize=4. numThreads=16. valuesPerThread=16. Sorted 35.438M words/sec.
digitSize=4. numThreads=32. valuesPerThread=1. Sorted 16.690M words/sec.
digitSize=4. numThreads=32. valuesPerThread=2. Sorted 27.379M words/sec.
digitSize=4. numThreads=32. valuesPerThread=4. Sorted 40.724M words/sec.
digitSize=4. numThreads=32. valuesPerThread=8. Sorted 49.755M words/sec.
digitSize=4. numThreads=32. valuesPerThread=16. Sorted 45.398M words/sec.
digitSize=4. numThreads=64. valuesPerThread=1. Sorted 26.610M words/sec.
digitSize=4. numThreads=64. valuesPerThread=2. Sorted 42.152M words/sec.
digitSize=4. numThreads=64. valuesPerThread=4. Sorted 54.474M words/sec.
digitSize=4. numThreads=64. valuesPerThread=8. Sorted 57.046M words/sec.
digitSize=4. numThreads=64. valuesPerThread=16. Sorted 48.454M words/sec.
digitSize=4. numThreads=128. valuesPerThread=1. Sorted 26.931M words/sec.
digitSize=4. numThreads=128. valuesPerThread=2. Sorted 36.905M words/sec.
digitSize=4. numThreads=128. valuesPerThread=4. Sorted 42.631M words/sec.
digitSize=4. numThreads=128. valuesPerThread=8. Sorted 49.923M words/sec.
digitSize=4. numThreads=128. valuesPerThread=16. Sorted 38.714M words/sec.
digitSize=4. numThreads=256. valuesPerThread=1. Sorted 16.066M words/sec.
digitSize=4. numThreads=256. valuesPerThread=2. Sorted 27.004M words/sec.
digitSize=4. numThreads=256. valuesPerThread=4. Sorted 39.192M words/sec.
digitSize=4. numThreads=256. valuesPerThread=8. Sorted 45.952M words/sec.
digitSize=4. numThreads=512. valuesPerThread=1. Sorted 8.868M words/sec.
digitSize=4. numThreads=512. valuesPerThread=2. Sorted 14.117M words/sec.
digitSize=4. numThreads=512. valuesPerThread=4. Sorted 26.855M words/sec.
digitSize=4. numThreads=1024. valuesPerThread=1. Sorted 7.084M words/sec.
digitSize=4. numThreads=1024. valuesPerThread=2. Sorted 9.924M words/sec.
digitSize=5. numThreads=32. valuesPerThread=1. Sorted 9.296M words/sec.
digitSize=5. numThreads=32. valuesPerThread=2. Sorted 17.073M words/sec.
digitSize=5. numThreads=32. valuesPerThread=4. Sorted 25.587M words/sec.
digitSize=5. numThreads=32. valuesPerThread=8. Sorted 52.341M words/sec.
digitSize=5. numThreads=32. valuesPerThread=16. Sorted 47.432M words/sec.
digitSize=5. numThreads=64. valuesPerThread=1. Sorted 16.654M words/sec.
digitSize=5. numThreads=64. valuesPerThread=2. Sorted 26.697M words/sec.
digitSize=5. numThreads=64. valuesPerThread=4. Sorted 62.983M words/sec.
digitSize=5. numThreads=64. valuesPerThread=8. Sorted 64.209M words/sec.
digitSize=5. numThreads=64. valuesPerThread=16. Sorted 55.197M words/sec.
digitSize=5. numThreads=128. valuesPerThread=1. Sorted 21.765M words/sec.
digitSize=5. numThreads=128. valuesPerThread=2. Sorted 40.899M words/sec.
digitSize=5. numThreads=128. valuesPerThread=4. Sorted 46.560M words/sec.
digitSize=5. numThreads=128. valuesPerThread=8. Sorted 56.489M words/sec.
digitSize=5. numThreads=128. valuesPerThread=16. Sorted 43.941M words/sec.
digitSize=5. numThreads=256. valuesPerThread=1. Sorted 16.975M words/sec.
digitSize=5. numThreads=256. valuesPerThread=2. Sorted 28.706M words/sec.
digitSize=5. numThreads=256. valuesPerThread=4. Sorted 43.195M words/sec.
digitSize=5. numThreads=256. valuesPerThread=8. Sorted 52.469M words/sec.
digitSize=5. numThreads=512. valuesPerThread=1. Sorted 9.505M words/sec.
digitSize=5. numThreads=512. valuesPerThread=2. Sorted 15.017M words/sec.
digitSize=5. numThreads=512. valuesPerThread=4. Sorted 29.752M words/sec.
digitSize=5. numThreads=1024. valuesPerThread=1. Sorted 7.518M words/sec.
digitSize=5. numThreads=1024. valuesPerThread=2. Sorted 10.778M words/sec.
digitSize=6. numThreads=64. valuesPerThread=1. Sorted 8.046M words/sec.
digitSize=6. numThreads=64. valuesPerThread=2. Sorted 16.634M words/sec.
digitSize=6. numThreads=64. valuesPerThread=4. Sorted 31.434M words/sec.
digitSize=6. numThreads=64. valuesPerThread=8. Sorted 45.359M words/sec.
digitSize=6. numThreads=64. valuesPerThread=16. Sorted 53.156M words/sec.
digitSize=6. numThreads=128. valuesPerThread=1. Sorted 16.062M words/sec.
digitSize=6. numThreads=128. valuesPerThread=2. Sorted 24.877M words/sec.
digitSize=6. numThreads=128. valuesPerThread=4. Sorted 46.939M words/sec.
digitSize=6. numThreads=128. valuesPerThread=8. Sorted 56.465M words/sec.
digitSize=6. numThreads=128. valuesPerThread=16. Sorted 45.396M words/sec.
digitSize=6. numThreads=256. valuesPerThread=1. Sorted 17.468M words/sec.
digitSize=6. numThreads=256. valuesPerThread=2. Sorted 28.950M words/sec.
digitSize=6. numThreads=256. valuesPerThread=4. Sorted 43.296M words/sec.
digitSize=6. numThreads=256. valuesPerThread=8. Sorted 53.728M words/sec.
digitSize=6. numThreads=512. valuesPerThread=1. Sorted 9.217M words/sec.
digitSize=6. numThreads=512. valuesPerThread=2. Sorted 14.310M words/sec.
digitSize=6. numThreads=512. valuesPerThread=4. Sorted 30.464M words/sec.
digitSize=6. numThreads=1024. valuesPerThread=1. Sorted 7.987M words/sec.
digitSize=6. numThreads=1024. valuesPerThread=2. Sorted 11.530M words/sec.

Outcomes