Multiple wavefronts, bank conflicts, and scan question

Discussion created by BarnacleJunior on Jan 6, 2010
Latest reply on Jan 7, 2010 by BarnacleJunior

I'm still working on getting good prefix scan performance, with an algorithm like this one

With 64 threads processing 8 values each, I get 5171M uints/sec, and with 256 threads processing 8 each I get 5444M uints/sec.  All other combinations are lower.  I am on a 5850, but even the 5444 term is lower than the 6300M reported in the NV paper for a GTX280.

I'm using a scan like this:

void ThreadSumWavefront(uint tid, uint scansize) {
    uint lane = (WAVEFRONT - 1) & tid;
    uint laneMask = ~(WAVEFRONT - 1) & tid;
    for(uint offset = 1; offset < scansize; offset<<= 1) {
        uint tid2 = ((tid - offset) & (scansize - 1)) | laneMask;
        uint target = sharedSum[tid];
        uint source = sharedSum[tid2];
        bool cond = lane >= offset;
        target += cond ? source : 0;
        sharedSum[tid] = target;

For a single wavefront, that is the only scan I need.  For multiple wavefronts, I have to take each 63==lane thread and gather those inclusive scans to the LDS and run the scan again, but with scansize=NUM_WAVEFRONTS rather than scansize=64.

Is this a good strategy on ATI hardware?  the four wavefront version with 8 values per thread barely beats the best single wavefront version.  In most the stuff I've experimented with, single wavefront versions are most efficient, especially as the shaders get more complicated.  Also, do bank conflicts get in the way of widening the scan LDS to type uint4 rather than uint?  With only 32 LDS channels and 16 active threads, if each thread accesses its own slot in the LDS each cycle does that imply a 2-way bank conflict on each channel?  I've tried with uint2 LDS arrays, which I think avoid bank conflicts, but the added complexity seems to negate a performance benefit from the better ALU packing in the scan.

Is there a high performance scan and sort that ATI can share, similar to CUDPP?