Memory bandwidth performance degradation in non-divergent dynamic loops

Discussion created by BarnacleJunior on Feb 8, 2010
Latest reply on Feb 15, 2010 by BarnacleJunior

I'm writing a sparse matrix * vector shader.  It has been working for a week, but performance was terrible - about 1/4th what it should have been.  I worked it down to the fact that memory bandwidth apparently nosedives when fetching inside a dynamic loop, even when the branch is non-divergent and all the reads are coalesced. 

To test this out, I wrote a short D3D compute shader where each threadgroup pulls an offset from an SRV, then loops over a number of blocks, adds up the values thread-wise, and writes the sums back out.  In the dynamic case, the count is baked into the high 6 bits of the offset (it is the same for all blocks in this example).  In the static case it is a preprocessor macro.  I thought that the penalty for a non-divergent loop would not be so bad, but it cuts fetch bandwidth way way down.

Can someone explain what to do about this problem?  It makes it very problematic to write decent mathematical kernels when performance plummets simply by using a loop.  In my test I did implement static loops for powers of 2.  It does accelerate things a good deal over the basic dynamic loop, but will complicate any non-test code tremendously.  Is this performance problem an issue with DirectCompute.. or is it intrinsic to all GPU computing, or just ATI?  I don't recall reading about this particular problem in CUDA literature (but I've never actually used that toolkit).  Additionally I put in a method that uses the [unroll(n)] modifier.  The IL generated does a fetch for each n, but uses movc to only accumulate the value when the loop counter is less than n.  This appears the fastest method, but I fear it would also degrade badly when there is a lot of ALU work inside the loop.

Would there happen to be some way to indicate to the HLSL compiler that a loop is non-divergent, even if it is impossible to unroll it?

In the following numbers, I'm not including in the fetch bandwidth measurement the sourceRange[gid] fetch.  Each test is run 5000 iterations for accuracy.  I'm on a stock HD5850.

STATIC_LOOP: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 24.941 GB/s   write: 12.471 GB/s
DYNAMIC_LOOP: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 15.230 GB/s   write: 7.615 GB/s
DYNAMIC_LOOP2: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 16.946 GB/s   write: 8.473 GB/s
DYNAMIC_LOOP3: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 16.301 GB/s   write: 8.151 GB/s

STATIC_LOOP: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 65.679 GB/s   write: 9.383 GB/s
DYNAMIC_LOOP: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 27.496 GB/s   write: 3.928 GB/s
DYNAMIC_LOOP2: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 34.199 GB/s   write: 4.886 GB/s
DYNAMIC_LOOP3: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 52.425 GB/s   write: 7.489 GB/s

STATIC_LOOP: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 83.428 GB/s   write: 6.952 GB/s
DYNAMIC_LOOP: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 31.545 GB/s   write: 2.629 GB/s
DYNAMIC_LOOP2: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 57.558 GB/s   write: 4.797 GB/s
DYNAMIC_LOOP3: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 67.807 GB/s   write: 5.651 GB/s

STATIC_LOOP: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 87.304 GB/s   write: 5.136 GB/s
DYNAMIC_LOOP: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 33.538 GB/s   write: 1.973 GB/s
DYNAMIC_LOOP2: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 73.013 GB/s   write: 4.295 GB/s
DYNAMIC_LOOP3: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 74.582 GB/s   write: 4.387 GB/s

STATIC_LOOP: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 93.726 GB/s   write: 4.260 GB/s
DYNAMIC_LOOP: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 34.553 GB/s   write: 1.571 GB/s
DYNAMIC_LOOP2: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 75.626 GB/s   write: 3.438 GB/s
DYNAMIC_LOOP3: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 90.106 GB/s   write: 4.096 GB/s

STATIC_LOOP: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 97.186 GB/s   write: 3.599 GB/s
DYNAMIC_LOOP: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 35.375 GB/s   write: 1.310 GB/s
DYNAMIC_LOOP2: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 79.275 GB/s   write: 2.936 GB/s
DYNAMIC_LOOP3: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 88.817 GB/s   write: 3.290 GB/s

StructuredBuffer<uint> sourceRange : register(t0); StructuredBuffer<uint> sourceData : register(t1); RWStructuredBuffer<uint> targetData : register(u0); uint SumLoop(uint offset, uint tid, uint count) { uint sum = 0; for(uint i = 0; i < count; ++i) sum += sourceData[offset + i * NUM_THREADS + tid]; return sum; } [numthreads(NUM_THREADS, 1, 1)] void CopyShader2(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID) { uint gid = groupID.x; uint range = sourceRange[gid]; uint offset = 0x03ffffff & range; uint count = range>> 26; uint sum; #if defined(STATIC_LOOP) sum = SumLoop(offset, tid, STATIC_LOOP); #elif defined(DYNAMIC_LOOP) sum = SumLoop(offset, tid, count); #elif defined(DYNAMIC_LOOP2) sum = 0; while(count > 0) { // Attack by powers of 2 if(count < 2) { sum += SumLoop(offset, tid, 1); offset += NUM_THREADS * 1; count -= 1; } else if(count < 4) { sum += SumLoop(offset, tid, 2); offset += NUM_THREADS * 2; count -= 2; } else if(count < 8) { sum += SumLoop(offset, tid, 4); offset += NUM_THREADS * 4; count -= 4; } else if(count < 16) { sum += SumLoop(offset, tid, 8); offset += NUM_THREADS * 8; count -= 8; } else { sum += SumLoop(offset, tid, 16); offset += NUM_THREADS * 16; count -= 16; } } #elif defined(DYNAMIC_LOOP3) while(count > 0) { uint c = min(count, 8); [unroll(8)] for(uint i = 0; i < c; ++i) sum += sourceData[offset + i * NUM_THREADS + tid]; count -= c; offset += c * NUM_THREADS; } #endif targetData[NUM_THREADS * gid + tid] = sum; }