Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Journeyman III

Memory bandwidth performance degradation in non-divergent dynamic loops

I'm writing a sparse matrix * vector shader.  It has been working for a week, but performance was terrible - about 1/4th what it should have been.  I worked it down to the fact that memory bandwidth apparently nosedives when fetching inside a dynamic loop, even when the branch is non-divergent and all the reads are coalesced. 

To test this out, I wrote a short D3D compute shader where each threadgroup pulls an offset from an SRV, then loops over a number of blocks, adds up the values thread-wise, and writes the sums back out.  In the dynamic case, the count is baked into the high 6 bits of the offset (it is the same for all blocks in this example).  In the static case it is a preprocessor macro.  I thought that the penalty for a non-divergent loop would not be so bad, but it cuts fetch bandwidth way way down.

Can someone explain what to do about this problem?  It makes it very problematic to write decent mathematical kernels when performance plummets simply by using a loop.  In my test I did implement static loops for powers of 2.  It does accelerate things a good deal over the basic dynamic loop, but will complicate any non-test code tremendously.  Is this performance problem an issue with DirectCompute.. or is it intrinsic to all GPU computing, or just ATI?  I don't recall reading about this particular problem in CUDA literature (but I've never actually used that toolkit).  Additionally I put in a method that uses the [unroll(n)] modifier.  The IL generated does a fetch for each n, but uses movc to only accumulate the value when the loop counter is less than n.  This appears the fastest method, but I fear it would also degrade badly when there is a lot of ALU work inside the loop.

Would there happen to be some way to indicate to the HLSL compiler that a loop is non-divergent, even if it is impossible to unroll it?

In the following numbers, I'm not including in the fetch bandwidth measurement the sourceRange[gid] fetch.  Each test is run 5000 iterations for accuracy.  I'm on a stock HD5850.

STATIC_LOOP: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 24.941 GB/s   write: 12.471 GB/s
DYNAMIC_LOOP: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 15.230 GB/s   write: 7.615 GB/s
DYNAMIC_LOOP2: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 16.946 GB/s   write: 8.473 GB/s
DYNAMIC_LOOP3: blocksPerThread=2 numThreads=64 numBlocks=16384
read: 16.301 GB/s   write: 8.151 GB/s

STATIC_LOOP: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 65.679 GB/s   write: 9.383 GB/s
DYNAMIC_LOOP: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 27.496 GB/s   write: 3.928 GB/s
DYNAMIC_LOOP2: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 34.199 GB/s   write: 4.886 GB/s
DYNAMIC_LOOP3: blocksPerThread=7 numThreads=64 numBlocks=16384
read: 52.425 GB/s   write: 7.489 GB/s

STATIC_LOOP: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 83.428 GB/s   write: 6.952 GB/s
DYNAMIC_LOOP: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 31.545 GB/s   write: 2.629 GB/s
DYNAMIC_LOOP2: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 57.558 GB/s   write: 4.797 GB/s
DYNAMIC_LOOP3: blocksPerThread=12 numThreads=64 numBlocks=16384
read: 67.807 GB/s   write: 5.651 GB/s

STATIC_LOOP: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 87.304 GB/s   write: 5.136 GB/s
DYNAMIC_LOOP: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 33.538 GB/s   write: 1.973 GB/s
DYNAMIC_LOOP2: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 73.013 GB/s   write: 4.295 GB/s
DYNAMIC_LOOP3: blocksPerThread=17 numThreads=64 numBlocks=16384
read: 74.582 GB/s   write: 4.387 GB/s

STATIC_LOOP: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 93.726 GB/s   write: 4.260 GB/s
DYNAMIC_LOOP: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 34.553 GB/s   write: 1.571 GB/s
DYNAMIC_LOOP2: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 75.626 GB/s   write: 3.438 GB/s
DYNAMIC_LOOP3: blocksPerThread=22 numThreads=64 numBlocks=16384
read: 90.106 GB/s   write: 4.096 GB/s

STATIC_LOOP: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 97.186 GB/s   write: 3.599 GB/s
DYNAMIC_LOOP: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 35.375 GB/s   write: 1.310 GB/s
DYNAMIC_LOOP2: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 79.275 GB/s   write: 2.936 GB/s
DYNAMIC_LOOP3: blocksPerThread=27 numThreads=64 numBlocks=16384
read: 88.817 GB/s   write: 3.290 GB/s

StructuredBuffer<uint> sourceRange : register(t0); StructuredBuffer<uint> sourceData : register(t1); RWStructuredBuffer<uint> targetData : register(u0); uint SumLoop(uint offset, uint tid, uint count) { uint sum = 0; for(uint i = 0; i < count; ++i) sum += sourceData[offset + i * NUM_THREADS + tid]; return sum; } [numthreads(NUM_THREADS, 1, 1)] void CopyShader2(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID) { uint gid = groupID.x; uint range = sourceRange[gid]; uint offset = 0x03ffffff & range; uint count = range>> 26; uint sum; #if defined(STATIC_LOOP) sum = SumLoop(offset, tid, STATIC_LOOP); #elif defined(DYNAMIC_LOOP) sum = SumLoop(offset, tid, count); #elif defined(DYNAMIC_LOOP2) sum = 0; while(count > 0) { // Attack by powers of 2 if(count < 2) { sum += SumLoop(offset, tid, 1); offset += NUM_THREADS * 1; count -= 1; } else if(count < 4) { sum += SumLoop(offset, tid, 2); offset += NUM_THREADS * 2; count -= 2; } else if(count < 😎 { sum += SumLoop(offset, tid, 4); offset += NUM_THREADS * 4; count -= 4; } else if(count < 16) { sum += SumLoop(offset, tid, 8); offset += NUM_THREADS * 8; count -= 8; } else { sum += SumLoop(offset, tid, 16); offset += NUM_THREADS * 16; count -= 16; } } #elif defined(DYNAMIC_LOOP3) while(count > 0) { uint c = min(count, 8); [unroll(8)] for(uint i = 0; i < c; ++i) sum += sourceData[offset + i * NUM_THREADS + tid]; count -= c; offset += c * NUM_THREADS; } #endif targetData[NUM_THREADS * gid + tid] = sum; }

6 Replies
Journeyman III

I think the problem is related to the number of fetch clauses, try to keep at least 4 loads together.


Can you elaborate?  Four loads because four groups of 16 active threads?  I really don't know what I'm doing wrong there.


I zipped up the project (very small amount of code).. If you have any idea on how to get the performance up in that DYNAMIC_LOOP case I'd be very interested.

Text of the timings are at



I hope you will forgive me for not testing (no time) and only talking from a theorical base, and also by a few english mistakes because I'm too tyred 

Normal ALU instructions and memory instructions aren't executed at the same place, if there is a for(int i = 0; i < x; i++) y += z; the kernel will start executing ALU instructions in some place then memory on another, then ALU, then memory again and so on each iteration and it take some time to move from one place to another, this time to switch clauses will make your application slower, if you put and unroll(2) on the statement above there will be less clause switches and then it will run faster.

There are some minor problems about conditional statements and the compiler not being able to understand 1 + 2 + 3 + 4 is the same as (1 + 2) + (3 + 4), could your sourceData be uint4?

About conditionals, in the second and third dynamic loops, in the second, forgot about long if-elseif sequences, even if they are non-divergent there will be many clauses, in the third, since the compiler cannot determine if all 8 values will be used it will generate a lot of cmoves, it's ok, the problem is it's dumby and will do on every iteration..., But it's simple to solve.


#elif defined(DYNAMIC_LOOP4) uint c = count / 8 * 8; for(uint j = 0; j < c; j += 😎 [unroll(8)] for(uint i = 0; i < 8; ++i) sum += sourceData[offset + (i + j) * NUM_THREADS + tid]; for(uint i = c; i < count; ++i) sum += sourceData[offset + i * NUM_THREADS + tid];


The main problem with your program is that it's accessing memory in the worst possible access pattern for your hardware.  The chip accesses memory through channels and banks.  The channel used to access memory is controlled through bits 10:8 of the memory address.  The bank is selected using bits 14:11 of the address.

Your test accesses memory in a loop based on some constants + loop_iterator * num_threads, where num_threads = 64.  This means you're accessing the same bank and channel of memory over and over, saturating it.

While it is natural to pick a nice, round number like as the pitch of memory you're accessing, you'd be better off using a pitch which causes you to rotate through all of the memory channels instead of using the same one over and over.

To do this, I downloaded your test and modified NUM_THREADS to be 257.  I saw performance increase from 14 GB/s to 62 GB/s.

257 is not a great number of threads to use for other reasons, but since your program is so thoroughly memory bound, the memory access speed ends up being the primary bottleneck in the test.


Ahh that's very interesting.. But why is it only slow for the dynamic case?  The statically unrolled case was hitting peak bandwidth, while reading the exact same bytes in the same order as the dynamic loop case.