
sp314
Adept II

A peculiar performance gain when running more threads. What am I missing about the GCN and the wave scheduler?

I have a kernel that runs a lot of iterations (8 million or so; slow convergence) of pretty straightforward float32 VALU instructions on some simple input data. The kernel is a loop with a body about 200 instructions long, not unrolled or anything. There are no memory reads/writes in the loop body, only some minor loads and stores at the beginning and at the end of the kernel.
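Schematically, the kernel looks something like this (a minimal hypothetical sketch in OpenCL C, not the actual kernel; the names and the particular op mix are invented, and the real loop body is much longer):

    // Hypothetical sketch of the kernel's shape: a minor load, a long
    // loop of straight-line float32 VALU work, a minor store at the end.
    __kernel void valu_loop(__global const float *in,
                            __global float *out,
                            const int iterations)   // ~8 million
    {
        int gid = get_global_id(0);
        float a = in[gid];                // minor load at the beginning
        float b = a * 0.5f + 1.0f;

        for (int i = 0; i < iterations; ++i) {
            // Loop body: float32 VALU instructions only
            // (the real body is ~200 instructions, not unrolled).
            a = a * b + 0.25f;
            b = b * a - 0.125f;
            a = a * a + b;
            b = b * b - a;
        }

        out[gid] = a + b;                 // minor store at the end
    }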

The kernel uses exactly 64 vector registers, v0-v63, no LDS, and very few scalar registers. On an R9 290X (Hawaii), when I launch a grid of 64 threads/wave * 44 CUs * 4 waves/CU (in the x dimension, y = z = 1) with a workgroup size of 64 to match the wave size, I end up with a performance of N loop-body iterations per second; the exact N doesn't matter. When I launch a grid of 64*44*8 total threads (eight waves instead of four), I get about a 20% performance increase. The results appear to be correct in both cases.

I don't see a reason for this performance gain, so maybe I'm missing something fundamental that I should know about GCN?

To my understanding, 64 VGPRs should limit my code to 4 simultaneous waves per CU, and the entire machine with 44 CUs should be running as many waves at once as it can with a total of 64*44*4 threads. If I'm interpreting things correctly, a 64*44*8 grid should be of no benefit in this case, since the waves/workgroups past 44*4 will only start running as some of the first 44*4 waves finish.

I've read the ISA docs, the OpenCL optimization guide, and everything else I could find on the subject. Larger grids should not help, right?

I have no doubt that this is not the fault of the machine or the drivers; rather, I'm missing something or not understanding it correctly. Why the performance gain? What am I missing here?

(This is with ROCm+AMDGPU-PRO on Ubuntu 16.04, if it matters)

Thank you in advance!

7 Replies
dipak
Big Boss

As per the optimization guide, the limit on active wavefronts per CU is (256 / #REGISTERS_PER_KERNEL) * 4. The document describes it as follows:

Each wavefront can have at most 256 registers (VGPRs). To compute the number of wavefronts per CU, take (256/# registers)*4.

For example, a kernel that uses 120 registers (120x32-bit values) can run with eight active wavefronts on each compute unit. Because of the global limits described earlier, each compute unit is limited to 40 wavefronts; thus, kernels can use up to 25 registers (25x32-bit values) without affecting the number of wavefronts/compute unit.

This suggests a larger number of active wavefronts per CU in the second case, i.e. when the total number of threads is 64*44*8 instead of 64*44*4, which might account for the performance gain.
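As a quick sanity check, the formula can be evaluated directly (a small sketch in plain C; the 40-wavefront cap is the per-CU global limit quoted above):

    /* Quick check of the guide's occupancy formula:
       waves/CU = min(40, (256 / VGPRs per thread) * 4). */
    #include <stdio.h>

    static int max_waves_per_cu(int vgprs_per_thread)
    {
        int by_regs = (256 / vgprs_per_thread) * 4;
        return by_regs < 40 ? by_regs : 40;   /* 40 = per-CU global limit */
    }

    int main(void)
    {
        printf("64 VGPRs  -> %d waves/CU\n", max_waves_per_cu(64));  /* 16 */
        printf("120 VGPRs -> %d waves/CU\n", max_waves_per_cu(120)); /*  8 */
        printf("25 VGPRs  -> %d waves/CU\n", max_waves_per_cu(25));  /* 40 */
        return 0;
    }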

Regards,

Thank you, dipak!

You are right, I was confusing SIMDs and CUs, thank you!

Still, each CU has 4 16-wide SIMD units. With 4 waves of 64 threads per CU, each SIMD can take an instruction from a different wave and execute it for the 64 threads in the wave over 4 cycles. That looks like full load to me. Why should 8 waves be better than 4 at all?
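Here is the back-of-envelope arithmetic behind that reasoning (a sketch of the issue model as I understand it, not a hardware specification):

    /* Back-of-envelope for the reasoning above: a 64-lane wave on a
       16-wide SIMD takes 64/16 = 4 cycles per VALU instruction, so with
       one wave resident per SIMD, a new VALU op can issue exactly when
       the previous one finishes -- pure VALU code already looks full. */
    #include <stdio.h>

    int main(void)
    {
        const int wave = 64, simd_width = 16, simds_per_cu = 4;
        printf("cycles per VALU op: %d\n", wave / simd_width);      /* 4 */
        printf("lane-ops per cycle per CU: %d\n",
               simds_per_cu * simd_width);                          /* 64 */
        printf("waves to fill a CU (1 per SIMD): %d\n", simds_per_cu);
        return 0;
    }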

I've attached an example kernel that is as simple as I could make it. With this example, I get about a 5% performance increase with 8 waves per CU over 4, and about a 10% increase with 16 waves per CU.


To clarify, I'm asking why 8 waves/CU would be better than 4 waves/CU when there are no memory reads and the like, just a straight-up list of VALU instructions with some minor scalar instructions every now and then (like the loop flow control in my example above). Thank you very much!


I understand your point. I have a question, though: how did you measure the performance gain? If the measurement includes thread launch time (for example, by using a blocking call to clEnqueueNDRangeKernel), the effective performance can improve slightly (depending on kernel execution time) when more threads are launched at once.

I would suggest running the application with CodeXL and analyzing the performance counters and execution timeline to verify/justify the performance gain.
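Alternatively, OpenCL event profiling can isolate kernel execution time from launch overhead. A sketch, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that the kernel and work sizes are already set up:

    /* Time only the kernel's execution on the device, excluding
       enqueue/launch overhead, via OpenCL event profiling. */
    #include <CL/cl.h>
    #include <stdio.h>

    void time_kernel(cl_command_queue queue, cl_kernel kernel,
                     size_t gws, size_t lws)
    {
        cl_event ev;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &lws,
                               0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong t0, t1;   /* device timestamps in nanoseconds */
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof t0, &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof t1, &t1, NULL);
        printf("kernel time: %.3f ms\n", (t1 - t0) * 1e-6);
        clReleaseEvent(ev);
    }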

Great, thank you for your answer.

You're right, I measured the performance using a blocking call. It's not exactly clEnqueueNDRangeKernel(), since I'm using ROCm, but it is a blocking call all the same. I'm seeing performance increases as I increase the global work size, though with diminishing returns, so your thread-spawning suggestion may very well be correct.

As for CodeXL, I should definitely use it. There's some minor header conflict that keeps the latest CodeXL from compiling with ROCm out of the box. I'll have to figure that out first (or wait for a fix, or try it on Windows and/or without ROCm), but that's another story.

One way or another, thank you for your help! Things make more sense to me now.


Hi,

Quite a long time ago, I tested out various instruction streams on an HD 7970.

The one you are testing is the simplest: it contains only 32-bit vector instructions.

I tested with the same kind of long-running kernels as you did, and limited the number of threads per CUStream via the VGPR count. My result for this case (32-bit VALU instructions) was that 1 thread per CUStream is enough to get 100% utilization.

The only difference I can think of is this:

4-5 years ago, when you got close to 100% utilization on the GPU, especially with long-running kernels, the driver wasn't able to refresh the Windows desktop either. Everything looked as if the computer had frozen. I think the driver now gives more priority to the OS's graphics tasks, so if you pump in more wavefronts, your work gets more attention.

But this is only my guess.


Awesome. Thanks, realhet! I’m on ROCm with a Hawaii card at the moment, but it’s good to know that one thread per CUStream can achieve 100% utilization.

I’ll try it on a 7970 and on Windows, and see what happens.
