I have a kernel that runs a large number of iterations (8 million or so, due to slow convergence) of pretty straightforward VALU instructions (float32) on some simple input data. The kernel is a loop whose body is about 200 instructions long, not unrolled or anything. There are no memory reads/writes in the loop body, only some minor loads and stores at the beginning and at the end of the kernel.
The kernel uses exactly 64 vector registers, v0-v63, no LDS, and very few scalar registers. On an R9 290X (Hawaii), when I launch a grid of 64 threads/wave * 44 CUs * 4 waves/CU (in the x dimension, y=z=1) with workgroup size = 64 to match the wave size, I get some baseline performance of N loop-body iterations per second (the exact N doesn't matter). When I launch a grid of 64*44*8 total threads (eight instead of four), I get about a 20% performance increase. The results appear correct in both cases.
I don't see a reason for this performance gain, so maybe I'm missing something fundamental about GCN?
To my understanding, 64 VGPRs should limit my code to 4 simultaneous waves per CU, so the entire machine with 44 CUs should already be running as many waves at once as it can with a total of 64*44*4 threads. If I'm interpreting things correctly, a 64*44*8 grid should not be of any benefit here, since the waves/workgroups past 44*4 will only start running as some of the first 44*4 waves finish.
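To make my arithmetic explicit, here's how I'm computing the expected occupancy. This is just a sketch of my own reasoning, and the 256-VGPR pool I'm dividing by is my reading of the docs, which may be exactly the thing I'm getting wrong:

```python
# My occupancy arithmetic (assumptions flagged; any of these may be my mistake)
VGPR_POOL = 256      # VGPRs I assume are available to split among resident waves
KERNEL_VGPRS = 64    # my kernel uses v0-v63
WAVE_SIZE = 64       # threads per wave on GCN
NUM_CUS = 44         # Hawaii / R9 290X

waves_per_cu = VGPR_POOL // KERNEL_VGPRS            # 4 waves, by my reckoning
machine_threads = WAVE_SIZE * NUM_CUS * waves_per_cu  # the "full machine" grid

print(waves_per_cu, machine_threads)  # 4 11264
```

By this arithmetic, 64*44*4 = 11264 threads should saturate the machine, which is exactly why the gain from 22528 threads confuses me.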
I've read the ISA docs, the OpenCL optimization guide, and everything else I could find on the subject. Larger grids should not be of help, right?
I have no doubt that this is not the fault of the machine or the drivers, but rather that I'm missing or misunderstanding something. Why the performance gain? What am I missing here?
(This is with ROCm+AMDGPU-PRO on Ubuntu 16.04, if it matters)
Thank you in advance!