My application runs a series of 7 kernels, and most of the time is taken by the 7th kernel.
This kernel has 50% occupancy.
Card is RX 470, 4GB.
For this 7th kernel, there are two settings: the first gives me a total of 100 wavefronts,
while the second gives me a total of only 30 wavefronts.
Timing for the second setting is about 3X slower than for the first; VALU utilization is about the same.
I am guessing that the second setting is slower because 30 wavefronts are not enough to
hide memory latency. Is there a way to calculate the optimal total number of wavefronts for a kernel,
given its occupancy and the number of CUs?
As a follow up question, and pardon me if I'm wrong, but doesn't 30 waves on a machine with 32 CUs (RX 470) mean that there's no memory latency hiding at all?
Say the first 30 CUs pick up one wave each (two CUs sit idle), and one SIMD unit per CU works on the wave it picked up, processing the 64-wide wave as 4 x 16-wide batches over 4 cycles. When it stalls on a memory access, what is there to switch to in order to hide the latency? (Similar logic applies if 4 SIMDs run 4 waves at once on one CU, I think.)
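To make the concern concrete, here's a rough sanity check in Python, assuming the GCN layout described above (32 CUs with 4 SIMDs each, and the rule of thumb that a SIMD needs at least 2 resident waves to have something to switch to during a stall); the exact numbers are my assumptions, not measurements:

```python
# Rough latency-hiding sanity check (assumed GCN layout: 32 CUs x 4 SIMDs;
# a SIMD needs >= 2 resident waves to have another wave to switch to
# when the current one stalls on memory).
CUS = 32
SIMDS_PER_CU = 4
total_waves = 30

waves_per_simd = total_waves / (CUS * SIMDS_PER_CU)
print(f"{waves_per_simd:.2f} waves per SIMD")  # well under 1: nothing to switch to
```

With fewer than one wave per SIMD on average, a stalled SIMD has no second wave to schedule, so memory latency goes entirely unhidden.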
Thanks. Yes, that makes sense. This would explain why performance is so poor with only 30 wavefronts.
Given that each CU can run at most 10 wavefronts, and occupancy is 0.37, I guess the optimal number of wavefronts
is at least 32 * 3.7 ≈ 118, i.e. roughly 120 wavefronts.
The situation is more complex because of the 6 other kernels that could also be running on a CU.
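Spelled out as a back-of-envelope calculation, using the numbers from this thread (10-wave-per-CU maximum, 0.37 occupancy, 32 CUs):

```python
# Back-of-envelope minimum wave count to keep every CU fed,
# using the numbers from the thread (these are per-workload figures,
# not universal constants).
MAX_WAVES_PER_CU = 10
NUM_CUS = 32
occupancy = 0.37

min_waves = NUM_CUS * occupancy * MAX_WAVES_PER_CU
print(f"aim for at least ~{min_waves:.1f} total wavefronts")
```

Anything well below that figure leaves some SIMDs with nothing resident to switch to, which matches the 3X slowdown seen at 30 wavefronts.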
Try setting the environment variable GPU_WAVE_LIMIT_ENABLE=1 before starting the program; this should enable the adaptive wave limiter in AMD's OpenCL driver. It's showing some positive difference for me, but results may depend on the GPU model and workload.
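For reference, setting it from a shell looks like this (`./myapp` is a placeholder for your actual binary, not a name from the thread):

```shell
# Enable the adaptive wave limiter in AMD's OpenCL driver for the
# processes launched from this shell. "./myapp" is a placeholder.
export GPU_WAVE_LIMIT_ENABLE=1
# ./myapp
```

You can also set it for a single run without exporting: `GPU_WAVE_LIMIT_ENABLE=1 ./myapp`.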
I found these flags while browsing the open-source driver, but so far the one mentioned above was the only one that really helped me: ROCm-OpenCL-Runtime/flags.hpp at master · RadeonOpenCompute/ROCm-OpenCL-Runtime · GitHub