My application runs a series of 7 kernels, and most of the time is taken by the 7th kernel.
This kernel has 50% occupancy.
The card is an RX 470 (4 GB).
For this 7th kernel, there are two settings: the first gives me a total of 100 wavefronts,
while the second gives me a total of only 30 wavefronts.
The second setting is about 3x slower than the first, while VALU utilization is about the same.
I am guessing that the second setting is slower because 30 wavefronts are not enough to
hide memory latency. Is there a way to calculate the optimal total number of wavefronts for a kernel,
given the occupancy and the number of CUs?
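
To make the question concrete, here is a rough sketch of the arithmetic I have in mind. It assumes figures that are not measured in my runs: 32 CUs on the RX 470, 4 SIMDs per CU, and a GCN limit of 10 wavefronts per SIMD. The function names are made up for illustration, and the result is only an estimate of how many wavefronts can be resident at once, not a proven optimum.

```python
# Assumed hardware figures (not measured; taken as typical for RX 470 / GCN).
NUM_CUS = 32             # assumption: compute units on the RX 470
SIMDS_PER_CU = 4         # assumption: SIMD units per CU
MAX_WAVES_PER_SIMD = 10  # assumption: wavefront slots per SIMD at 100% occupancy


def resident_capacity(occupancy, num_cus=NUM_CUS):
    """Wavefronts that can be resident on the whole GPU at the given occupancy."""
    waves_per_simd = int(round(occupancy * MAX_WAVES_PER_SIMD))
    return num_cus * SIMDS_PER_CU * waves_per_simd


def fill_fraction(total_wavefronts, occupancy, num_cus=NUM_CUS):
    """Fraction of the resident wavefront slots that a launch can fill."""
    return min(1.0, total_wavefronts / resident_capacity(occupancy, num_cus))


capacity = resident_capacity(0.5)
for total in (100, 30):
    # Print launch size, resident capacity, and the fraction of slots filled.
    print(total, capacity, fill_fraction(total, 0.5))
```

Under these assumptions, 50% occupancy leaves room for 640 resident wavefronts, so neither 100 nor 30 wavefronts comes close to filling the GPU, and with 30 wavefronts at least two of the 32 CUs would receive no work at all. That seems consistent with my guess, but it does not by itself say what the optimal number is.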