I reduced the number of VGPRs from 88 to 84, and the number of wavefronts per compute unit increased from 8 to 12. However, I cannot see any performance gain. Reducing the VGPR count should not slow down each work item, so it seems that higher occupancy does not always improve performance. Any idea why?
According to some online material, each compute unit has 4 SIMDs, and each SIMD can run 1 wavefront at a time. Does that mean each compute unit can run at most 4 wavefronts concurrently, so that scheduling more than 4 wavefronts on one compute unit won't improve performance?
In GCN, each SIMD can have up to 10 in-flight (active) wavefronts, so a CU can hold up to 40 active wavefronts in total. A SIMD issues instructions from one wavefront at a time, but the other resident wavefronts give the scheduler something to switch to while one is waiting on memory. In general, a higher number of active wavefronts (higher occupancy) helps to hide memory latency and thus improves overall performance. The suitable value depends on multiple factors such as ALU and memory usage, memory bandwidth, and application logic. For example, higher occupancy is likely to help a memory-heavy application more than an ALU-bound one. If increasing the occupancy does not improve performance, the GPU most likely already has enough active wavefronts to hide the latency. As the AMD OpenCL optimization guide says:
Increasing the wavefronts/compute unit does not indefinitely improve performance; once the GPU has enough wavefronts to hide latency, additional active wavefronts provide little or no performance benefit. A closely related metric to wavefronts/compute unit is “occupancy,” which is defined as the ratio of active wavefronts to the maximum number of possible wavefronts supported by the hardware.
For more information, please refer to the "OpenCL Optimization" section of the ROCm documentation.
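To see why going from 88 to 84 VGPRs moves you from 8 to 12 wavefronts per CU, here is a rough sketch of the VGPR-limited occupancy arithmetic. It is not an official AMD formula: it assumes 256 VGPRs available per SIMD lane and an allocation granularity of 4 registers (typical GCN values, but check your hardware), and it ignores SGPR and LDS limits, which can lower the result further. The function names are made up for illustration.

```python
# Rough VGPR-limited occupancy estimate for a GCN-style CU.
# ASSUMPTIONS (not from this thread): 256 VGPRs per SIMD lane and a
# VGPR allocation granularity of 4. SGPR and LDS limits are ignored.
VGPRS_PER_SIMD = 256      # assumed register budget per SIMD lane
VGPR_GRANULE = 4          # assumed allocation granularity
SIMDS_PER_CU = 4          # from the thread: 4 SIMDs per compute unit
MAX_WAVES_PER_SIMD = 10   # from the thread: up to 10 active waves per SIMD


def waves_per_cu(vgprs_per_work_item: int) -> int:
    """Estimate active wavefronts per CU when VGPRs are the only limit."""
    if vgprs_per_work_item <= 0:
        raise ValueError("VGPR count must be positive")
    # Round the request up to the next multiple of the granule.
    granules = -(-vgprs_per_work_item // VGPR_GRANULE)
    allocated = granules * VGPR_GRANULE
    # Waves that fit in the register file, capped by the hardware maximum.
    per_simd = min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)
    return per_simd * SIMDS_PER_CU


def occupancy(vgprs_per_work_item: int) -> float:
    """Ratio of active wavefronts to the maximum the CU supports."""
    return waves_per_cu(vgprs_per_work_item) / (MAX_WAVES_PER_SIMD * SIMDS_PER_CU)


if __name__ == "__main__":
    for vgprs in (88, 84):
        print(vgprs, waves_per_cu(vgprs), occupancy(vgprs))
```

Under these assumptions, 88 VGPRs fit 2 waves per SIMD (8 per CU) and 84 fit 3 (12 per CU), which matches the numbers in the question. That is an occupancy change from 20% to 30%, and whether it helps depends on whether the kernel was latency-bound in the first place.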