I reduced the number of VGPRs from 88 to 84, and the number of wavefronts per compute unit increased from 8 to 12. However, I do not see any performance gain. Reducing VGPR usage should not slow down each work item, so it seems that higher occupancy does not always improve performance. Any idea why?
According to some online material, each compute unit has 4 SIMDs and each SIMD can run 1 wavefront at a time. Does that mean each compute unit can run at most 4 wavefronts concurrently, and that scheduling more than 4 wavefronts on one compute unit won't improve performance?
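
For reference, here is a minimal sketch of how I believe the 8 and 12 figures come about. Only the 4 SIMDs per compute unit and the two VGPR counts come from the numbers above. The 256-register budget per SIMD lane, the allocation granularity of 4, and the cap of 10 wavefronts per SIMD are assumptions based on a GCN-style design and may not match your hardware. The function name `waves_per_cu` is made up for illustration.

```python
# Assumed GCN-style limits. Only SIMDS_PER_CU comes from the post;
# the other three values are assumptions and should be checked
# against the documentation for the actual GPU.
SIMDS_PER_CU = 4            # from the post
VGPRS_PER_SIMD_LANE = 256   # assumption: VGPR budget per lane on each SIMD
VGPR_GRANULARITY = 4        # assumption: VGPRs are allocated in blocks of 4
MAX_WAVES_PER_SIMD = 10     # assumption: hardware cap on resident wavefronts


def waves_per_cu(vgprs_per_work_item: int) -> int:
    """Upper bound on resident wavefronts per CU from VGPR usage alone.

    Other limits (SGPRs, LDS, work-group size) are ignored here.
    """
    if vgprs_per_work_item <= 0:
        raise ValueError("vgprs_per_work_item must be positive")
    # Round the request up to the allocation granularity.
    blocks = -(-vgprs_per_work_item // VGPR_GRANULARITY)
    allocated = blocks * VGPR_GRANULARITY
    # Each resident wavefront on a SIMD needs its own slice of the register file.
    per_simd = min(VGPRS_PER_SIMD_LANE // allocated, MAX_WAVES_PER_SIMD)
    return per_simd * SIMDS_PER_CU


for vgprs in (88, 84):
    print(f"{vgprs} VGPRs -> {waves_per_cu(vgprs)} wavefronts per CU")
```

Under these assumptions, 88 VGPRs fit only 2 wavefronts per SIMD (256 // 88 = 2) while 84 fit 3 (256 // 84 = 3), which would explain the jump from 8 to 12 per compute unit. Note that this only bounds how many wavefronts can be resident at once. It says nothing about how many issue instructions in the same cycle.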