This is a very fascinating possibility. I was just wondering if I could optimize my code by promoting SGPRs to VGPRs.
The VALU and SALU pipelines are at least 4 stages long, so you're going to need those 4 waves.
From the AMD GCN whitepaper:
"Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations,
with full speed denormals and all rounding modes. Each lane can natively executes a single precision fused or unfused multiply-add or a 24-bit integer
operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle,
but takes 4 cycles to execute operations for all 64 work items"
So with just one wave, work on 16 items starts on the first cycle, the next 16 are started on the 2nd cycle (while SIMD1 starts its first 16 items), then another 16 are started on the 3rd cycle, and the last 16 are started on the 4th cycle. On the 5th cycle the SIMD is ready to receive a new instruction (except for those instructions that take more than 4 cycles).
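
To make the cycle counting above concrete, here is a tiny toy model in Python. It is only a sketch: the names `issue_schedule` and `next_issue_cycle` are made up for illustration, and the assumption that work-items go through in index order (0-15 first, then 16-31, and so on) is mine. The whitepaper quote only says that 16 lanes handle 64 work-items over 4 cycles. It also ignores pipeline latency and the other SIMDs.

```python
WAVE_SIZE = 64   # work-items per wavefront (from the whitepaper quote)
SIMD_LANES = 16  # lanes in one SIMD's vector pipeline (from the quote)


def issue_schedule(wave_size=WAVE_SIZE, lanes=SIMD_LANES):
    """Return (cycle, first_item, last_item) for one vector instruction.

    Assumption: work-items are started in index order, one group of
    `lanes` items per cycle, with cycles counted from 1.
    """
    schedule = []
    for cycle, first in enumerate(range(0, wave_size, lanes), start=1):
        last = min(first + lanes, wave_size) - 1
        schedule.append((cycle, first, last))
    return schedule


def next_issue_cycle(wave_size=WAVE_SIZE, lanes=SIMD_LANES):
    """First cycle on which the SIMD can accept a new instruction,
    assuming the current one needs exactly one pass over the wave."""
    return len(issue_schedule(wave_size, lanes)) + 1


if __name__ == "__main__":
    for cycle, first, last in issue_schedule():
        print(f"cycle {cycle}: start work-items {first}-{last}")
    print(f"cycle {next_issue_cycle()}: SIMD can take a new instruction")
```

With the default values this gives four groups of 16 work-items on cycles 1 to 4, and the SIMD becomes free on cycle 5, which matches the description above for instructions that take no more than 4 cycles.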