

nerdralph
Adept II

peak GCN performance possible with 1 wave

Everything I've read so far about wave occupancy suggests (or even explicitly states) that a minimum of 4 waves in flight is required for full VALU occupancy on GCN. After scrutinizing documentation and code, I've come to the conclusion that full VALU utilization can be obtained with just one wave. That only holds for kernels that execute nothing but vector instructions, so for practical purposes the minimum is 2 waves.

Nerd Ralph: Inside AMD GCN code execution

0 Likes
3 Replies
meriken
Adept III

This is a fascinating possibility. I was just wondering if I could optimize my code by promoting SGPRs to VGPRs.

0 Likes
realhet
Miniboss

Hi,

The VALU and SALU pipelines are at least 4 stages long, so you're going to need those 4 waves.

0 Likes

From the AMD GCN whitepaper:

"Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations, with full speed denormals and all rounding modes. Each lane can natively executes a single precision fused or unfused multiply-add or a 24-bit integer operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle, but takes 4 cycles to execute operations for all 64 work items"

So with just one wave, work on 16 items starts on the first cycle, the next 16 are started on the 2nd cycle (while SIMD1 starts its first 16 items), then another 16 are started on the 3rd cycle, and the last 16 are started on the 4th cycle.  On the 5th cycle the SIMD is ready to receive a new instruction (except for those instructions that take more than 4 cycles).
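
To make that schedule concrete, here is a rough Python sketch of the timing described above. It is a toy model, not a description of the real hardware scheduler: the names `quarter_wave_timeline` and `valu_utilization` are made up for illustration, and it assumes one issue slot per SIMD every 4 cycles, at most one vector and one scalar instruction issued per slot, at most one instruction per wave per slot, and a fixed wave priority.

```python
WAVE_SIZE = 64   # work items per wavefront
LANES = 16       # vector lanes per SIMD
CYCLES_PER_INSTR = WAVE_SIZE // LANES  # 4 cycles per vector instruction


def quarter_wave_timeline(num_instructions):
    """List (cycle, instruction, first_item, last_item) for one wave on one SIMD.

    Cycles are numbered from 1, matching the description in the post.
    """
    rows = []
    for cycle in range(num_instructions * CYCLES_PER_INSTR):
        instr, quarter = divmod(cycle, CYCLES_PER_INSTR)
        first = quarter * LANES
        rows.append((cycle + 1, instr, first, first + LANES - 1))
    return rows


def valu_utilization(streams, slots):
    """Fraction of issue slots in which the VALU receives an instruction.

    streams: one string per wave, 'v' = vector op, 's' = scalar op,
             repeated cyclically (an assumption made for simplicity).
    slots:   number of 4-cycle issue slots to simulate.
    """
    pcs = [0] * len(streams)
    busy = 0
    for _ in range(slots):
        valu_taken = False
        salu_taken = False
        for w, stream in enumerate(streams):  # fixed priority: wave 0 first
            op = stream[pcs[w] % len(stream)]
            if op == "v" and not valu_taken:
                valu_taken = True
                pcs[w] += 1
            elif op == "s" and not salu_taken:
                salu_taken = True
                pcs[w] += 1
            # otherwise the wave stalls until the next slot
        busy += valu_taken
    return busy / slots


if __name__ == "__main__":
    for row in quarter_wave_timeline(2):
        print("cycle %d: instruction %d, work items %d-%d" % row)
    print(valu_utilization(["v"], 8))         # one wave, vector-only
    print(valu_utilization(["vs"], 8))        # one wave, alternating vector/scalar
    print(valu_utilization(["vs", "vs"], 8))  # two such waves
```

Under these assumptions a single vector-only wave keeps the VALU fed in every slot, a single wave that alternates vector and scalar instructions leaves it idle half the time, and a second wave fills those gaps. The model ignores instruction latency beyond 4 cycles, memory operations, and dependencies.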

0 Likes