3 Replies Latest reply on Feb 24, 2017 12:24 PM by nerdralph

    peak GCN performance possible with 1 wave

    nerdralph

      Everything I've read so far about wave occupancy suggests (or even explicitly states) that a minimum of 4 waves in flight is required for full VALU occupancy on GCN.  After scrutinizing documentation and code, I've come to the conclusion that full VALU utilization can be obtained with just one wave.  This is only possible for kernels executing only vector instructions, so for practical purposes the minimum is 2 waves.

      Nerd Ralph: Inside AMD GCN code execution

        • Re: peak GCN performance possible with 1 wave
          meriken

          This is a fascinating possibility. I was just wondering if I could optimize my code by promoting SGPRs to VGPRs.

          • Re: peak GCN performance possible with 1 wave
            realhet

            Hi,

            The VALU and SALU pipelines are at least 4 stages long, so you're going to need those 4 waves.

              • Re: peak GCN performance possible with 1 wave
                nerdralph

                From the AMD GCN whitepaper:

                "Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations, with full speed denormals and all rounding modes. Each lane can natively executes a single precision fused or unfused multiply-add or a 24-bit integer operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle, but takes 4 cycles to execute operations for all 64 work items"

                 

                So with just one wave, work on 16 items starts on the first cycle, the next 16 are started on the 2nd cycle (while SIMD1 starts its first 16 items), then another 16 are started on the 3rd cycle, and the last 16 are started on the 4th cycle.  On the 5th cycle the SIMD is ready to receive a new instruction (except for those instructions that take more than 4 cycles).
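The cycle-by-cycle picture described above can be sketched as a small simulation. This is only an illustration of that timing model, not a statement about the real hardware: it assumes 4 SIMDs per compute unit issued round-robin, back-to-back vector instructions that each take exactly 4 cycles, and no dependency or pipeline latency between them (which is the point under dispute in this thread). The names `schedule` and `utilization` are hypothetical helpers made up for the example.

```python
WAVE_SIZE = 64   # work items per wavefront (from the whitepaper quote)
SIMD_LANES = 16  # lanes per SIMD vector pipeline (from the whitepaper quote)
NUM_SIMDS = 4    # assumption: SIMDs per compute unit, issued round-robin

# A 64-item wave on a 16-lane SIMD needs 4 passes, one per cycle.
CYCLES_PER_INSTR = WAVE_SIZE // SIMD_LANES


def schedule(num_instructions, simd=0):
    """List (cycle, instruction, first_item, last_item) for ONE wave on one SIMD.

    Cycles are 1-based. SIMD n gets its first instruction on cycle n + 1,
    and, under the assumed model, a new instruction every 4 cycles after that.
    """
    if not 0 <= simd < NUM_SIMDS:
        raise ValueError("simd must be in range(NUM_SIMDS)")
    rows = []
    for instr in range(num_instructions):
        issue_cycle = 1 + simd + instr * CYCLES_PER_INSTR
        for quarter in range(CYCLES_PER_INSTR):
            first = quarter * SIMD_LANES
            rows.append((issue_cycle + quarter, instr, first, first + SIMD_LANES - 1))
    return rows


def utilization(rows):
    """Fraction of cycles, from first to last, in which the SIMD lanes are busy."""
    busy = {cycle for cycle, _, _, _ in rows}
    span = max(busy) - min(busy) + 1
    return len(busy) / span


if __name__ == "__main__":
    for cycle, instr, first, last in schedule(2):
        print(f"cycle {cycle}: instruction {instr}, work items {first}-{last}")
    print("VALU utilization:", utilization(schedule(2)))
```

Under these assumptions the single wave keeps its SIMD busy on every cycle. If a dependent instruction could not issue until some cycles after the previous one finished, gaps would appear in the schedule and more waves would be needed to fill them.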