7 Replies Latest reply on Jan 21, 2014 8:45 AM by arvin99

    How GCN scheduling work??

    arvin99

      Hi, I am confused about how GCN scheduling works.

      Let's say I define a work-group size of 32 and a global size of 262144 work-items in OpenCL. Then over four clock cycles 32 PEs (processing elements) do work (16 PEs in the first clock cycle and 16 PEs in the second), and the other PEs are idle for that single wavefront, right?

      The total number of wavefronts executed will be 8192.

      How about GCN? If I define a work-group size of less than 64, are there idle PEs, or are they dynamically given work from the next wavefront?

      Can someone explain this to me using these two images?


      a) Taken from AnandTech Portal | AMD's Graphics Core Next Preview: AMD's New GPU, Architected For Compute

          Wavefront Execution Example: SIMD vs. VLIW. Not To Scale - Wavefront Size 16

      http://images.anandtech.com/doci/4455/Wavefront.png


      b) Taken from AMD: Southern Islands (7*** series) Speculation/ Rumour Thread - Page 33 - Beyond3D Forum

          Copyright to Hiroshige Goto

        • Re: How GCN scheduling work??
          developer

          Workgroup sizes < 64 will result in Idle Cycles.

          - Bruhapati
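The idle-cycle claim above can be checked with a little arithmetic. The following is only a sketch: it assumes a fixed wavefront width of 64 and that a wavefront never spans two work-groups, and the helper name `wavefront_usage` is made up for illustration.

```python
import math

WAVEFRONT_SIZE = 64  # GCN wavefront width, as stated in this thread


def wavefront_usage(global_size, local_size, wavefront_size=WAVEFRONT_SIZE):
    """Return (work_groups, wavefronts, lane_utilization) for a 1-D launch.

    Hypothetical helper, not part of any OpenCL API.
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work-group size")
    work_groups = global_size // local_size
    # Assumption: each work-group rounds up to whole wavefronts on its own.
    waves_per_group = math.ceil(local_size / wavefront_size)
    wavefronts = work_groups * waves_per_group
    # Fraction of wavefront lanes that carry a real work-item.
    utilization = global_size / (wavefronts * wavefront_size)
    return work_groups, wavefronts, utilization


if __name__ == "__main__":
    for local in (16, 32, 64):
        print(local, wavefront_usage(262144, local))
```

With the numbers from the question (262144 work-items, work-group size 32) this gives 8192 wavefronts with half of their lanes unused, while a work-group size of 64 halves the wavefront count and fills every lane.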

          • Re: How GCN scheduling work??
            realhet

            The second image is fairly accurate. You can see that a wavefront (which always has 64 work-items, even when you specify only 32) has a 4-cycle latency and is processed in 4 clocks, 16 work-items at a time. There is also a slow double-precision instruction which takes 16 clocks. Notice that the 4 SIMD units aren't in sync: they follow each other with 1 clock of latency. That's because of the scalar ALU, which is not shown in the image. The scalar ALU starts an instruction every clock for a different vector ALU, and 4 cycles later it gives the result back. The scalar ALU is also 4x pipelined, and over every 4 clocks it serves all 4 vector SIMD ALUs.
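The staggered issue pattern described above can be pictured with a toy timeline. This is a simplified model built only from the numbers in this post (4 SIMDs, 16 lanes each, 64-wide wavefronts, one issue per clock in round-robin order); it ignores memory latency, multiple wavefronts per SIMD and the 16-clock double-precision case, and the function names are invented for illustration.

```python
LANES_PER_SIMD = 16   # work-items one SIMD handles per clock (from the post)
SIMDS_PER_CU = 4      # vector SIMDs served by one scheduler (from the post)
WAVEFRONT_SIZE = 64
PASSES = WAVEFRONT_SIZE // LANES_PER_SIMD  # 4 clocks per vector instruction


def simd_issued_at(cycle):
    """Which SIMD receives a new instruction on this clock (round-robin)."""
    return cycle % SIMDS_PER_CU


def work_items_in_flight(cycle, simd):
    """Work-items of its wavefront that `simd` processes on `cycle`.

    Returns None while the SIMD has not been issued anything yet.
    """
    started = cycle - simd  # SIMD k runs k clocks behind SIMD 0
    if started < 0:
        return None
    first = (started % PASSES) * LANES_PER_SIMD
    return range(first, first + LANES_PER_SIMD)


if __name__ == "__main__":
    for cycle in range(8):
        row = [work_items_in_flight(cycle, s) for s in range(SIMDS_PER_CU)]
        cells = ["idle" if r is None else f"{r.start}-{r.stop - 1}" for r in row]
        print(f"clock {cycle}: issue to SIMD {simd_issued_at(cycle)} | " + " | ".join(cells))
```

In this model a wavefront with only 32 useful work-items still occupies its SIMD for all 4 clocks; the last two passes simply run with masked-off lanes.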

            If you look at one clock line, you can sum up the performance of the whole unit: there you can issue 16*4 float32 MulAdds, that's 16*4*2 = 128 flops. With the scalar ALU you can do an additional 64-bit integer operation too. There are 32 compute units in Tahiti, so the whole card does 128*32 = 4096 flops/cycle (plus 32 int64 ops/cycle). The default clock is 925 MHz, so the total performance is 4096*925 MHz = 3.7888 TFLOPS plus 29.6 Gops/s of int64 operations.
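The peak-throughput arithmetic above can be reproduced in a few lines. This sketch assumes the 32 units are Tahiti's compute units, each holding four 16-lane SIMDs plus one scalar ALU, and uses the 925 MHz default clock quoted in the post; the constant names are chosen here for readability only.

```python
LANES_PER_SIMD = 16     # float32 lanes per SIMD
SIMDS_PER_UNIT = 4      # SIMDs per compute unit
FLOPS_PER_MULADD = 2    # one multiply plus one add
COMPUTE_UNITS = 32      # Tahiti, per the post
CLOCK_MHZ = 925         # default engine clock, per the post

# Vector throughput per clock.
flops_per_unit_per_cycle = LANES_PER_SIMD * SIMDS_PER_UNIT * FLOPS_PER_MULADD
flops_per_cycle = flops_per_unit_per_cycle * COMPUTE_UNITS

# MFLOPS -> TFLOPS is a division by 1e6; Mops/s -> Gops/s is a division by 1e3.
peak_tflops = flops_per_cycle * CLOCK_MHZ / 1_000_000
scalar_gops = COMPUTE_UNITS * CLOCK_MHZ / 1_000  # one scalar op per unit per clock

if __name__ == "__main__":
    print(flops_per_unit_per_cycle, flops_per_cycle, peak_tflops, scalar_gops)
```

These are theoretical peaks: they count every lane issuing a MulAdd on every clock, which real kernels rarely sustain.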

             

            32 workitems in a workgroup -> that's way too low.

            To reach the performance above on GCN, you must give it at least number_of_stream_processors (32*64 = 2048 on Tahiti) * 4 work-items, i.e. 8192. And you must have 64, 128, 192 or 256 work-items in a work-group. That's the minimum needed to keep every execution unit working.
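As a rough sanity check before launching a kernel, the two rules above can be written down as follows. This is a sketch under the numbers given in this post (2048 stream processors, a factor of 4, work-group sizes of 64 to 256 in steps of 64); the helper names are hypothetical, and on real hardware the limits should be queried from the device rather than hard-coded.

```python
WAVEFRONT_SIZE = 64
MAX_WORK_GROUP_SIZE = 256   # upper bound used in the post


def min_global_work_items(stream_processors=2048, factor=4):
    """Smallest global size that can keep every lane busy, per the post."""
    return stream_processors * factor


def fills_whole_wavefronts(local_size):
    """True if the work-group size is 64, 128, 192 or 256."""
    return 0 < local_size <= MAX_WORK_GROUP_SIZE and local_size % WAVEFRONT_SIZE == 0


if __name__ == "__main__":
    print(min_global_work_items())
    print([n for n in range(1, 257) if fills_whole_wavefronts(n)])
```

The 262144 work-items from the original question are well above this minimum, so only the work-group size of 32 is holding the hardware back.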

              • Re: How GCN scheduling work??
                arvin99

                Thanks for reply, realhet.

                I know that I should give each work-group a large number of work-items, in a multiple of 64.

                But what I actually want to know about is dynamic scheduling.

                If I define a work-group size of 16, are there idle ALUs or not?

                In VLIW, if I define a work-group size of 32 and a global size of 262144 work-items in OpenCL, then over four clock cycles 32 PEs (processing elements) do work (16 PEs in the first clock cycle and 16 PEs in the second) and the other PEs are idle for that single wavefront. The total number of wavefronts executed will be 8192.

                How about GCN?