9 Replies Latest reply on Nov 13, 2009 7:33 PM by CaptainN

    Compute Shader scheduling

    DTop
About deterministic scheduler behavior, wavefront runs, and LDS access.

I have read many threads about the topic here, and probably one of the best is this one:

      http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=99919

However, a number of simple questions remain outstanding:

       

a) It seems conclusive that wavefronts execute within the Thread Group. The Thread Group size is defined by the gridBlock.width parameter of the CALprogramGrid structure, and the number of Thread Groups is the execution domain size (in pixels) divided by the Thread Group size.

b) If the Thread Group size (set in the kernel and in gridBlock.width) is twice the number of actual execution units (for 7xx the number of execution units seems to be 64), will the Thread Group queue 2 wavefronts on the same SIMD, still within the same Thread Group, without interruption?

c) If fence_ works per Group, and the Group size is more than the available execution units, so execution is split into 2 wavefronts (case b above), will the first wavefront be deferred until the second wavefront reaches the barrier before it can continue? Or is it simply incorrect to set a Group size greater than the actual execution units per SIMD?

d) If the wavefront size is set to ½ of the execution units of a SIMD, will half of the SIMD be wasted, or will another Group be started on the other half of the SIMD?

e) If more Groups are set than there are available SIMDs, will the groups be scheduled for execution one after another in some non-predictable order until finished?

f) Once a wavefront's execution has finished, does the LDS content remain persistent between wavefront runs, so the next Thread Group will find the LDS content from the previous wavefront and can reuse it?

        • Compute Shader scheduling
          MicahVillmow
A) All wavefronts within a thread group are executed on a single SIMD. Two wavefronts can execute in parallel (commonly referred to as the even and odd wavefronts) on a single SIMD.
B) Thread groups are scheduled on SIMDs until the SIMD cannot hold any more thread groups and then it waits for more resources to be cleared by the execution of a thread group finishing.
          C) If your group size is larger than a single wavefront, then when the first wavefront hits the barrier, it will wait for the rest of the wavefronts to hit the barrier before continuing execution.
          D) The other half of the wavefront is marked inactive and no execution will occur with those threads.
E) The order is sequential over the SIMDs before wrapping to the original SIMD. On 7XX this behavior can be modified by setting the addressing mode to wavefront absolute instead of wavefront relative, which causes only a single group to be executed per SIMD no matter what. There can only be as many groups executed per SIMD as resources allow.
F) Within a single kernel execution, LDS content is persistent. Between kernel executions, LDS content is persistent within the same command buffer, which is only guaranteed with the calCtxRunProgramGridArray API call. If this API call cannot fit all kernels in a single command buffer, then the call fails.
          • Compute Shader scheduling
            DTop

            Thank you, Micah!!!

I appreciate your time answering these questions.

             

            To clarify:

             

B) Do you mean that if 1 thread group is scheduled to 1 SIMD, and the thread group requires more resources than the SIMD can give to that group, the scheduler will wait until part of the thread group finishes first, schedule another wavefront for the same group, and not release SIMD resources until all threads are finished? Is control not returned until all threads of the given Group have finished?

             

            D) Presuming 1 SIMD has 64 execution units (7xx case):

Were you describing the case where the kernel has (pseudocode below)

            dcl_num_thread_per_group 64

             

CALProgramGrid.gridBlock::width = 64

CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

 

CALProgramGrid.gridSize::width = 1

CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

             

             

            But

CALdomain3D::width = 8

CALdomain3D::height = 4

(making the domain size equal to 32, which is half of the threads declared), so in this case will half of the SIMD be wasted?

             

            -------------------

Will the second half of the SIMD be wasted when

            1 SIMD has 64 execution units (7xx case) and

            Kernel has

            dcl_num_thread_per_group 32

             

CALProgramGrid.gridBlock::width = 32

CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

 

CALProgramGrid.gridSize::width = 1

CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

             

CALdomain3D::width = 8

CALdomain3D::height = 4

            ?

             

Or will it be able to accept another similar Thread Group, say, from another context? Even if the kernel program in the other context is different?

             

            E) What is the behavior for 8xx?

             

              • Compute Shader scheduling
                MicahVillmow
B) If your thread group requires more resources than the entirety of the SIMD, execution or compilation will fail. It is an all-or-nothing approach.

D) Yes, half of the threads will be wasted in both examples. Multiple groups are not packed into a single wavefront.

E) 8XX is a derivative of 7XX, so the behaviour should be the same, but I have not had a chance to look deeply into it yet.
                • Compute Shader scheduling
                  DTop

                  Dear Micah,

just want to nail this down:

B.1) Using the following declarations (pseudocode below)

                  dcl_num_thread_per_group 64

                   

                  CALProgramGrid.gridBlock::width = 64

                  CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

                   

                  CALProgramGrid.gridSize::width = 1024

                  CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

                   

                   

                  With

CALdomain3D::width = 256

CALdomain3D::height = 256

(making the domain Thread Group count equal to 1024).

                   

HD4600 seems to be fine (with a declaration of threads per group == 64), while the device attributes call returns a wavefrontSize equal to 32.

                   

This way 1 Thread Group has a size of 64, but there are only 32 execution units.

How does this work?

                   

B.2) Also, by saying “Thread groups are scheduled on SIMDs until the SIMD cannot hold any more thread groups and then it waits for more resources to be cleared by the execution of a thread group finishing”, do you mean that if 2 Thread Groups can fit into 1 SIMD, they will execute together?

                  For example, in case

                  dcl_num_thread_per_group 32

                   

                  CALProgramGrid.gridBlock::width = 32

                  CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

                   

                  CALProgramGrid.gridSize::width = 2048

                  CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

                   

                  With

CALdomain3D::width = 256

CALdomain3D::height = 256

(making the domain Thread Group count equal to 2048).

                   

will 2 Thread Groups execute on 1 SIMD of 7xx (64 exec. units per SIMD), unless the addressing mode is declared absolute?

                   

                  E)

In the case of this example:

                  dcl_num_thread_per_group 64

                   

                  CALProgramGrid.gridBlock::width = 64

                  CALProgramGrid.gridBlock::height = 1

CALProgramGrid.gridBlock::depth = 1

                   

                  CALProgramGrid.gridSize::width = 1024

                  CALProgramGrid.gridSize::height = 1

CALProgramGrid.gridSize::depth = 1

                   

                   

                  With

CALdomain3D::width = 256

CALdomain3D::height = 256

(making the domain Thread Group count equal to 1024).

                   

Does it mean that Thread Groups will be allocated 1 per SIMD in round-robin fashion across the SIMDs, and if one of the SIMDs takes longer to execute (due to a code branch, for example) it will slow down the whole dispatch, possibly waiting on that SIMD to finish before scheduling the rest of the SIMDs? Or does the dispatcher schedule onto SIMDs on an availability basis?