6 Replies Latest reply on Feb 15, 2014 10:45 PM by sandyandr

    GCN Scheduler scheme

    sandyandr

Hi, would anybody explain to me the exact scheduler working scheme for GCN? Especially in the situation when the workgroup size is smaller than 256 (128, 64). I understand that a CU can handle several small workgroups at the same time, but in what order? Does the scheduler pack each CU tightly, one after another, or does it distribute workgroups across the whole line of CUs for balanced memory access (I think the latter should be better)? I really need to know this (for memory access optimization purposes).

        • Re: GCN Scheduler scheme
          realhet

          Hi,


          Here are some docs:

You can check the [Southern Islands Series Instruction Set Architecture manual], especially chapter 2.

          A good block diagram of the GCN CU http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

          And here's a good 'timing diagram' of the CU: How GCN scheduling work?? | AMD Developer Forums

            • Re: GCN Scheduler scheme
              sandyandr

Thank you. I have read it, but I still don't understand. What if I have a workgroup of size {128,1,1} and my global size is {128,64,1}? I believe that on an initially idle HD7970 all these work-items should execute at once (or almost at once, because of wavefronts; as quickly as possible, anyway). I'm sure that all available SIMDs will be perfectly loaded and do their job, but, for instance, would the first and second workgroups ([0,0,0] and [0,1,0]) run on the same CU? Or would it be [0,0,0] and [0,32,0] there? Or does GCN fill CUs in some other way?

                • Re: GCN Scheduler scheme
                  realhet

                  Hi,


local size = {128,1,1} means that a workgroup contains 128 work-items.

This workgroup is made of 2 wavefronts. Both wavefronts will be scheduled on the same particular CU.

Workgroups will be scheduled one after another, but the CU that executes each of them is resolved dynamically by the scheduler (there can be more than one kernel running at a time, so it's not a deterministic process).

One CU can hold up to 10 wavefronts per SIMD in its queues, but it only executes 4 at a time in its ALUs (one per SIMD).

So the only thing you can be sure of is that a workgroup (size = 64, 128, 192 or 256) will be executed entirely on the same CU.
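The decomposition described above can be sketched with some back-of-envelope arithmetic, assuming GCN's wavefront size of 64 (the function name and structure here are illustrative, not an actual API):

```python
# Sketch of how an OpenCL NDRange decomposes into workgroups and wavefronts
# on GCN. Numbers from this thread: global {128,64,1}, local {128,1,1}.
WAVEFRONT_SIZE = 64  # GCN wavefront width

def decompose(global_size, local_size):
    # Number of workgroups along each dimension.
    groups = [g // l for g, l in zip(global_size, local_size)]
    items_per_group = 1
    for l in local_size:
        items_per_group *= l
    # Wavefronts per workgroup, rounded up to whole wavefronts.
    waves_per_group = -(-items_per_group // WAVEFRONT_SIZE)
    return groups, waves_per_group

groups, waves = decompose((128, 64, 1), (128, 1, 1))
print(groups, waves)  # [1, 64, 1] 2 -> 64 workgroups of 2 wavefronts each
```

Each of those 64 workgroups stays on one CU, but which CU gets which workgroup is decided dynamically.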

                    • Re: GCN Scheduler scheme
                      sandyandr

Thanks for the clarification. It's not good, actually. It means that only with a workgroup of 256 can I be sure that all CUs will use only the predefined memory channels I planned for them in order to avoid memory bank/channel conflicts. I know that 256 is the best workgroup size in most of my cases, but the whole amount of local memory (64K) can't be used by such a group, so it's necessary to split the group into two of 128. One of the most inconvenient constraints, I think; AMD should get rid of it in future architecture releases.
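The LDS constraint described above can be sketched numerically, assuming the usual GCN figures of 64 KiB of LDS per CU with at most 32 KiB allocatable by a single workgroup (the helper function is illustrative):

```python
# Occupancy-limit sketch: how many workgroups fit on one CU given LDS use.
# Assumed GCN figures: 64 KiB LDS per CU, max 32 KiB LDS per workgroup.
LDS_PER_CU = 64 * 1024
MAX_LDS_PER_GROUP = 32 * 1024

def groups_per_cu(lds_bytes_per_group):
    if lds_bytes_per_group > MAX_LDS_PER_GROUP:
        raise ValueError("a single workgroup cannot allocate more than 32 KiB")
    if lds_bytes_per_group == 0:
        return None  # not LDS-limited
    return LDS_PER_CU // lds_bytes_per_group

print(groups_per_cu(32 * 1024))  # 2 -> two 128-item groups sharing 64 KiB
```

This is why one 256-item workgroup cannot consume the CU's whole 64K: the per-workgroup cap forces the split into two groups of 128.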

                        • Re: GCN Scheduler scheme
                          realhet

                          Hi,


256 work-items/CU is the minimum to keep the ALUs busy.

512 is better because you'll get memory/GDS latency hiding (regardless of workgroup size).
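The 256-vs-512 point comes down to wavefronts per SIMD; a minimal sketch, assuming 4 SIMDs per CU and a wavefront width of 64:

```python
# Sketch: wavefronts per SIMD for a given number of work-items resident
# on one CU. With only 1 wave per SIMD there is no second wave to switch
# to while the first waits on memory, so latency is not hidden.
SIMDS_PER_CU = 4
WAVEFRONT_SIZE = 64

def waves_per_simd(items_on_cu):
    return (items_on_cu // WAVEFRONT_SIZE) / SIMDS_PER_CU

print(waves_per_simd(256))  # 1.0 -> every SIMD busy, no latency hiding
print(waves_per_simd(512))  # 2.0 -> two waves per SIMD can overlap latency
```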


Another thing: when you use LDS often and the workgroup size is greater than 64, you must synchronize LDS operations (barriers), which can slow down performance. That's not a problem when workGroupSize = 64.

                            • Re: GCN Scheduler scheme
                              sandyandr

You are right, of course. Sure, the actual work size rarely happens to be just 256 items/CU and nothing more.

Good point about LDS sync. In my work I don't face much of that problem: I have nothing like small groups of items storing and sharing their common data there. I fill workgroup-wide common areas of LDS seldom, and only from a few designated threads (items). But if I split a workgroup into groups of 64, I would need to duplicate such common areas, and that's usually too expensive in terms of LDS size. But that's just my case.

                              Thank you.