4 Replies Latest reply on Sep 24, 2014 2:46 AM by skanur

    Need for more workgroups

    skanur

      Hi,

       

      I have two questions that I can explain using this scenario.

       

      Scenario:

      Let's say I have a kernel that can run as a workgroup of 64 work-items, i.e. one wavefront. I get this number from the OpenCL clGetKernelWorkGroupInfo API, and I assume the API derives it from register allocation. The same API also reports the kernel's local memory usage. Dividing the total local memory (x 2 for the GCN architecture) by the kernel's local memory usage gives the maximum number of workgroups that fit on one compute unit (CU). From there I can work out how many workgroups fit on the whole GPU; let's call this number "workgroup-gpu".
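The estimate in the scenario can be written out as a few lines of arithmetic. This is only a sketch: the function names are hypothetical, the example numbers (32 KB reported local memory, 4 KB used by the kernel, 32 CUs) are assumptions rather than values from a real device, and local memory is treated as the only limiting resource.

```python
# Rough "workgroup-gpu" estimate, assuming local memory is the only limit.
# All numbers below are illustrative assumptions, not queried from a device.

def workgroups_per_cu(local_mem_per_cu_bytes, kernel_local_mem_bytes):
    """Workgroups that fit on one CU if only local memory limits residency."""
    if kernel_local_mem_bytes == 0:
        return None  # the kernel uses no local memory, so it imposes no limit
    return local_mem_per_cu_bytes // kernel_local_mem_bytes

def workgroups_per_gpu(num_cus, local_mem_per_cu_bytes, kernel_local_mem_bytes):
    """The "workgroup-gpu" number from the scenario."""
    per_cu = workgroups_per_cu(local_mem_per_cu_bytes, kernel_local_mem_bytes)
    return None if per_cu is None else per_cu * num_cus

reported_local_mem = 32 * 1024            # assumed CL_DEVICE_LOCAL_MEM_SIZE
local_mem_per_cu = 2 * reported_local_mem  # the "x 2 for GCN" step
kernel_local_mem = 4 * 1024               # assumed CL_KERNEL_LOCAL_MEM_SIZE
num_cus = 32                              # assumed CL_DEVICE_MAX_COMPUTE_UNITS

workgroup_gpu = workgroups_per_gpu(num_cus, local_mem_per_cu, kernel_local_mem)
```

In a real program the three inputs would come from clGetDeviceInfo and clGetKernelWorkGroupInfo; registers and other per-CU limits can lower the result further.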

       

      Question:

      1. I remember reading in the forums that only one workgroup executes at a time on a CU. So how do extra workgroups per CU help hide memory latency?
      2. Is there any other reason to put more than "workgroup-gpu" workgroups on the GPU, given that the rest are executed sequentially?
        • Re: Need for more workgroups
          nou

          Well, it helps hide memory latency: when a read instruction from the first wavefront is issued, it can take many cycles until the data arrives from memory. In the meantime the hardware can issue the same read instruction from the second wavefront, then from the third. By the time the first wavefront gets its turn again, the result of its read instruction is ready.
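The interleaving described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of real hardware: the function name, the one-cycle issue cost, and the latency of 10 cycles are all made up for illustration.

```python
def total_cycles(num_wavefronts, mem_latency, reads_per_wavefront):
    """Cycles until every wavefront has received the data of its last read.

    Toy model (an assumption, not real hardware behaviour): issuing a read
    occupies the issue unit for 1 cycle, and the issuing wavefront cannot
    issue again until mem_latency further cycles have passed.
    """
    ready_at = [0] * num_wavefronts                 # earliest cycle each wavefront may issue
    remaining = [reads_per_wavefront] * num_wavefronts
    cycle = 0
    while any(remaining):
        # Issue from the first wavefront that is ready; otherwise idle this cycle.
        for w in range(num_wavefronts):
            if remaining[w] > 0 and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + 1 + mem_latency
                break
        cycle += 1
    return max(ready_at, default=0)

one = total_cycles(1, 10, 4)    # 44 cycles for 4 reads
many = total_cycles(11, 10, 4)  # 54 cycles for 44 reads
```

In this model a single wavefront spends most of its time waiting, while eleven wavefronts keep the issue unit busy every cycle and finish eleven times the work in only slightly more time.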

            • Re: Need for more workgroups
              skanur

              I understand that wavefronts do that. But is it the same for workgroups as well? If only one workgroup is executed on a compute unit at a time, and that workgroup consists of only one wavefront, then how does the GPU hide memory latency? By switching workgroups?

                • Re: Need for more workgroups
                  dipak

                  Hi skanur,

                   

                  Let me clarify a few points.

                  • Work-items are processed in groups called wavefronts (WV). Wavefronts are like hardware threads: each has its own program counter and can run independently of the others.
                  • A CU can hold one or more work-groups for processing. Each work-group is divided into one or more wavefronts, depending on the work-group size.
                  • Each CU consists of one or more SIMD units. Each SIMD executes one instruction (a VLIW instruction in the case of a VLIW architecture) from one wavefront at a time. For example, in GCN each CU has four SIMDs, so it can execute four wavefronts simultaneously.
                  • Each SIMD has a wavefront queue holding one or more in-flight wavefronts (possibly from different work-groups or different kernels). For example, in GCN the queue length is 10, so at most 40 wavefronts can be in flight on a CU.
                  • During wavefront scheduling, one wavefront is chosen from the queue according to some rules and assigned to the SIMD for execution. These in-flight wavefronts are what hides latency.
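The GCN numbers in the points above can be combined into a quick calculation. This is a minimal sketch: the function names are hypothetical, the constants are the figures quoted in this reply (64 work-items per wavefront, 4 SIMDs per CU, queue length 10), and the result is only an upper bound from wavefront slots, since registers, local memory and other hardware limits may reduce it.

```python
WAVEFRONT_SIZE = 64        # work-items per wavefront
SIMDS_PER_CU = 4           # SIMD units per GCN compute unit
WAVEFRONTS_PER_SIMD = 10   # wavefront queue length per SIMD

def wavefronts_per_workgroup(workgroup_size):
    """Wavefronts needed for one work-group (rounded up)."""
    return (workgroup_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE

def max_inflight_wavefronts_per_cu():
    """Wavefront slots available on one CU."""
    return SIMDS_PER_CU * WAVEFRONTS_PER_SIMD

def max_workgroups_by_wavefront_slots(workgroup_size):
    """Upper bound on in-flight work-groups per CU from wavefront slots alone."""
    return max_inflight_wavefronts_per_cu() // wavefronts_per_workgroup(workgroup_size)

single_wave = max_workgroups_by_wavefront_slots(64)   # one wavefront per work-group
four_waves = max_workgroups_by_wavefront_slots(256)   # four wavefronts per work-group
```

So for the 64 work-item kernel in the question, several single-wavefront work-groups can be in flight on the same CU, and the scheduler switches between their wavefronts to hide latency.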

                   

                  Regards,
