18 Replies Latest reply on Aug 12, 2010 6:40 PM by jeff_golds

    Performance, Workgroup size

    Tasp

      This is from the documentation of the C++ bindings:

       

      global     describes the number of global work-items that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] * ... * global_work_size[work_dim - 1].

      local     describes the number of work-items that make up a work-group (also referred to as the size of the work-group) that will execute the kernel specified by kernel.



       

      If local is NullRange and no work-group size is specified when the kernel is compiled, the OpenCL implementation will determine how to break the global work-items specified by global into appropriate work-group instances. The work-group size to be used for kernel can also be specified in the program source using the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier. In this case the size of work group specified by local_work_size must match the value specified by the reqd_work_group_size attribute qualifier.


      Now I just set "local" to NullRange, but this leads to bad performance, with an Intel Core2 Duo @ 3.0GHz being faster than the HD4850 on kernels that mostly do convolutions.

      From the convolution example:

      In the above call, we also need to pass in a workgroup size. During computation, items within a work-group can share certain data and avail of some synchronization mechanisms that are not available to items across workgroups. We do not need any of those features in our current kernel, so it is tempting to use a workgroup of size 1.

       

      While that will work in principle and produce correct results, it can produce bad performance. There are many considerations when choosing the appropriate workgroup size, including which device (CPU or GPU) the kernel is to be run on. We will not go into those details in this writeup; for our runs on the CPU device, we will use the largest possible workgroup size (32x32).

      Now on a CPU device I get:

       

      Max compute units:                 2
        Max work items dimensions:             3
          Max work items[0]:                 1024
          Max work items[1]:                 1024
          Max work items[2]:                 1024
        Max work group size:                 1024


      On the HD4850 it's 200 compute units and size 256 instead of 1024 (if I remember correctly).

      My question now is: how do I choose the local work-group size for best performance if I want to do simple convolutions on images ranging from 100x100 to 2000x2000?

        • Performance, Workgroup size
          n0thing

          On ATI GPUs your local group size should be at least 64, and a multiple of that.

          So the only choices you have are 64, 128 or 256.

          If you use a lot of local memory, then you will have fewer active groups per SIMD and latency hiding of memory operations will suffer; you should use a smaller work-group size in that case.

          • Performance, Workgroup size
            omkaranathan

            Tasp,

            Workgroup size is limited by the number of registers used per thread and the local memory used per workgroup. You have to keep both of these minimal to get the best workgroup size. The clGetKernelWorkGroupInfo API call will give you the maximum workgroup size that can be used to execute your kernel. It's preferable to have a multiple of 64 as the workgroup size.

            • Performance, Workgroup size
              Raistmer
              Then a workgroup of a single full wavefront (64 work-items/threads) will take at least 4 clock cycles per instruction, instead of at least 1 clock per instruction as it would for a workgroup of only 16 threads. Right?
                • Performance, Workgroup size
                  nou

                  yes

                  • Performance, Workgroup size
                    hazeman

                     

                    Originally posted by: Raistmer Then workgroup of single full wavefront (64 work items/threads) will take at least 4 clock cycles per instruction, instead of at least 1 clock per instruction as it would be for workgroup of only 16 threads. Right?


                    I think nou was too quick with that answer. The wavefront is the smallest execution unit, so even if you try to run only 16 work-items, a full wavefront is executed (and the unneeded results are discarded).

                    To see the impact of workgroup size on performance, I suggest running the peekflops example in the CAL++ library (latest SVN version). Kernel execution time for workgroup sizes 8-64 is exactly the same.

                    On 4xxx cards the execution time for workgroup size 128 is also almost the same as for workgroup size 64 (probably due to wavefront scheduling). A worksize of ~256 achieves almost full performance.

                    PS. The kernel in the peekflops example is very heavy on computation and light on registers. Other kernels might use too many registers to allow a workgroup size >= 256.

                      • Performance, Workgroup size
                        hazeman

                        The effect is even more visible with a slightly modified version of peekflops (it tests worksizes from 16 to 512 in steps of 16).

                        Here are results for 4770.

                        *** 1 wavefront ***
                        Device 0: workgroup size 16 execution time 2320.28 ms, achieved 96.72 gflops
                        Device 0: workgroup size 32 execution time 2315.40 ms, achieved 193.84 gflops
                        Device 0: workgroup size 48 execution time 2309.81 ms, achieved 291.47 gflops
                        Device 0: workgroup size 64 execution time 2348.95 ms, achieved 382.15 gflops
                        *** 2 wavefronts ***
                        Device 0: workgroup size 80 execution time 2354.55 ms, achieved 476.55 gflops
                        Device 0: workgroup size 96 execution time 2354.54 ms, achieved 571.86 gflops
                        Device 0: workgroup size 112 execution time 2354.54 ms, achieved 667.17 gflops
                        Device 0: workgroup size 128 execution time 2354.54 ms, achieved 762.48 gflops
                        *** 3 wavefronts ***
                        Device 0: workgroup size 144 execution time 2936.13 ms, achieved 687.88 gflops
                        Device 0: workgroup size 160 execution time 2941.72 ms, achieved 762.86 gflops
                        Device 0: workgroup size 176 execution time 2936.13 ms, achieved 840.74 gflops
                        Device 0: workgroup size 192 execution time 2930.54 ms, achieved 918.93 gflops
                        *** 4 wavefronts ***
                        Device 0: workgroup size 208 execution time 3892.39 ms, achieved 749.50 gflops
                        Device 0: workgroup size 224 execution time 3892.40 ms, achieved 807.16 gflops
                        Device 0: workgroup size 240 execution time 3892.38 ms, achieved 864.81 gflops
                        Device 0: workgroup size 256 execution time 3892.38 ms, achieved 922.47 gflops
                        *** 5 wavefronts ***
                        Device 0: workgroup size 272 execution time 5424.35 ms, achieved 703.31 gflops
                        Device 0: workgroup size 288 execution time 5430.10 ms, achieved 743.89 gflops
                        Device 0: workgroup size 304 execution time 5424.76 ms, achieved 785.99 gflops
                        Device 0: workgroup size 320 execution time 5430.54 ms, achieved 826.48 gflops
                        *** 6 wavefronts ***
                        Device 0: workgroup size 336 execution time 6733.01 ms, achieved 699.93 gflops
                        Device 0: workgroup size 352 execution time 6732.02 ms, achieved 733.37 gflops
                        Device 0: workgroup size 368 execution time 6707.62 ms, achieved 769.49 gflops
                        Device 0: workgroup size 384 execution time 6708.28 ms, achieved 802.87 gflops
                        *** 7 wavefronts ***
                        Device 0: workgroup size 400 execution time 7715.53 ms, achieved 727.14 gflops
                        Device 0: workgroup size 416 execution time 7715.81 ms, achieved 756.20 gflops
                        Device 0: workgroup size 432 execution time 7713.82 ms, achieved 785.49 gflops
                        Device 0: workgroup size 448 execution time 7706.35 ms, achieved 815.37 gflops
                        *** 8 wavefronts ***
                        Device 0: workgroup size 464 execution time 7721.38 ms, achieved 842.85 gflops
                        Device 0: workgroup size 480 execution time 7717.20 ms, achieved 872.38 gflops
                        Device 0: workgroup size 496 execution time 7716.08 ms, achieved 901.59 gflops
                        Device 0: workgroup size 512 execution time 7682.24 ms, achieved 934.78 gflops

                    • Performance, Workgroup size
                      Raistmer
                      Interesting info.
                      So there is actually no way to get 1 clock per single operation; the only possibility is 4 clocks for 4 operations. And if 4 operations are not needed, the GPU will underperform.
                        • Performance, Workgroup size
                          pavandsp

                          Hi

                          Adding some more doubts with reference to the above example.

                          1. A wavefront of 64 means execution of 64 work-items (64 kernel instances) per compute unit (SIMD engine) at a time, right? If yes, then:

                          a. How do 64 work-items execute in parallel when you have only 16 stream cores? Is it 16 or 64 work-items in parallel? I am confused about what the four processing elements in each stream core do:

                                                      a. Process 4 VLIW instructions in parallel from a single kernel instance, over 4 cycles (i.e. 16 parallel kernel instances),

                                                                                  OR

                                                      b. Process 4 VLIW instructions, one from each of 4 kernel instances (i.e. 16x4 = 64 parallel kernel instances)?

                          Which is correct?

                          b. A wavefront is defined for a single compute unit, right? Not for the complete GPU of 18 compute units.

                          c. As a whole, for the 18 compute units, 1152 (18x64) work-items would be executing in parallel, right?

                          2. Each work-group will execute on a single compute unit, right? There won't be any distribution of its wavefronts to other compute units. For example, for a work-group size of 256 there will be 4 wavefronts, and each wavefront will execute one after the other on the same compute unit, right?

                          Thanks for the patience in reading my questions, and thanks in advance for clarifying.


                          --Pavan

                            • Performance, Workgroup size
                              pavandsp

                              Bumping this to the top.

                              Could anyone please clarify the doubts above?

                              Thanks

                              Pavan

                              • Performance, Workgroup size
                                genaganna

                                 

                                Originally posted by: pavandsp Hi Adding some more doubts with ref to above example 1.Wavefront of 64  defines "execution of 64 work-items or 64 kernel instances per compute unit(SIMD Engine) at a time right ?if yes then

                                 

                                Yes you are right.

                                a.How come 64 work-items execute in parallel when u have 16 stream cores?so is it 16 or 64 work-items in parallel. I am confused here because ... whether all four Processing Elements in stream core does                             a.Process a 4 VLIW instructions in parallel from single kernel instance offcourse for 4 cycles(i.e 16 parallel kernel instances)                                                         OR                             b. Process a 4 VLIW instructions from each  4 kernel instances(i.e 16x4 64 parallel kernel instances) which is correct?

                                The 64 work-items run in parallel as follows:

                                The first 16 work-items execute during the first clock cycle,

                                the next 16 work-items during the second clock cycle,

                                the next 16 work-items during the third clock cycle,

                                and the last 16 work-items during the fourth clock cycle.

                                Each stream core can execute a 5-slot VLIW instruction.

                                If you have only 16 work-items, 3 clock cycles are wasted because you don't have enough work-items.

                                b.Wavefront is defined for single Compute Unit right? and not for the complete GPU 18 compute units.

                                 

                                  Yes, the wavefront is defined per compute unit, but GPUs usually contain symmetric compute units, so you can say "this GPU has wavefront size 64".

                                c.As a whole for 18 compute units   1152(18x64) work items would be executing in parallel right?  

                                Yes, 1152 work-items will execute in parallel, but it takes four clock cycles.

                                2.Each  work-group will execute in single compute unit  right? there won't be  any distribution of  wavefronts to other compute unit. For Examples for a work-group size of 256 ,there will be 4 wavefronts and each wavefront will get executed one after the other in same compute unit right? Thanks for the patience in reading my questions and thanks in advance for clarifying the same .. --Pavan

                                 

                                Yes

                                  • Performance, Workgroup size
                                    jeff_golds

                                     

                                    Originally posted by: genaganna
                                    Originally posted by: pavandsp Hi Adding some more doubts with ref to above example 1.Wavefront of 64  defines "execution of 64 work-items or 64 kernel instances per compute unit(SIMD Engine) at a time right ?if yes then

                                     

                                    Yes you are right.

                                    a.How come 64 work-items execute in parallel when u have 16 stream cores?so is it 16 or 64 work-items in parallel. I am confused here because ... whether all four Processing Elements in stream core does                             a.Process a 4 VLIW instructions in parallel from single kernel instance offcourse for 4 cycles(i.e 16 parallel kernel instances)                                                         OR                             b. Process a 4 VLIW instructions from each  4 kernel instances(i.e 16x4 64 parallel kernel instances) which is correct?

                                    64 work-items run parallel as follows

                                    First 16 work-items will be executed during first clock-cycle

                                    Next 16 work-items will be executed during second clock-cycle

                                    Next 16 work-items will be executed during third clock-cycle

                                    Last 16 work-items will be executed during fourth clock-cycle

                                    and Each stream core is able to execute 5 VLIW.

                                    Suppose if you have 16 work-items, 3 clock-cycles are wasted because you don't have enough work-items



                                    There appears to be some confusion here.

                                    First, if you have 16 stream cores, then you have a chip I've never heard of.

                                    Second, each stream core (i.e. compute unit) works at wavefront granularity (in OpenCL, you may require multiple wavefronts for a single group).  A wavefront is usually 64 threads (32 on the HD5400, aka Cedar).

                                    A wavefront takes 4 clocks to execute a single VLIW instruction.  Thus, those 64 (or 32) threads all run concurrently over 4 clocks, giving an average of 16 (or 8) threads per clock.  It's a bit more complicated when dealing with fetches, so I won't go into that here.

                                    If you have 16 stream cores, then you can execute up to 16 wavefronts simultaneously, meaning you can execute 256 threads per clock on average.

                                    Normally it's sufficient to consider average threads per clock, but you need to consider larger groups of threads when dealing with LDS sharing, etc.

                                     

                                    Jeff