8 Replies Latest reply on Dec 24, 2013 6:32 AM by nou

    default vs. specified work-group size; global size multiples




      The AMD OpenCL programming guide states that (5.6.3) explicitly specifying work-group sized instead of letting the OpenCL implementation choose it automatically (through local_work_size argument of NULL) is preferred. In how far is this a practical issue?

      Consider a relatively large number of threads to be launched, but no workgroups specified:


      someQueue.enqueueNDRangeKernel(someKernel, cl::NDRange(0), cl::NDRange(64000), cl::NullRange, 0, 0);


      obviously all work-items in the kernel are fully independent of each other, so no need for local memory, synchronizations of any kind etc.


      First question: can we reasonably expect this to be slower than, say, a scheme launching 1000 workgroups of size {1, 64} and the only kernel code difference being some additional calculation to get back the "original" work-item id (a value in [0, 63999])? My gut feeling would have been that it makes no difference (or, if anything, the extra kernel calculations make it slower) as the runtime cleverly splits the workitems across CUs anyway, but the manual makes me get doubts.


      Second, assume that the number of workitems in a workgroup is always an integer multiple of say 64, but the global work size is not (irrelevant of whether we use work-groups at all or not, so for the latter e.g. just take above example with say cl::NDRange(63997) for global size). Does this normally have a clear performance impact as compared to a scheme where the global work size is adjusted to be an integer multiple of say 64, and those work-items which are exceeding the "real" global work size number just do nothing (return immediately), Again, my gut feeling is it does not matter, as all except the very last wavefront make full utilization of hardware anyway (are groups of 64 workitems on a Tahiti device), but again I might be wrong.


      thanks !

        • Re: default vs. specified work-group size; global size multiples

          if you specify null range implementation must complain to same conditions as manual specifying that mean global work size is multiple of work group size. IMHO reason why is better specifies the work group size is you can make better assumption than runtime which can choose suboptimal number.


          for your example with 63997 work items runtime will must launch kernel with work group size == 1. there is no automatic padding to optimal size.

            • Re: default vs. specified work-group size; global size multiples

              well clearly the work group size must be 1. But does it make any difference regarding performance, and if so how device-dependent is the impact?


              just assume the kernel is making an ordinary (maths)-vector addition:

              i = get_global_id(0);

              c[i] = a[i] + b[i];


              and I launch as many work-groups of size 1 as there are elements in the vectors (n). Is this suboptimal compared to a launch with a work-group size of greater than 1? If n is not evenly divisible by say 64, does it mean the compute devices on a Tahiti would not be fully utilized? My understanding so far has been that they are fully utilized, as each CU will process a wavefront comprised of 64 workgroups each (each of size 1). Hence the runtime should be roughly equal to one where I pad my n to a multiple of 64, and then launch workgroups of say dim {1, 64} (or {8 * 8} or whatever). Is that assumption not correct?



                • Re: default vs. specified work-group size; global size multiples

                  the limit of wavefront on CGN architecture is 40. also it can execute only single wavefront at the same time. so you will get only 1/64 of maximum performance. multiple wavefronts are executed when you have light weight kernels to hide memory access latencies.

                    • Re: default vs. specified work-group size; global size multiples

                      Sorry I don't fully get this - but the issue is very important, because I don't want to limit myself to 1/64 of maximum performance in the worst case!


                      I understand that a CU is limited to 40 wavefronts; and it makes sense that a CU can only execute a single wavefront at a time. The question boils down, practically speaking, how work-items are scheduled to wavefronts:


                      Let's consider first the case where I do specify local work group sizes. My assumption has been that if these work group sizes are less than (or not multiples of) 64, then on a Tahiti device wavefronts are not fully utilized (i.e. it has less than 64 work items), and hence I am not achieving maximum performance. I suppose that is correct, isn't it?


                      The real question I am struggling with is how wavefronts are organized if I do not specify local work groups at all. My practical calculations have a relatively large number of elements (n) to be processed. n is fairly large (n > 20000), each element is calculated independently of every other element (no synchronization needs etc.), and the calculation done for each element is relatively expensive, though (mostly) the same for all elements and memory aligned well for coalesced access (lock-stepped execution etc. is hence no a major problem). n is a user input and presently not padded to anything.


                      Because each element is processed independently, I do not specify local work groups, just a global size. The launch invokation looks (C++ bindings) like this:


                      someQueue.enqueueNDRangeKernel(someKernel, cl::NDRange(0), cl::NDRange(n), cl::NullRange, 0, 0);


                      My assumption has been that the scheduler automatically places 64 (Tahiti) consecutive work-items into a single wavefront (because there are no local work group constraints at all), and hence each wavefront is completely utilized and I am more or less at maximum performance. If, however, the scheduler places only a single work-item into a single wavefront (because I am not specifying local work groups), then of course I am far away from maximum performance.

                      So which of the two scheduler behaviours applies? The former or the latter? Or might the behaviour even depend on additional factors, such as the value of n, the specific hardware device, or kernel code?


                      thanks !

                        • Re: default vs. specified work-group size; global size multiples

                          if you pass null range OpenCL runtime will try best to achieve best performance. in other word it will run as big work-group which is multiple of 64 as it can. if you enqueue 1071 workitems with null range OpenCL runtime should execute with 63 or 17 workgroup size. with null range you are telling OpenCL to take best guess that it can. one reason why you should specify workgroup size is that it will return error in case that global size is not divisible by work group size. it can run unnoticed as with 63997 when you get suboptimal performance.

                            • Re: default vs. specified work-group size; global size multiples

                              I see. So if n is say a prime number >2 the worst-case scenario is applicable and each wavefront is comprised on only a single work-item.

                              (I don't get why the runtime doesn't automatically pad n in terms of kernel launch to an ideal integer and signals the superfluous kernel instances that they are superfluous = no execution, but maybe that's a different story).


                              So I rearrange my code that n pads to 64 (Tahiti). Does it in terms of performance make any difference how I do that?


                              For example, I can keep my original code example (no work groups specified) and substitute n by m which is a multiple of 64:

                              someQueue.enqueueNDRangeKernel(someKernel, cl::NDRange(0), cl::NDRange(m), cl::NullRange);


                              Or I specify dummy-workgroups manually, e.g.:

                              someQueue.enqueueNDRangeKernel(someKernel, cl::NDRange(0, 0), cl::NDRange(m/64, 64), cl::NDRange(1, 64));

                              which gives me work-groups of dim {1, 64}, or

                              someQueue.enqueueNDRangeKernel(someKernel, cl::NDRange(0, 0), cl::NDRange(64, m/64), cl::NDRange(64, 1));

                              which gives me work-groups of dim {64, 1};


                              [of course in each case the kernel code first checks if the work-item is a padding-work-item "out-of-bounds" and if so returns immediately].


                              Is there any good reason for preferring the one or the other? And all variants should yield optimal performance, isn't it?

                              Thanks !

                    • Re: default vs. specified work-group size; global size multiples

                      Just a side-note: The restriction that the global work size has to be a multiple of the local work size is going to be removed with OpenCL 2.0. Instead the behavior will be that an additional kernel call with work size global%local is made.


                      I believe the NVidia OpenCL implementation didn't require the global size to be a multiple of the local one last time I checked. Although this is of course incorrect behavior according to the OpenCL <=1.2 specs.