9 Replies Latest reply on May 20, 2011 8:59 AM by maximmoroz

    Local work size setting!

    Atmapuri

      Hi!

      I am testing this kernel:

          int gid = get_global_id(0);
          Dst[DstIdx + gid] = Src1[Src1Idx + gid] + Src2[Src2Idx + gid];

      If I don't specify a local work size, the kernel runs fast for any vector size (get_global_size(0)), including prime sizes, for which no local_work_size greater than 1 divides the global size evenly.

      If I do specify local_work_size, then for prime vector lengths it has to be 1 (the only usable value that divides the global size), and performance takes a steep dive (100x).

      So I was wondering: what value does clEnqueueNDRangeKernel use for local_work_size when the user specifies none and the global size is a prime number?

      Thanks!
      Atmapuri
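A common workaround for the prime-length case, sketched here under assumptions that are not part of the thread: round the global size up to the next multiple of the chosen local size, and pass the true length to the kernel so the padded work-items return early. The helper `round_up_global`, the kernel name `vec_add`, the extra argument `n`, and the `float` element type are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is >= n.
   local_size must be non-zero. */
size_t round_up_global(size_t n, size_t local_size)
{
    return ((n + local_size - 1) / local_size) * local_size;
}

/* OpenCL C kernel source held in a string, for illustration only.
   The argument n (true vector length) is an assumption; it lets the
   work-items added by padding skip the store. */
const char vec_add_src[] =
    "__kernel void vec_add(__global const float *Src1,\n"
    "                      __global const float *Src2,\n"
    "                      __global float *Dst,\n"
    "                      int Src1Idx, int Src2Idx, int DstIdx, int n)\n"
    "{\n"
    "    int gid = get_global_id(0);\n"
    "    if (gid >= n) return;\n"
    "    Dst[DstIdx + gid] = Src1[Src1Idx + gid] + Src2[Src2Idx + gid];\n"
    "}\n";
```

For a prime length such as 1009 and a local size of 64, `round_up_global(1009, 64)` gives 1024, so 15 work-items hit the guard and do nothing.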

       

        • Local work size setting!
          MicahVillmow
          If you do not set the local work size, we use a default size, which is device dependent but usually 256 (or 64). By setting it to 1, you are effectively using only 1/[64|32|16]th of each wavefront on the machine.
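The arithmetic behind the "1/64th of each wavefront" remark can be sketched as follows. This is a simplified model that assumes a work-group always occupies whole wavefronts; the function name `lane_utilization` is made up for this example.

```c
#include <stddef.h>

/* Fraction of SIMD lanes doing useful work when a work-group of
   local_size work-items is mapped onto wavefronts of wavefront_size
   lanes. Simplified model: partially filled wavefronts still cost a
   full wavefront. Both arguments must be non-zero. */
double lane_utilization(size_t local_size, size_t wavefront_size)
{
    size_t wavefronts = (local_size + wavefront_size - 1) / wavefront_size;
    return (double)local_size / (double)(wavefronts * wavefront_size);
}
```

With a local size of 1 and a 64-wide wavefront this gives 1/64, while a local size of 64 gives 1.0, which is in line with the large slowdown reported above.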
            • Local work size setting!
              Atmapuri

              That I understand, but I can't set it to more than 1 if the vector length is a prime number. clEnqueueNDRangeKernel returns "invalid work group size", which is exactly what is documented. If I don't specify anything, it works fine and fast. Does the driver internally have the ability to set the work group size to something that clEnqueueNDRangeKernel would otherwise not accept?

            • Local work size setting!
              MicahVillmow
              Atmapuri,
              Yes. The CL API applies restrictions that are not necessarily restrictions in hardware.
                • Local work size setting!
                  Atmapuri

                  I see. Thanks for the confirmation. On a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.

                    • Local work size setting!
                      maximmoroz

                       

                      Originally posted by: Atmapuri I see. Thanks for the confirmation. On a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.


                      Intel OpenCL Alpha also had that sentence, and I wondered how to reconcile it with this statement from the AMD APP OpenCL Programming Guide:

                      it is recommended that the developer explicitly specify the global (#work-groups) and local (#work-items/work-group) dimensions, rather than rely on OpenCL to set these automatically (by setting local_work_size to NULL in clEnqueueNDRangeKernel).

                      I ended up setting the local work size based on the preferred work-group size multiple parameter.
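One way that choice could look, as a sketch rather than maximmoroz's actual code: take the largest multiple of the preferred multiple that still fits in the kernel's maximum work-group size. The helper name `pick_local_size` is hypothetical. In a real host program the two inputs would presumably come from `clGetKernelWorkGroupInfo` with `CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE` and `CL_KERNEL_WORK_GROUP_SIZE`; the queries are left out so the block stays free of OpenCL headers.

```c
#include <stddef.h>

/* Hypothetical helper: largest multiple of preferred_multiple that does
   not exceed max_wg_size. Falls back to max_wg_size when the preferred
   multiple is zero or larger than the maximum. */
size_t pick_local_size(size_t preferred_multiple, size_t max_wg_size)
{
    if (preferred_multiple == 0 || preferred_multiple > max_wg_size)
        return max_wg_size;
    return (max_wg_size / preferred_multiple) * preferred_multiple;
}
```

The global size then has to be a multiple of the result, which for a prime vector length means padding it and guarding the kernel body with a bounds check.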

                        • Local work size setting!
                          Atmapuri

                          I am not sure it is worth pursuing the idea of having the same kernels run on CPU-based OpenCL and GPU-based OpenCL. The concepts are so different that a one-size-fits-all solution seems impossible.

                          For example, coalescing reads within workgroups requires a stride different from 1 on GPU. Anything other than 1 on CPU is a complete performance failure.
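One possible reading of the stride remark, assuming each work-item processes several elements in a loop: the two index patterns below differ only in how far a single work-item jumps between its own iterations. The function names are invented for this sketch and nothing here comes from the poster's code.

```c
#include <stddef.h>

/* Chunked pattern: work-item gid walks a contiguous block, so its own
   stride between iterations k and k+1 is 1 (cache friendly on a CPU). */
size_t chunked_index(size_t gid, size_t k, size_t items_per_work_item)
{
    return gid * items_per_work_item + k;
}

/* Interleaved pattern: in every iteration neighbouring work-items touch
   neighbouring elements (which is what lets a GPU coalesce the access),
   and each work-item advances by the total number of work-items. */
size_t interleaved_index(size_t gid, size_t k, size_t num_work_items)
{
    return k * num_work_items + gid;
}
```

With 256 work-items, work-item 1 reads elements 1, 257, 513, ... in the interleaved pattern, but elements 4, 5, 6, 7 in the chunked pattern with 4 items per work-item.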

                            • Local work size setting!
                              maximmoroz

                               

                              Originally posted by: Atmapuri I am not sure it is worth pursuing the idea of having the same kernels run on CPU-based OpenCL and GPU-based OpenCL. The concepts are so different that a one-size-fits-all solution seems impossible.

                              For example, coalescing reads within workgroups requires a stride different from 1 on GPU. Anything other than 1 on CPU is a complete performance failure.

                              It definitely depends on the goal and/or the kernel. One of the benefits of OpenCL is the ability to compute on different devices without any changes to the code. There are certainly cases (projects/products) where this potential benefit becomes a real one, and I would argue there are a lot of them. For example, a kernel may be designed to run mostly on a GPU but still need to run on a CPU, where performance is not a high-priority issue at all.

                      • Local work size setting!
                        MicahVillmow
                        The local workgroup size is an algorithmic design decision. If you don't use local memory, then there is no point in setting it. If the algorithm is optimal for a single workgroup size, then set it at compile time with the reqd_work_group_size attribute. If the algorithm can handle multiple sizes, then set it at runtime.
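A minimal sketch of the compile-time option, assuming a one-dimensional kernel and an arbitrary size of 64. The kernel name `vec_add_fixed`, the macro `REQD_LOCAL_SIZE`, and the helper `launch_is_valid` are hypothetical; the helper only mirrors the rule that the local size passed at enqueue time must match the attribute, and it is not an OpenCL call.

```c
#include <stddef.h>

/* Work-group size baked into the kernel below; 64 is only an example. */
#define REQD_LOCAL_SIZE 64

/* OpenCL C kernel source held in a string. The attribute tells the
   compiler that this kernel is always launched with 64x1x1 work-groups. */
const char vec_add_fixed_src[] =
    "__kernel __attribute__((reqd_work_group_size(64, 1, 1)))\n"
    "void vec_add_fixed(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c)\n"
    "{\n"
    "    int gid = get_global_id(0);\n"
    "    c[gid] = a[gid] + b[gid];\n"
    "}\n";

/* Host-side pre-check (not part of OpenCL): returns 1 when the local
   size equals the required size and divides the global size evenly. */
int launch_is_valid(size_t global_size, size_t local_size)
{
    return local_size == REQD_LOCAL_SIZE && global_size % local_size == 0;
}
```

With this attribute in place, a launch that passes a different local size is expected to be rejected with an invalid work-group size error, so a prime-length vector still needs a padded global size.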