1 Reply Latest reply on Sep 6, 2015 7:01 AM by dipak

    Questions about local_work_size and global_work_size


      According to https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html, the values specified in global_work_size must be evenly divisible by the corresponding values specified in local_work_size. However, I set local_work_size=256 and global_work_size=384 in my program, and it runs with no error.

      I am using AMD APP SDK 3.0. My questions are as follows:

      1. How does the driver handle the case where global_work_size is not evenly divisible by local_work_size?

      2. Does it cause any performance degradation?

        • Re: Questions about local_work_size and global_work_size

          The requirement that the global work-size be evenly divisible by the local work-size has been relaxed in OpenCL 2.0. As per the OpenCL 2.0 spec (clEnqueueNDRangeKernel):

          The values in local_work_size need not evenly divide the global_work_size in any dimension. In this case, any single dimension for which the global size is not divisible by the local size will be partitioned into two regions. One region will have workgroups that have the same number of work items as was specified by the local size parameter in that dimension. The other region will have work-groups with less than the number of work items specified by the local size parameter in that dimension. The global IDs and group IDs of the work items in the first region will be numerically lower than those in the second, and the second region will be at most one work-group wide in that dimension. Workgroup sizes could be non-uniform in multiple dimensions, potentially producing work-groups of up to 4 different sizes in a 2D range and 8 different sizes in a 3D range.
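
          The partitioning described in the quoted text can be sketched in plain C. This is only an illustration of the arithmetic, not driver code; the type dim_partition and the functions partition_dim and distinct_group_shapes are hypothetical names introduced here, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: how one NDRange dimension is split. */
typedef struct {
    size_t full_groups;    /* work-groups with exactly `local` work-items */
    size_t remainder_size; /* size of the trailing smaller group, 0 if none */
    size_t total_groups;   /* full groups plus the remainder group, if any */
} dim_partition;

/* Split one dimension into a region of full-size work-groups and,
 * if needed, a second region that is one smaller work-group wide. */
static dim_partition partition_dim(size_t global, size_t local)
{
    dim_partition p;
    assert(local > 0);
    p.full_groups = global / local;
    p.remainder_size = global % local;
    p.total_groups = p.full_groups + (p.remainder_size != 0 ? 1 : 0);
    return p;
}

/* Count the distinct work-group shapes across `dims` dimensions.
 * Each dimension contributes up to 2 sizes (full and remainder),
 * so the result is at most 4 in 2D and at most 8 in 3D. */
static size_t distinct_group_shapes(const size_t *global,
                                    const size_t *local,
                                    unsigned dims)
{
    size_t shapes = 1;
    unsigned d;
    for (d = 0; d < dims; ++d) {
        dim_partition p = partition_dim(global[d], local[d]);
        size_t sizes_in_dim = (p.full_groups > 0 ? 1 : 0)
                            + (p.remainder_size > 0 ? 1 : 0);
        shapes *= sizes_in_dim;
    }
    return shapes;
}
```

          For the values in the question, partition_dim(384, 256) gives one full work-group of 256 work-items followed by one work-group of 128 work-items.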


          As a result, when the global work-size is not evenly divisible, a few wavefronts will contain fewer valid work-items than the wavefront size. During wavefront execution, the invalid work-items are masked out and only the valid work-items execute, as in any partially filled wavefront. Thus, only a fraction of the SIMD unit is utilized in those cases.
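
          To get a rough feel for the cost, the number of masked-out lanes in a work-group can be estimated as below. This is a minimal sketch that assumes the work-group's work-items are packed linearly into wavefronts and that the wavefront size is passed in by the caller (64 is used in the note below as an assumption, typical of AMD GCN hardware). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Wavefronts needed to cover a work-group of `group_size` work-items. */
static size_t wavefronts_for_group(size_t group_size, size_t wavefront_size)
{
    assert(wavefront_size > 0);
    return (group_size + wavefront_size - 1) / wavefront_size;
}

/* Lanes that are masked out because the last wavefront is only
 * partially filled; 0 means every lane does useful work. */
static size_t idle_lanes(size_t group_size, size_t wavefront_size)
{
    return wavefronts_for_group(group_size, wavefront_size) * wavefront_size
         - group_size;
}
```

          With an assumed wavefront size of 64, the 128-work-item remainder group from the question fills two wavefronts exactly, so idle_lanes(128, 64) is 0. A remainder group of 100 work-items would also need two wavefronts but leave 28 lanes idle.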


          For optimization, one can use the build option "-cl-uniform-work-group-size", which requires that the global work-size be a multiple of the work-group size specified to clEnqueueNDRangeKernel and thereby allows the compiler/implementation to perform certain optimizations.
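
          One common way to satisfy the multiple-of-local-size requirement (not something stated in the spec text above) is to pad the global work-size up to the next multiple on the host and have the kernel skip the padding work-items with a bounds check. A minimal sketch of the host-side rounding follows; round_up_global is a hypothetical helper name, and the kernel-side guard is shown only as a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Round `work_items` up to the next multiple of `local`, so the result
 * can be passed as global_work_size with a uniform work-group size.
 *
 * The kernel would then need a guard such as:
 *     if (get_global_id(0) >= n) return;
 * where `n` (an assumed kernel argument) is the real work-item count. */
static size_t round_up_global(size_t work_items, size_t local)
{
    assert(local > 0);
    return ((work_items + local - 1) / local) * local;
}
```

          For the values in the question, round_up_global(384, 256) returns 512, i.e. two full work-groups of 256, with the last 128 work-items doing nothing once the guard is in place.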