The requirement that the global work-size be evenly divisible by the local work-size was relaxed in OpenCL 2.0. As per the OpenCL 2.0 spec (clEnqueueNDRangeKernel):
"The values in local_work_size need not evenly divide the global_work_size in any dimension. In this case, any single dimension for which the global size is not divisible by the local size will be partitioned into two regions. One region will have workgroups that have the same number of work items as was specified by the local size parameter in that dimension. The other region will have work-groups with less than the number of work items specified by the local size parameter in that dimension. The global IDs and group IDs of the work items in the first region will be numerically lower than those in the second, and the second region will be at most one work-group wide in that dimension. Workgroup sizes could be non-uniform in multiple dimensions, potentially producing work-groups of up to 4 different sizes in a 2D range and 8 different sizes in a 3D range."
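To make the partitioning rule concrete, here is a minimal sketch of the arithmetic in plain C. The names partition_dim and max_distinct_group_sizes are hypothetical helpers made up for illustration, not part of the OpenCL API, and the sketch assumes the local size is non-zero.

```c
#include <stddef.h>

/* How one NDRange dimension is split when local does not divide global.
   Hypothetical illustration type, not an OpenCL type. */
typedef struct {
    size_t full_groups;     /* work-groups with exactly `local` work-items */
    size_t remainder_size;  /* size of the single trailing group, 0 if none */
} dim_partition;

/* Splits one dimension into the two regions described in the spec text.
   Assumes local != 0. */
static dim_partition partition_dim(size_t global, size_t local)
{
    dim_partition p;
    p.full_groups = global / local;
    p.remainder_size = global % local;
    return p;
}

/* Upper bound on the number of different work-group sizes: every dimension
   that has both full groups and a remainder group doubles the count. */
static size_t max_distinct_group_sizes(const size_t *global,
                                       const size_t *local,
                                       size_t dims)
{
    size_t count = 1;
    for (size_t d = 0; d < dims; ++d) {
        dim_partition p = partition_dim(global[d], local[d]);
        if (p.full_groups > 0 && p.remainder_size > 0)
            count *= 2;
    }
    return count;
}
```

For example, a global size of 1000 with a local size of 64 gives 15 full work-groups (960 work-items) followed by one work-group of 40 work-items.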
As a result, when the global work-size is not evenly divisible, a few wavefronts will contain fewer valid work-items than the wavefront size. During execution, the invalid work-items are masked out and only the valid work-items execute, just as in any partially filled wavefront. Thus, only a fraction of the SIMD unit is utilized in those cases.
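The effect on lane utilization can be estimated with a few lines of C. This is only a rough sketch: it assumes a wavefront width of 64 (typical of AMD GCN hardware; in real code a value such as CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE would be queried instead) and a 1D work-group, and the function names are hypothetical.

```c
#include <stddef.h>

/* Assumed wavefront width; an illustration value, not queried from a device. */
#define WAVEFRONT_SIZE 64u

/* Number of wavefronts needed to hold one work-group (rounded up). */
static size_t wavefronts_for_group(size_t group_size)
{
    return (group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Valid (unmasked) lanes in the last wavefront of the work-group. */
static size_t active_lanes_in_last_wavefront(size_t group_size)
{
    size_t rem = group_size % WAVEFRONT_SIZE;
    return (rem == 0 && group_size > 0) ? WAVEFRONT_SIZE : rem;
}

/* Fraction of SIMD lanes doing useful work for this work-group. */
static double lane_utilization(size_t group_size)
{
    size_t waves = wavefronts_for_group(group_size);
    if (waves == 0)
        return 0.0;
    return (double)group_size / (double)(waves * WAVEFRONT_SIZE);
}
```

Under these assumptions, a remainder work-group of 40 work-items occupies one wavefront with 40 of 64 lanes active, while the full 64-item work-groups are unaffected.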
For optimization, one can pass the build option "-cl-uniform-work-group-size", which requires that the global work-size be a multiple of the work-group size specified to clEnqueueNDRangeKernel and thereby allows the compiler/implementation to perform certain optimizations.
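If that option is used, the host has to guarantee the multiple itself. One common way, sketched below, is to round the global size up and let the kernel skip the padding work-items. The helper round_up_global, the constant kBuildOptions, and the variable names in the comment (program, device, queue, kernel, n) are assumptions for illustration; the OpenCL calls appear only in a comment because they need the OpenCL headers and a live context.

```c
#include <stddef.h>

/* Build options requesting OpenCL 2.0 with uniform work-group sizes. */
static const char *const kBuildOptions =
    "-cl-std=CL2.0 -cl-uniform-work-group-size";

/* Rounds global up to the next multiple of local (hypothetical helper).
   Assumes local != 0. */
static size_t round_up_global(size_t global, size_t local)
{
    return ((global + local - 1) / local) * local;
}

/* Host-side usage, shown as a comment since it needs an OpenCL context:
 *
 *   clBuildProgram(program, 1, &device, kBuildOptions, NULL, NULL);
 *   size_t local  = 64;
 *   size_t global = round_up_global(n, local);
 *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                          &global, &local, 0, NULL, NULL);
 *
 * The kernel must then ignore the padding work-items, e.g.:
 *
 *   if (get_global_id(0) >= n) return;
 */
```

With n = 1000 and a local size of 64, this launches 1024 work-items, of which the last 24 return immediately.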
Regards,