Performance, Workgroup size

Discussion created by Tasp on Apr 13, 2010
Latest reply on Aug 12, 2010 by jeff_golds

This is from the documentation of the c++ bindings:


global     describes the number of global work-items that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] * ... * global_work_size[work_dim - 1].

local     describes the number of work-items that make up a work-group (also referred to as the size of the work-group) that will execute the kernel specified by kernel.


If local is NullRange and no work-group size is specified when the kernel is compiled, the OpenCL implementation will determine how to break the global work-items specified by global into appropriate work-group instances. The work-group size to be used for kernel can also be specified in the program source using the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier. In this case the size of work group specified by local_work_size must match the value specified by the reqd_work_group_size attribute qualifier.
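To make the quoted rules concrete, here is a minimal sketch of the arithmetic involved. It does not call OpenCL at all, and the helper names total_work_items and round_up are made up for illustration. It assumes OpenCL 1.x, where an explicitly passed local size must evenly divide the global size in each dimension, so the global size is usually padded up and the kernel skips the out-of-range items with a bounds check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: total number of global work-items, i.e. the product
// global[0] * ... * global[work_dim - 1] from the quoted documentation.
std::size_t total_work_items(const std::vector<std::size_t>& global) {
    std::size_t total = 1;
    for (std::size_t g : global) {
        total *= g;
    }
    return total;
}

// Hypothetical helper: round one global dimension up to the next multiple
// of the local size, so that the local size divides it evenly.
std::size_t round_up(std::size_t global, std::size_t local) {
    assert(local > 0);
    return ((global + local - 1) / local) * local;
}
```

For a 100x100 image and a 16x16 work-group, round_up(100, 16) gives 112 per dimension, so the padded range would be 112x112 and the kernel would have to ignore work-items with coordinates of 100 or more.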

Now I just set "local" to NullRange, but this leads to bad performance, with an Intel Core2 Duo @ 3.0GHz being faster than an HD4850 on kernels that do mostly convolutions.

From the convolution example:

In the above call, we also need to pass in a workgroup size. During computation, items within a work-group can share certain data and avail of some synchronization mechanisms that are not available to items across workgroups. We do not need any of those features in our current kernel, so it is tempting to use a workgroup of size 1.


While that will work in principle and produce correct results, that can produce bad performance. There are many considerations while choosing the appropriate workgroup size, including which device (CPU or GPU) the kernel is to be run on. We will not go into those details in this writeup; for our runs on the CPU device, we will use the largest possible workgroup size (32x32).

Now on a CPU device I get:


  Max compute units:                 2
  Max work items dimensions:         3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               1024

On the HD4850 it's 200 compute units and a max work group size of 256 instead of 1024 (if I remember correctly).
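As a rough illustration of how those limits bound a 2D work-group, the sketch below finds the largest power-of-two square that fits within a given max work group size. The function name square_local_edge is hypothetical, and restricting to power-of-two squares is an assumption made only for simplicity. This gives an upper bound from the device limit, not the best-performing size; the limit for a particular kernel (CL_KERNEL_WORK_GROUP_SIZE, queried with clGetKernelWorkGroupInfo) may be lower.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: largest power-of-two edge length s such that
// s * s <= max_work_group_size. A 2D work-group of s x s work-items then
// stays within the reported "Max work group size".
std::size_t square_local_edge(std::size_t max_work_group_size) {
    assert(max_work_group_size > 0);
    std::size_t edge = 1;
    // Keep doubling while the doubled square still fits within the limit.
    while ((edge * 2) * (edge * 2) <= max_work_group_size) {
        edge *= 2;
    }
    return edge;
}
```

With the values above, a limit of 1024 yields 32 (the 32x32 used in the convolution example for the CPU), and a limit of 256 yields 16, i.e. a 16x16 work-group.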

My question now is how to choose the local work group size for best performance if I want to do simple convolutions on images ranging from 100x100 to 2000x2000.