I am currently re-writing some OpenCL code of mine and would like to split the work of the group to more waves in order to have more waves in flight. The code is a OpenCL 1.2 code (because it needs to be compatible with Nvidia GPUs as well). On Nvidias running more then 256 items is no problem as long as enough resources are available.
I read that it seems to be possible to run a larger work group OpenCL kernel when Null range is used or when compile time requirement work group size and submit time sizes agree.
Unfortunately I experience the following problem:
I submit a kernel that has attribute header
__attribute__((reqd_work_group_size(256, 2, 1)))
and my submit part looks like
cl_int err = queue.enqueueNDRangeKernel(kernel, cl::NDRange(0), cl::NDRange(8388608,2), cl::NDRange(256,2), NULL, NULL);
So I expect 32768 x 1 work groups with 256x2 items each. Interesting the returned error code is CL_SUCCESS (by the way, switching to 128, 4 gives invalid work group size)
I start the kernel with following two lines:
if (get_global_id(0) == 0) printf("Local Sizes: %d %d \n",get_local_size(0),get_local_size(1));
if (get_group_id(0) == 0) printf("Work Item Id: %d %d \n",get_local_id(0), get_local_id(1));
Interestingly I get 256 and 2 return from the first line, but work group 0 only prints me the local id's (x 0), all the (x 1) are missing and never executed.
My OpenCL platform identifies itself as "OpenCL 2.1 AMD-APP (2671.3)", GPU as "OpenCL 1.2 AMD-APP(2671.3)" (this is strange that it also does not report OpenCL 2.1 compatibility ...),
Its a Ubuntu 18.04 system with recent amdgpu-pro 18.30 running. The GPU is a RX 580 4G (plus a AMD A10 7850 iGPU that is also detected).
|Max work item dimensions||3|
|Max work item sizes||1024x1024x1024|
|Max work group size||256|
|Preferred work group size (AMD)||256|
|Max work group size (AMD)||1024|
|Preferred work group size multiple||64|
|Wavefront width (AMD)||64|
So, the OpenCL standard value and CL_DEVICE_MAX_WORK_GROUP_SIZE_AMD do not agree.
The kernel uses 33 registers (it compiles well in rga and CodeXL) and 21.0k local memory. So with 256 work items per group I can have only 3 waves per SIMD active while 512 would allow 6 - I hope to get better performance by this (at least on Nvidia the step from 256 to 512 helped a lot)
Any advice to get this running? Unfortunately using NULL local range and hope the the compiler just will give 512 work group size is not an option, because there are lot of hard-coded optimizations to the work group size in the code.
Thanks in advance.