If I call clEnqueueNDRangeKernel(...) with a local size of NULL, is there any way to find out how the hardware has decided to utilise the work groups, i.e. how many work items (kernel instances) are running in each group? I've had a look at the stats in CodeXL but I don't understand a lot of what is being reported. I'm assuming that what I'm looking for is buried somewhere in all those numbers.
(On a related note, if work_dim is 2 and local size is set to (1, 1), would this effectively result in only one kernel instance running in each workgroup?)
Thank you for the query.
If local-size is NULL, the OpenCL implementation determines how to break the global work-items into appropriate work-group instances. The choice also depends on the target device on which the kernel will be executed.
clGetKernelWorkGroupInfo can be used to query information about a kernel object that may be specific to a device. Calling this API with the parameter CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns the preferred work-group size multiple for that kernel on a specific device. This value is a good hint at the default work-group size the implementation may choose for that kernel on that device when no local size is specified.
Using CodeXL, the work-group size used for each kernel can be found under the "WorkGroupSize" column in the "GPU Performance Counters" profiling report. For more information, please see CodeXL User Guide (Help) > Using CodeXL > GPU Profiler > Using the GPU Profiler > GPU Profiler Performance Counters Session.
On a related note, if work_dim is 2 and local size is set to (1, 1), would this effectively result in only one kernel instance running in each workgroup?
Yes, it means that each work-group contains only a single work-item, so the number of work-groups launched equals the total global work size.