Thank you for the query.
If the local size (the local_work_size argument of clEnqueueNDRangeKernel) is NULL, the OpenCL implementation determines how to break the global work-items into appropriate work-group instances. The choice also depends on the target device where the kernel will be executed.
clGetKernelWorkGroupInfo can be used to query information about a kernel object that may be specific to a device. Calling this API with the parameter CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns the preferred work-group size multiple for that kernel on a specific device. Note that this is a multiple, not a work-group size in itself. It is a performance hint: the default work-group size that the implementation chooses when no local size is specified is likely, though not guaranteed, to be a multiple of this value.
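To illustrate how an application might use that hint when it sets the local size itself, here is a minimal sketch in plain C. The helper pick_local_size is hypothetical and is not part of OpenCL, and it does not describe what any implementation does internally. It assumes one dimension, that preferred_multiple was obtained with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, that max_wg_size was obtained with CL_KERNEL_WORK_GROUP_SIZE, and that the global size must be evenly divisible by the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API): returns the largest local size that
 *   - is a multiple of preferred_multiple,
 *   - does not exceed max_wg_size, and
 *   - evenly divides global_size.
 * Returns 0 when no such size exists; the caller could then pass NULL as the
 * local size and let the implementation decide. */
size_t pick_local_size(size_t global_size,
                       size_t preferred_multiple,
                       size_t max_wg_size)
{
    assert(global_size > 0 && preferred_multiple > 0);

    /* Start from the largest multiple that fits within max_wg_size. */
    size_t candidate = (max_wg_size / preferred_multiple) * preferred_multiple;

    /* Step down one multiple at a time until one divides the global size. */
    for (; candidate >= preferred_multiple; candidate -= preferred_multiple) {
        if (global_size % candidate == 0)
            return candidate;
    }
    return 0;
}
```

For example, with a global size of 960, a preferred multiple of 64 and a maximum work-group size of 256, this sketch would pick 192, since 256 does not divide 960 evenly.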
Using CodeXL, the work-group size used for each kernel can be found in the "WorkGroupSize" column of the "GPU Performance Counters" profiling report. For more information, please see "CodeXL User Guide (Help) > Using CodeXL > GPU Profiler > Using the GPU Profiler > GPU Profiler Performance Counters Session".
On a related note, if work_dim is 2 and local size is set to (1, 1), would this effectively result in only one kernel instance running in each workgroup?
Yes, it means each work-group has only one work-item to execute.
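The arithmetic behind this can be checked with a small sketch in plain C. The two helper functions below are hypothetical names used only for illustration and are not OpenCL calls. They assume that each global dimension is evenly divisible by the matching local dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-items in one work-group,
 * i.e. the product of the local sizes over all dimensions. */
size_t work_items_per_group(const size_t *local, unsigned work_dim)
{
    size_t items = 1;
    for (unsigned d = 0; d < work_dim; ++d) {
        assert(local[d] > 0);
        items *= local[d];
    }
    return items;
}

/* Hypothetical helper: total number of work-groups in the NDRange,
 * assuming global[d] is a multiple of local[d] in every dimension. */
size_t num_work_groups(const size_t *global, const size_t *local,
                       unsigned work_dim)
{
    size_t groups = 1;
    for (unsigned d = 0; d < work_dim; ++d) {
        assert(local[d] > 0 && global[d] % local[d] == 0);
        groups *= global[d] / local[d];
    }
    return groups;
}
```

With work_dim = 2, a global size of (8, 4) and a local size of (1, 1), work_items_per_group gives 1 and num_work_groups gives 32, so there are as many work-groups as work-items. A local size this small usually leaves most of the hardware lanes in each work-group idle on a GPU.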
Thanks.