I think the clGetKernelWorkGroupInfo() API should help.
Note that this API returns the work-group size that is safe to use on that device.
When enqueuing a kernel, you can specify whatever work-group size you want.
So the work-group size a kernel is actually executing with is not a property of the cl_kernel object.
The application has to track that itself.
clGetKernelWorkGroupInfo() returns the following:
How am I supposed to get work group number out of these?
IMHO, you are interested in knowing the number of work-groups from inside the kernel. The API suggested above is useful when you are launching work-groups from the host side.
To get the number of work-groups from kernel space, use size_t get_num_groups (uint dimindx)
To get the ID of the work-group a work-item (thread) belongs to, use size_t get_group_id (uint dimindx)
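To make the relationship between these built-ins concrete, here is a rough sketch in plain host-side C (not kernel code). The emu_* helpers are hypothetical names, not part of OpenCL; they only mimic what the built-ins would report, assuming a 1-D range, a zero global offset, and a local size that divides the global size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, NOT part of OpenCL. They mimic on the host what the
   kernel built-ins report for a 1-D range with zero global offset. */

/* Value get_num_groups(0) would report. */
size_t emu_num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Value get_group_id(0) would report for the work-item whose
   get_global_id(0) is gid. */
size_t emu_group_id(size_t gid, size_t local_size)
{
    assert(local_size > 0);
    return gid / local_size;
}

/* Value get_local_id(0) would report for the same work-item. */
size_t emu_local_id(size_t gid, size_t local_size)
{
    assert(local_size > 0);
    return gid % local_size;
}
```

With a global size of 1024 and a local size of 256, work-item 300 lands in group 1 (out of 4 groups) at local position 44.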
I'd prefer to get group number from host side.
How do I use suggested API?
What do you mean by that? How can you get the group number from host code? Work-groups only execute on the device.
You can set the number of work-groups to whatever value you please in host code, or you can query the runtime to give you a favorable work-group size.
1. I want to know HOW MANY WORK GROUPS are being executed.
Check the global_work_size and local_work_size arguments of the clEnqueueNDRangeKernel function.
Number of work-groups = globalSize / localSize (in each dimension)
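As a small sketch of that formula in host-side C: the helper name total_work_groups is made up for illustration, and it assumes the local size divides the global size in every dimension. It takes the same kind of arrays that would be passed to clEnqueueNDRangeKernel and multiplies the per-dimension counts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): total number of work-groups for
   an NDRange described by work_dim, global_size[] and local_size[]. */
size_t total_work_groups(unsigned work_dim,
                         const size_t *global_size,
                         const size_t *local_size)
{
    size_t total = 1;
    for (unsigned d = 0; d < work_dim; ++d) {
        /* Assumption: local size evenly divides global size per dimension. */
        assert(local_size[d] > 0 && global_size[d] % local_size[d] == 0);
        total *= global_size[d] / local_size[d];
    }
    return total;
}
```

For a 1-D launch with a global size of 2^30 and a local size of 256 this gives 4,194,304 work-groups; for a 2-D launch of 1024 x 512 with 16 x 16 groups it gives 64 * 32 = 2048.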
2. I want to know how can I set the number of work groups myself.
You set the number of work-groups through those same two values. Keep in mind that localSize must divide globalSize evenly.
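When the problem size is not already a multiple of the local size, one common workaround (not something prescribed in this thread) is to pad the global size up to the next multiple and have the kernel ignore the extra work-items. A minimal host-side sketch, with round_up_global_size as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is
   >= wanted_items. The kernel is then expected to skip any work-item
   whose global ID is >= wanted_items. */
size_t round_up_global_size(size_t wanted_items, size_t local_size)
{
    assert(local_size > 0);
    size_t groups = (wanted_items + local_size - 1) / local_size;
    return groups * local_size;
}
```

For example, 1000 items with a local size of 256 would be padded to a global size of 1024, i.e. 4 work-groups.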
As far as I understand, the number of work-groups cannot be more than the maximum number of work-groups for a given device.
If I use the "Number of work-groups = globalSize / localSize" equation as you suggest, I get some unrealistic results.
For example, my globalSize is around 1,000,000,000 (2^30 to be exact) and the maximum localSize is 256. Your equation gives Number of work-groups = 4,194,304. One of my devices has a maximum of 248 work-groups, the other 256.
Where are you getting these maximum workgroup numbers?
My guess about your numbers is that a Tahiti card with 32 cores can have 8 work-groups per core due to barrier resources, so it would be able to support 256 concurrent work-groups if they had more than one wavefront each. It could support thousands of concurrent single-wavefront work-groups, though, because in that case a work-group consumes no significantly limited resources beyond what the wavefront itself uses.
If the runtime lets you launch a billion work items, then you will get enough workgroups to satisfy that. They won't all be concurrent and instead will have their entry onto the device serialized such that more will be launched as others complete, freeing resources in the process, but they will be run eventually.
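To put rough numbers on that serialization, here is a back-of-the-envelope sketch in C. It assumes, purely for illustration, that the device keeps at most max_concurrent work-groups resident and processes them in whole batches; real hardware refills slots individually as groups finish, so this is only a lower bound on the number of "rounds". The helper name min_rounds is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical estimate: minimum number of batches needed to run
   num_groups work-groups when at most max_concurrent can be resident
   on the device at once (ceiling division). */
size_t min_rounds(size_t num_groups, size_t max_concurrent)
{
    assert(max_concurrent > 0);
    return (num_groups + max_concurrent - 1) / max_concurrent;
}
```

With the numbers from this thread, 4,194,304 work-groups on a device that holds 256 at a time would need at least 16,384 rounds, and at least 16,913 on the device that holds 248.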