Archives Discussions


How do I get the number of work groups?

What is the best way to know the number of work groups used by a kernel?

17 Replies

I think the clGetKernelWorkGroupInfo() API should help.

Note that this API returns the work-group size that is safe to use for that device.

While enqueuing a kernel, one can specify any work-group size one wants.

So the actual work-group size with which a kernel is currently executing is not a property of the cl_kernel object.

The application has to track all of this.
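For reference, a host-side query might look like the following fragment (it assumes a valid kernel and device handle already exist, so it won't run outside a full OpenCL program):

```c
size_t wgSize = 0;
cl_int err = clGetKernelWorkGroupInfo(kernel, device,
                                      CL_KERNEL_WORK_GROUP_SIZE,
                                      sizeof(wgSize), &wgSize, NULL);
/* wgSize is the largest work-group size this kernel can safely use
   on this device; it says nothing about how many groups will run. */
```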



clGetKernelWorkGroupInfo() returns the following: CL_KERNEL_WORK_GROUP_SIZE, CL_KERNEL_COMPILE_WORK_GROUP_SIZE, CL_KERNEL_LOCAL_MEM_SIZE, and (in OpenCL 1.1 and later) CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_PRIVATE_MEM_SIZE.

How am I supposed to get the work-group count out of these?


IMHO, you are interested in knowing the number of workgroups from inside the kernel. The API suggested above is useful while you are spawning workgroups from the host side.

To get the number of workgroups from kernel code, use size_t get_num_groups(uint dimindx).

To get the workgroup number a work-item belongs to, use size_t get_group_id(uint dimindx).
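As a sketch, an OpenCL C kernel can read both values with those built-ins (the kernel name and output buffer here are made up for illustration; this is device code and won't compile as plain C):

```c
__kernel void count_groups(__global uint *out)
{
    /* Total number of work-groups in dimension 0 of this NDRange. */
    size_t num = get_num_groups(0);
    /* Index of the work-group this work-item belongs to. */
    size_t gid = get_group_id(0);

    /* Have exactly one work-item report the count back to the host. */
    if (get_global_id(0) == 0)
        out[0] = (uint)num;
}
```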


I'd prefer to get the group number from the host side.

How do I use the suggested API?


pbcpbt wrote:

I'd prefer to get the group number from the host side.

How do I use the suggested API?

What do you mean by that? How can you get the group number from host code? Work-groups only execute on the device.

You can set the number of work-groups to whatever value you please in host code, or you can query the runtime for a favorable work-group size.


1. I want to know HOW MANY WORK GROUPS are being executed.

2. I want to know how I can set the number of work groups myself.


pbcpbt wrote:

1. I want to know HOW MANY WORK GROUPS are being executed.

Check the globalSize and localSize arguments of the clEnqueueNDRangeKernel function:

Number of Workgroups = globalSize / localSize

2. I want to know how I can set the number of work groups myself.

You can set these same arguments to control the number of workgroups. Keep in mind that localSize must evenly divide globalSize.


As far as I understand, the number of work groups cannot be more than the maximum work-group number for any given device.

If I use the "Number of Workgroups = globalSize / localSize" equation as you suggest, I get some unreal results.

For example, I have a globalSize around 1,000,000,000 (2^30 to be exact) and a maximum localSize of 256. Your equation gives Number of Workgroups = 4,194,304. One of my devices has a maximum number of workgroups of 248, the other 256.


Where are you getting these maximum workgroup numbers?

My guess is that a Tahiti card with 32 cores can have 8 workgroups per core due to barrier resources, so it would be able to support 256 concurrent workgroups if they had more than one wavefront each. It could support thousands of concurrent single-wavefront workgroups, though, because in that case a workgroup consumes no significantly limited resources beyond what the wavefront itself does.

If the runtime lets you launch a billion work-items, then you will get enough workgroups to satisfy that. They won't all be concurrent; their entry onto the device will be serialized so that more are launched as others complete, freeing resources in the process, but they will all run eventually.


I get these numbers while running the kernel analyzer.


pbcpbt wrote:

I'd prefer to get the group number from the host side.

How do I use the suggested API?

To reiterate: the work-group size of a currently executing kernel instance is not a property of the cl_kernel object that was used to launch it.

As said above, the global work-items and local work-items determine how many workgroups are spawned.

The GPU can run millions of work-items, but at any given time only a few of them can be active. As workgroups finish, the hardware scheduler picks up the remaining workgroups and executes them... something like a queuing mechanism.

Hope this clarifies all your doubts.


Well, how do I know how many work groups are concurrently executing on my device, and how can I change that number?


I'm not sure you can tell how many are currently executing. You can't control it either, not directly anyway. The number of concurrent workgroups is a result of:

1) The resources consumed by each workgroup

2) The amount of resources the machine possesses

3) The amount of other work the machine is doing

So if your kernel is the only one running (which is the usual state of things with current runtimes, unless you plan otherwise), then it depends only on 1 and 2. The calculation is roughly the same as the one used for occupancy.

So, thinking of Tahiti:

1) You can only have up to 8 workgroups per CU.

2) There is 64 kBytes of LDS per CU. If each workgroup uses 16 kBytes, you may have up to 4 workgroups on the CU.

3) There are 256 registers per SIMD unit (1/4 of a CU). If each wavefront uses 16 registers, you may have up to 16 wavefronts on the SIMD unit, or 64 for the CU.

4) There are 32 CUs, so you can multiply the per-CU numbers up accordingly.

Beyond that, there will be peaks as different limits take over. If your workgroups are large, you may be limited in the number of wavefronts because you can't fit half a workgroup on the CU, for example. So you can calculate all of this based on the resources that kernelanalyzer and the profiler report.

OpenCL doesn't guarantee any concurrency of workgroups, so it is generally unsafe to synchronize between them. As a result you don't usually need to know precisely how much concurrency you're going to get, only an estimate of how much so that you don't launch more workgroups than you need.

I'm using work group number in my code:

for(uint i = get_group_id(0); i < n; i += get_num_groups(0))

and don't feel comfortable without knowing its value.

Is there any way to printf() it from the kernel? I tried several times with no luck - it gives me a segmentation fault.

By the way, kernelanalyzer gives me 13 as the work-group number for Tahiti and 8 for Turks.


That number is the total number of groups in the dispatch, not the concurrent number. It's calculated as Himanshu said earlier:

"Check the globalSize and localSize arguments of the clEnqueueNDRangeKernel function:

Number of Workgroups = globalSize / localSize"

Or, if you want to think of it another way, decide how many work groups you want and how big you want each of them to be:

size_t numGroups = 100;
size_t localSize[] = {64};
size_t globalSize[1];
globalSize[0] = numGroups * localSize[0];
clEnqueueNDRangeKernel(....globalSize, localSize...);

I profiled my code with CodeXL and it shows 100% kernel occupancy and no limiting factors.

Does it mean there is nothing I can do in terms of performance increase?


It means your compute units have enough wavefronts to execute to saturate them. The number of wavefronts is not limited by the GPR or LDS usage of the kernel. (Click on "100%" in CodeXL to see this information.)

There can still be a lot of ways to improve performance:

1. Try optimizing global and local reads/writes by using coalesced accesses and avoiding bank/channel conflicts.

2. Try using native instructions (e.g. native_sin instead of sin).

3. Try to remove conditionals from the kernel, and make sure a minimum of clause switches happen.

Well, it's a big list. Better refer to the AMD OpenCL Programming Guide, and check out the device-specific optimizations based on architecture (VLIW or GCN).