OpenCL

himanshu_gautam
Grandmaster

Re: How do I get the number of work groups?

pbcpbt wrote:

I'd prefer to get group number from host side.

How do I use suggested API?

To reiterate: the workgroup size of the currently executing kernel instance is not a property of the cl_kernel object that was used to launch it.

As said above, the global work size and the local work size determine how many workgroups are spawned.

The GPU can run millions of work-items. However, at any given time only a few of them can be active. As workgroups finish, the hardware scheduler picks up the remaining workgroups and executes them... something like a queuing mechanism.
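A minimal host-side sketch of that arithmetic (the sizes in the usage note are made-up illustrations, not values from this thread; in a real program they are the globalSize and localSize you pass to clEnqueueNDRangeKernel):

```c
#include <stddef.h>

/* Number of workgroups the runtime spawns for a 1-D dispatch.
 * In OpenCL 1.x, globalSize must be an exact multiple of localSize. */
size_t num_workgroups(size_t globalSize, size_t localSize) {
    return globalSize / localSize;
}
```

For example, a dispatch of 1024*1024 work-items with a local size of 256 spawns 4096 workgroups.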

Hope this clarifies all your doubts,

pbcpbt
Adept I

Re: How do I get the number of work groups?

These are the numbers I get while running the kernel analyzer.

pbcpbt
Adept I

Re: How do I get the number of work groups?

Well, how do I know how many work groups are concurrently executing on my device and how to change that number?

LeeHowes
Staff

Re: How do I get the number of work groups?

I'm not sure you can tell how many are currently executing, and you can't control it either, not directly anyway. The number of concurrent workgroups is a result of:

1) The resources consumed by each workgroup

2) The amount of resources the machine possesses

3) The amount of other work the machine is doing

So if your kernel is the only one running (which is the usual state of things with current runtimes unless you plan otherwise) then it depends only on 1 and 2. The calculation for this is roughly the same as the calculations used for occupancy.

So, thinking of Tahiti:

1) You can only have up to 8 workgroups per CU.

2) There are 64 kBytes of LDS per CU. If each workgroup uses 16 kBytes, you may have up to 4 workgroups on the CU.

3) There are 256 registers per SIMD unit (1/4 of a CU). If each wavefront uses 16 registers, you may have up to 16 wavefronts on the SIMD unit, or 64 for the CU.

4) There are 32 CUs, so you can multiply the per-CU numbers up accordingly.

Beyond that there will be peaks as different limits take over. If your workgroups are large, you may be limited in the number of wavefronts because you can't fit half a workgroup on the CU, for example. So you can calculate all of this based on the resources that kernelanalyzer and the profiler report.
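As a rough sketch of that occupancy arithmetic, using only the figures quoted in this post (they describe a Tahiti-like part and are assumptions for illustration, not authoritative hardware documentation):

```c
#include <stddef.h>

/* Per-CU limits as quoted above for a Tahiti-like GCN part. */
#define MAX_GROUPS_PER_CU   8              /* hard cap on workgroups per CU */
#define LDS_BYTES_PER_CU    (64 * 1024)    /* local memory per CU           */
#define REGS_PER_SIMD       256            /* registers per SIMD unit       */
#define SIMDS_PER_CU        4              /* SIMD units per CU             */
#define NUM_CUS             32             /* CUs on the device             */

static size_t min2(size_t a, size_t b) { return a < b ? a : b; }

/* Workgroups resident on one CU, limited by the group cap and LDS use. */
size_t groups_per_cu(size_t ldsBytesPerGroup) {
    size_t ldsLimit = ldsBytesPerGroup
                        ? LDS_BYTES_PER_CU / ldsBytesPerGroup
                        : MAX_GROUPS_PER_CU;
    return min2(MAX_GROUPS_PER_CU, ldsLimit);
}

/* Wavefronts resident on one CU, limited by register use per wavefront. */
size_t wavefronts_per_cu(size_t regsPerWavefront) {
    return (REGS_PER_SIMD / regsPerWavefront) * SIMDS_PER_CU;
}
```

With the example numbers above: 16 kB of LDS per group gives 4 groups per CU, and 16 registers per wavefront gives 64 wavefronts per CU; multiply either by the 32 CUs for a device-wide estimate.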

OpenCL doesn't guarantee any concurrency of workgroups, so it is generally unsafe to synchronize between them. As a result you don't usually need to know precisely how much concurrency you're going to get, only an estimate of how much so that you don't launch more workgroups than you need.

pbcpbt
Adept I

Re: How do I get the number of work groups?

I'm using work group number in my code:

for(uint i = get_group_id(0); i < n; i += get_num_groups(0))

and don't feel comfortable without knowing its value.
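For what it's worth, this is a standard grid-stride pattern, and it is safe whatever get_num_groups(0) turns out to be: the groups jointly cover every index in [0, n) exactly once. A small host-side C simulation of the loop (the sizes in the test are made up for illustration):

```c
#include <string.h>

/* Host-side simulation of the kernel's grid-stride loop: each "group"
 * g walks i = g, g + numGroups, g + 2*numGroups, ...  Records in hits[]
 * how many times each index 0..n-1 is visited. */
void simulate_grid_stride(unsigned n, unsigned numGroups, unsigned char *hits) {
    memset(hits, 0, n);
    for (unsigned g = 0; g < numGroups; ++g)          /* get_group_id(0)  */
        for (unsigned i = g; i < n; i += numGroups)   /* the loop above   */
            hits[i]++;
}
```

For example, with n = 100 and 13 groups, every index is hit exactly once.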

Is there any way to printf() it from the kernel? I tried several times with no luck - it gives me a segmentation fault.

By the way - kernelanalyzer gives me 13 as a work group number for Tahiti and 8 for Turks.

LeeHowes
Staff

Re: How do I get the number of work groups?

That number is the total number of groups in the dispatch, not the concurrent number. It's calculated as Himanshu said earlier:

"Check the argument globalsize and localsize in clEnqueueNDRangeKernel function.

Number of Workgroups = globalSize / local Size"

Or, if you want to think of it another way, decide how many work groups you want and how big you want each of them to be:

size_t numGroups = 100;
size_t localSize[] = {64};
size_t globalSize[1];
globalSize[0] = numGroups * localSize[0];
clEnqueueNDRangeKernel(....globalSize, localSize...);

pbcpbt
Adept I

Re: How do I get the number of work groups?

I profiled my code with CodeXL and it shows 100% kernel occupancy and no limiting factors.

Does it mean there is nothing I can do in terms of performance increase?

himanshu_gautam
Grandmaster

Re: How do I get the number of work groups?

It means your compute units have enough wavefronts to execute to keep them saturated. The number of wavefronts is not limited by the GPR or LDS usage of the kernel. (Click on the "100%" figure in CodeXL to see this information.)

There can still be a lot of ways to improve performance:

1. Try optimizing global and local reads/writes by using coalesced accesses and avoiding bank/channel conflicts.

2. Try using native instructions (e.g. native_sin instead of sin).

3. Try to remove conditionals from the kernel, and make sure as few clause switches as possible happen.

Well, it's a big list. Better to refer to the AMD OpenCL Programming Guide. Check out the device-specific optimizations based on architecture (VLIW or GCN).
