Tasp
Journeyman III

Performance, Workgroup size

This is from the documentation of the C++ bindings:

global     describes the number of global work-items that will execute the kernel function. The total number of global work-items is computed as global_work_size[0] * ... * global_work_size[work_dim - 1].

local     describes the number of work-items that make up a work-group (also referred to as the size of the work-group) that will execute the kernel specified by kernel.



If local is NullRange and no work-group size is specified when the kernel is compiled, the OpenCL implementation will determine how to break the global work-items specified by global into appropriate work-group instances. The work-group size to be used for kernel can also be specified in the program source using the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier. In this case the size of work group specified by local_work_size must match the value specified by the reqd_work_group_size attribute qualifier.


Now I just set "local" to NullRange, but this leads to bad performance: an Intel Core 2 Duo @ 3.0 GHz is faster than the HD4850 on kernels that mostly do convolutions.
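For reference, here is roughly what the two launch variants look like with the C++ bindings. This is only a self-contained sketch with a throw-away copy kernel (not my actual convolution), so every name in it is a placeholder:

// Self-contained sketch (error checking omitted): build a throw-away copy
// kernel and launch it twice, once with local = NullRange and once with an
// explicit local size.
#include <CL/cl.hpp>
#include <cstring>
#include <utility>
#include <vector>

static const char* src =
    "__kernel void copy_img(__global const float* in, __global float* out)\n"
    "{\n"
    "    size_t x = get_global_id(0), y = get_global_id(1);\n"
    "    size_t w = get_global_size(0);\n"
    "    out[y * w + x] = in[y * w + x];\n"
    "}\n";

int main()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);

    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);

    cl::Program::Sources sources(1, std::make_pair(src, strlen(src)));
    cl::Program program(context, sources);
    program.build(devices);
    cl::Kernel kernel(program, "copy_img");

    const size_t W = 800, H = 600;
    cl::Buffer in(context, CL_MEM_READ_ONLY, W * H * sizeof(float));
    cl::Buffer out(context, CL_MEM_WRITE_ONLY, W * H * sizeof(float));
    kernel.setArg(0, in);
    kernel.setArg(1, out);

    // 1) Let the runtime choose the work-group size (what I do now):
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(W, H), cl::NullRange);

    // 2) Ask for an explicit 16x8 work-group: 128 work-items, a multiple of
    //    64, and it divides 800x600 evenly in both dimensions.
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(W, H), cl::NDRange(16, 8));
    queue.finish();
    return 0;
}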

From the convolution example:

In the above call, we also need to pass in a workgroup size. During computation, items within a work-group can share certain data and avail of some synchronization mechanisms that are not available to items across workgroups. We do not need any of those features in our current kernel, so it is tempting to use a workgroup of size 1.

 

While that will work in principle and produce correct results, that can produce bad performance. There are many considerations while choosing the appropriate workgroup size, including which device (CPU or GPU) the kernel is to be run on. We will not go into those details in this writeup; for our runs on the CPU device, we will use the largest possible workgroup size (32x32).

Now on a CPU device I get:

Max compute units:                 2
  Max work items dimensions:             3
    Max work items[0]:                 1024
    Max work items[1]:                 1024
    Max work items[2]:                 1024
  Max work group size:                 1024


On the HD4850 it's 200 compute units and a max work-group size of 256 instead of 1024 (if I remember correctly).

My question now is: how do I choose the local work-group size for best performance if I want to do simple convolutions on images ranging from 100x100 to 2000x2000?

0 Likes
18 Replies
n0thing
Journeyman III

On ATI GPUs your local work-group size should be at least 64 and a multiple of 64.

So the only choices you have are 64, 128, or 256.

If you use a lot of local (shared) memory, fewer work-groups will be active per SIMD and latency hiding of memory operations will suffer, so in that case you should use a smaller work-group size.

0 Likes
omkaranathan
Adept I

Tasp,

Workgroup size is limited by the number of registers used per thread and the local memory used per workgroup. You have to keep both of these minimal to get the best workgroup size. The clGetKernelWorkGroupInfo API call will give you the maximum workgroup size that can be used to execute your kernel. It's preferable to have multiples of 64 as the workgroup size.
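For example, with the C++ bindings the query looks roughly like this (a sketch only; "kernel" and "device" are assumed to be the cl::Kernel and cl::Device you already created):

// Sketch: query the per-kernel and per-device limits before picking a local size.
#include <CL/cl.hpp>
#include <iostream>
#include <vector>

void printWorkGroupLimits(const cl::Kernel& kernel, const cl::Device& device)
{
    size_t kernelMax = kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);
    size_t deviceMax = device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();
    std::vector<size_t> itemMax = device.getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();

    std::cout << "Max work-group size for this kernel: " << kernelMax << "\n"
              << "Device max work-group size:          " << deviceMax << "\n"
              << "Max work-items per dimension:        " << itemMax[0] << " x "
              << itemMax[1] << " x " << itemMax[2] << std::endl;
}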

0 Likes

Thank you for the input.
For 2D kernels this would mean:

CPU: 1024 = NDRange(32, 32)

HD4850: 256 = NDRange(16, 16)

?

0 Likes

Yes.

0 Likes

Also, you should use the attribute "reqd_work_group_size" when you know what work group size you will use when you dispatch the kernel.  This will allow the compiler to optimize more effectively.
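For example (only a sketch; the kernel name "convolve" and its arguments are placeholders, not your actual convolution kernel):

// Sketch: a kernel that pins its work-group size at compile time.
const char* convSrc =
    "__kernel __attribute__((reqd_work_group_size(16, 16, 1)))\n"
    "void convolve(__global const float* in, __global float* out)\n"
    "{\n"
    "    /* ... convolution body ... */\n"
    "}\n";

// The local size passed on the host must then match the attribute exactly:
//   queue.enqueueNDRangeKernel(kernel, cl::NullRange,
//                              cl::NDRange(width, height),
//                              cl::NDRange(16, 16));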

0 Likes

I understand it like this:

 

Example image 800x600 pixels.

size_t globalRange[2] = {800, 600};

size_t localRange[2] = {32, 24}; // with max work-group size 1024

 

I think 25 work-items get started to do the work (25 * 32 = 800, 25 * 24 = 600).

 

Regards,

Joerg

0 Likes

Originally posted by: masm32 I understand it like this:

Example image 800x600 pixels.

size_t globalRange[2] = {800, 600};

size_t localRange[2] = {32, 24}; // with max work-group size 1024

I think 25 work-items get started to do the work (25 * 32 = 800, 25 * 24 = 600).

 

 



Masm32,

     Not 25 work-items. Actually 625 work-groups (25 x 25), each having 32 * 24 work-items.
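Spelling out the arithmetic (plain host-side bookkeeping, nothing OpenCL-specific):

#include <cstdio>

int main()
{
    // Work-group bookkeeping for the 800x600 example with a 32x24 local size.
    const unsigned global[2] = {800, 600};
    const unsigned local[2]  = {32, 24};

    const unsigned groupsX  = global[0] / local[0];   // 800 / 32 = 25
    const unsigned groupsY  = global[1] / local[1];   // 600 / 24 = 25
    const unsigned groups   = groupsX * groupsY;      // 25 * 25  = 625 work-groups
    const unsigned perGroup = local[0] * local[1];    // 32 * 24  = 768 work-items per group

    std::printf("%u work-groups of %u work-items = %u work-items total\n",
                groups, perGroup, groups * perGroup); // 625 * 768 = 480000 = 800 * 600
    return 0;
}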

0 Likes

Hi

Just want to clarify workgroup size versus compute unit (SIMD), to remove my confusion.

The HD5850 has 18 compute units; my questions are based on the above example.

1. Each work-group is executed on one compute unit, right?

2. That means that at any one time the first 18 work-groups out of the total 625 (25 x 25) work-groups will be executed in parallel, then the next 18 work-groups, and so on, right?

3. Coming back to execution within a single compute unit: there are 16 stream cores per compute unit, so 16 kernel instances will execute in parallel in one SIMD, right?

Please clarify ASAP.

Thanks

Pavan

 

0 Likes

yes

0 Likes
Raistmer
Adept II

Then a workgroup of a single full wavefront (64 work-items/threads) will take at least 4 clock cycles per instruction, instead of at least 1 clock per instruction as it would be for a workgroup of only 16 threads. Right?
0 Likes

yes

0 Likes

Originally posted by: Raistmer Then a workgroup of a single full wavefront (64 work-items/threads) will take at least 4 clock cycles per instruction, instead of at least 1 clock per instruction as it would be for a workgroup of only 16 threads. Right?


I think Nou was too quick with his answer. The wavefront is the smallest execution unit, so even if you try to run only 16 work-items, a full wavefront is executed (and the unneeded results are discarded).

To see the impact of workgroup size on performance, I propose running the peekflops example in the CAL++ library (latest SVN version). Kernel execution time for workgroup sizes 8-64 is exactly the same.

On 4xxx cards, execution time for workgroup size 128 is also almost the same as for workgroup size 64 (probably due to wavefront scheduling). A worksize of ~256 allows achieving almost full performance.

PS. The kernel in the peekflops example is really heavy on computation and easy on registers. Other kernels might use too many registers to allow a workgroup size >= 256.
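If you want to check this on your own kernel rather than on peekflops, a rough way is to loop over candidate work-group sizes with event profiling enabled. Here is a sketch only: it assumes the context, device and kernel objects already exist, that the 1-D global size (2048) is something every tested local size divides, and that the kernel tolerates every size tried.

// Sketch of a timing loop in the spirit of the peekflops test: run the same
// kernel with several work-group sizes and compare event-profiled runtimes.
#include <CL/cl.hpp>
#include <iostream>

void timeWorkGroupSizes(const cl::Context& context, const cl::Device& device,
                        cl::Kernel& kernel)
{
    cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE);

    const size_t globalSize = 2048;
    const size_t localSizes[] = {16, 32, 64, 128, 256};

    for (size_t i = 0; i < sizeof(localSizes) / sizeof(localSizes[0]); ++i) {
        cl::Event evt;
        queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                                   cl::NDRange(globalSize),
                                   cl::NDRange(localSizes[i]),
                                   NULL, &evt);
        evt.wait();
        cl_ulong start = evt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
        cl_ulong end   = evt.getProfilingInfo<CL_PROFILING_COMMAND_END>();
        std::cout << "work-group size " << localSizes[i] << ": "
                  << (end - start) * 1e-6 << " ms" << std::endl;
    }
}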

0 Likes

The effect is even more visible with a slightly modified version of peekflops (it tests worksizes from 16 to 512 in steps of 16).

Here are the results for a 4770.

*** 1 wavefront ***
Device 0: workgroup size 16 execution time 2320.28 ms, achieved 96.72 gflops
Device 0: workgroup size 32 execution time 2315.40 ms, achieved 193.84 gflops
Device 0: workgroup size 48 execution time 2309.81 ms, achieved 291.47 gflops
Device 0: workgroup size 64 execution time 2348.95 ms, achieved 382.15 gflops
*** 2 wavefronts ***
Device 0: workgroup size 80 execution time 2354.55 ms, achieved 476.55 gflops
Device 0: workgroup size 96 execution time 2354.54 ms, achieved 571.86 gflops
Device 0: workgroup size 112 execution time 2354.54 ms, achieved 667.17 gflops
Device 0: workgroup size 128 execution time 2354.54 ms, achieved 762.48 gflops
*** 3 wavefronts ***
Device 0: workgroup size 144 execution time 2936.13 ms, achieved 687.88 gflops
Device 0: workgroup size 160 execution time 2941.72 ms, achieved 762.86 gflops
Device 0: workgroup size 176 execution time 2936.13 ms, achieved 840.74 gflops
Device 0: workgroup size 192 execution time 2930.54 ms, achieved 918.93 gflops
*** 4 wavefronts ***
Device 0: workgroup size 208 execution time 3892.39 ms, achieved 749.50 gflops
Device 0: workgroup size 224 execution time 3892.40 ms, achieved 807.16 gflops
Device 0: workgroup size 240 execution time 3892.38 ms, achieved 864.81 gflops
Device 0: workgroup size 256 execution time 3892.38 ms, achieved 922.47 gflops
*** 5 wavefronts ***
Device 0: workgroup size 272 execution time 5424.35 ms, achieved 703.31 gflops
Device 0: workgroup size 288 execution time 5430.10 ms, achieved 743.89 gflops
Device 0: workgroup size 304 execution time 5424.76 ms, achieved 785.99 gflops
Device 0: workgroup size 320 execution time 5430.54 ms, achieved 826.48 gflops
*** 6 wavefronts ***
Device 0: workgroup size 336 execution time 6733.01 ms, achieved 699.93 gflops
Device 0: workgroup size 352 execution time 6732.02 ms, achieved 733.37 gflops
Device 0: workgroup size 368 execution time 6707.62 ms, achieved 769.49 gflops
Device 0: workgroup size 384 execution time 6708.28 ms, achieved 802.87 gflops
*** 7 wavefronts ***
Device 0: workgroup size 400 execution time 7715.53 ms, achieved 727.14 gflops
Device 0: workgroup size 416 execution time 7715.81 ms, achieved 756.20 gflops
Device 0: workgroup size 432 execution time 7713.82 ms, achieved 785.49 gflops
Device 0: workgroup size 448 execution time 7706.35 ms, achieved 815.37 gflops
*** 8 wavefronts ***
Device 0: workgroup size 464 execution time 7721.38 ms, achieved 842.85 gflops
Device 0: workgroup size 480 execution time 7717.20 ms, achieved 872.38 gflops
Device 0: workgroup size 496 execution time 7716.08 ms, achieved 901.59 gflops
Device 0: workgroup size 512 execution time 7682.24 ms, achieved 934.78 gflops

0 Likes
Raistmer
Adept II

Interesting info.
So there is actually no way to get 1 clock per single operation; the only possibility is 4 clocks for 4 operations. And if 4 operations aren't needed, the GPU will underperform.
0 Likes

Hi

Adding some more doubts with reference to the above example:

1. A wavefront of 64 means execution of 64 work-items (64 kernel instances) per compute unit (SIMD engine) at a time, right? If yes, then:

a. How can 64 work-items execute in parallel when you have 16 stream cores? Is it 16 or 64 work-items in parallel? I am confused here because I'm not sure whether the four processing elements in each stream core:

    a. process 4 VLIW instructions in parallel from a single kernel instance, over 4 cycles (i.e. 16 parallel kernel instances),

    OR

    b. process 4 VLIW instructions, one from each of 4 kernel instances (i.e. 16 x 4 = 64 parallel kernel instances).

Which is correct?

b. A wavefront is defined for a single compute unit, right? And not for the complete GPU's 18 compute units?

c. As a whole, for 18 compute units, 1152 (18 x 64) work-items would be executing in parallel, right?

2. Each work-group will execute on a single compute unit, right? There won't be any distribution of its wavefronts to other compute units. For example, for a work-group size of 256 there will be 4 wavefronts, and each wavefront will get executed one after the other on the same compute unit, right?

Thanks for your patience in reading my questions, and thanks in advance for clarifying.


--Pavan

0 Likes

Bringing to top.

Please, could anyone clarify my doubts inline?

Thanks

Pavan

0 Likes

Originally posted by: pavandsp 1. A wavefront of 64 means execution of 64 work-items (64 kernel instances) per compute unit (SIMD engine) at a time, right? If yes, then:

Yes, you are right.

a. How can 64 work-items execute in parallel when you have 16 stream cores? Is it 16 or 64 work-items in parallel? Do the four processing elements in each stream core (a) process 4 VLIW instructions from a single kernel instance over 4 cycles (16 parallel kernel instances), or (b) process instructions from 4 kernel instances (16 x 4 = 64 parallel kernel instances)? Which is correct?

The 64 work-items run in parallel as follows:

The first 16 work-items are executed during the first clock cycle,

the next 16 work-items during the second clock cycle,

the next 16 work-items during the third clock cycle,

and the last 16 work-items during the fourth clock cycle.

And each stream core can execute a 5-slot VLIW instruction.

If you only have 16 work-items, 3 clock cycles are wasted because you don't have enough work-items.

b. A wavefront is defined for a single compute unit, right? And not for the complete GPU's 18 compute units?

Yes, the wavefront is defined per compute unit, but a GPU usually contains symmetric compute units, so you can say "this GPU has a wavefront size of 64".

c. As a whole, for 18 compute units, 1152 (18 x 64) work-items would be executing in parallel, right?

Yes, 1152 work-items will be executed in parallel, but it takes four clock cycles.

2. Each work-group will execute on a single compute unit, right? There won't be any distribution of its wavefronts to other compute units. For example, for a work-group size of 256 there will be 4 wavefronts, and each wavefront will get executed one after the other on the same compute unit, right?

 

Yes

0 Likes

Originally posted by: genaganna

The 64 work-items run in parallel as follows: the first 16 work-items are executed during the first clock cycle, the next 16 during the second, the next 16 during the third, and the last 16 during the fourth.

And each stream core can execute a 5-slot VLIW instruction.

If you only have 16 work-items, 3 clock cycles are wasted because you don't have enough work-items.



There appears to be some confusion here.

First, if you have 16 stream cores, then you have a chip I've never heard of.

Second, each stream core (i.e. compute unit) works at wavefront granularity (in OpenCL, you may require multiple wavefronts for a single work-group). A wavefront is usually 64 threads (32 on the HD5400, aka Cedar).

A wavefront takes 4 clocks to execute a single VLIW instruction. Thus, those 64 (or 32) threads all run concurrently over 4 clocks. This gives an average of 16 (or 8) threads per clock. It's a bit more complicated when dealing with fetches, so I won't go into that here.

If you have 16 stream cores, then you can execute up to 16 wavefronts simultaneously, meaning you can execute 256 threads per clock on average.

Normally it's sufficient to consider average threads per clock, but you need to consider larger groups of threads when dealing with LDS sharing, etc.
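Putting the numbers together (back-of-the-envelope only, using the 18 compute units of the HD5850 quoted earlier in this thread rather than the hypothetical 16 above):

#include <cstdio>

int main()
{
    // Back-of-the-envelope throughput numbers from the figures in this thread.
    const unsigned waveSize      = 64;  // work-items per wavefront (32 on Cedar)
    const unsigned clocksPerVLIW = 4;   // one VLIW instruction per wavefront takes 4 clocks
    const unsigned simdEngines   = 18;  // compute units on the HD5850 mentioned earlier

    const unsigned itemsPerClockPerSIMD = waveSize / clocksPerVLIW;            // 64 / 4  = 16
    const unsigned itemsPerClockChip    = itemsPerClockPerSIMD * simdEngines;  // 16 * 18 = 288
    const unsigned itemsInFlight        = waveSize * simdEngines;              // 64 * 18 = 1152

    std::printf("%u work-items per clock per SIMD, %u per clock chip-wide,\n"
                "%u work-items in flight over %u clocks\n",
                itemsPerClockPerSIMD, itemsPerClockChip, itemsInFlight, clocksPerVLIW);
    return 0;
}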

 

Jeff

0 Likes