10 Replies Latest reply on Oct 10, 2012 2:45 AM by spectral

    Kernel occupancy and workgroup size

    cadorino

      Hi everybody.

      I'm developing a benchmark to estimate kernel completion time on integrated and discrete GPUs as a function of the number of operations executed per byte transferred.

      The kernel is very simple and "useless". Simply put, each thread reads the same constant argument and adds this value to an accumulator variable a certain number of times.

       

      What surprises me a little is how the kernel occupancy changes as I vary the global size and the work-group size.

      In particular, I set the global size to 256K and the work-group size to 64. On the 7970 the occupancy is 100%. On the A8-3850 (Llano) the occupancy is 25%. If I double the work-group size (to 128), the occupancy of the integrated GPU becomes 50%.

      Can you help me understand why this is?

       

      Thank you!

        • Re: Kernel occupancy and workgroup size
          LeeHowes

          The Evergreen GPUs, like the one in Llano, were I think limited to a small number of workgroups per core: 8, maybe. SI is limited by either the number of wavefronts (40 per core) or the number of barriers (8). Note that barriers == workgroups for groups with more than one wavefront, but a group with only one wavefront uses 0 barriers, so the barrier limit becomes irrelevant and you are wavefront limited. My guess here is that you use few resources per wave, and with 64 work-items per group you have one wave per group, giving only 8 waves per core, which doesn't cover latency. Double the group size and you double that to 16, while on SI you always had a lot more waves, so it didn't matter.
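          To make that arithmetic concrete, here is a rough sketch of the occupancy calculation (Python; the 8-group and 40-wave limits are from the description above, and the 32-wave cap for Llano is an assumption inferred from the 25% figure in the question, not an official spec):

```python
# Rough occupancy sketch based on the per-CU limits described above.
# Assumed limits (from this thread, not official documentation):
#   Evergreen (Llano): at most 8 workgroups resident per CU,
#                      ~32 wavefronts per CU (inferred from the 25% figure)
#   SI (7970):         at most 40 wavefronts resident per CU
WAVE_SIZE = 64

def resident_waves(workgroup_size, max_groups_per_cu, max_waves_per_cu):
    """Resident wavefronts per CU for a kernel that uses few other resources."""
    waves_per_group = max(1, workgroup_size // WAVE_SIZE)
    # Group-count limit vs wavefront-count limit: the smaller one wins.
    by_group_limit = max_groups_per_cu * waves_per_group
    return min(by_group_limit, max_waves_per_cu)

# Llano, workgroup size 64:  8 groups x 1 wave = 8 waves  -> 8/32  = 25%
llano_64 = resident_waves(64, max_groups_per_cu=8, max_waves_per_cu=32)
# Llano, workgroup size 128: 8 groups x 2 waves = 16 waves -> 16/32 = 50%
llano_128 = resident_waves(128, max_groups_per_cu=8, max_waves_per_cu=32)
print(llano_64 / 32, llano_128 / 32)  # 0.25 0.5
```

          With these assumed caps the model reproduces the 25% and 50% figures from the question.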

            • Re: Kernel occupancy and workgroup size
              cadorino

              The fact is that the kernel is really simple. There are actually no memory accesses and no barriers. Here it is:

              kernel void runOps(global float* input, global float* output, uint iterations, uint return_id, float value)
              {
                  uint gid = get_global_id(0);
                  float temp = 0;
                  float element = value;

                  for(uint i = 0; i < iterations; i++) {
                      temp += element;
                  }

                  if(gid == return_id)
                      output[0] = temp;
              }

               

              Moreover, I'm trying to understand why the APU's GPU takes 7 milliseconds to run it when iterations = 512. That seems like too much time.

                • Re: Kernel occupancy and workgroup size
                  LeeHowes

                  It'll be control-flow bound. Performance wasn't something you brought up in your original post, though. Looking at this purely from an occupancy point of view, you will always be bounded by something, and in this case likely by the number of workgroups, which will indirectly limit your number of wavefronts and hence reduce your occupancy.

                   

                  There is a chapter on this in the new edition of Heterogeneous Computing with OpenCL, btw, which should be released soon.

                • Re: Kernel occupancy and workgroup size
                  cadorino

                  I'm reasoning about what you say: "barriers == workgroups for > 1 wavefront in the group".

                  So, all the wavefronts of a workgroup must be executed before a wavefront of another group scheduled on the same CU can run. Is this what you mean?

                    • Re: Kernel occupancy and workgroup size
                      LeeHowes

                      No. I'm saying that when the occupancy calculation is performed what you want to do is maximise the number of wavefronts (because that gives you good use of registers, given that you don't use local memory). You are limited either by the number of wavefronts you can run, which will be limited on SI by the number 40, or by the number of registers each uses. Or alternatively you are limited by the number of workgroups you can run. On earlier architectures, like Llano, there was a limit on the number of workgroups. On SI the limit is on the number of barrier resources the runtime can allocate, and a barrier resource is needed for any workgroup with more than one wavefront (single wave groups optimise away barriers because there is no synchronization to perform). That change is the reason you will see a difference, I think. The workgroup limit no longer applies on SI so you hit 100%, but on Llano you can only have 8 groups, hence 8 waves, and that isn't using enough of the resources to give you a good occupancy ratio.

                  • Re: Kernel occupancy and workgroup size
                    cadorino

                    Thank you very much. This is very clear. What I'm still wondering is what happens if I run more wavefronts than the hardware can handle in flight. With 256K threads I have 4K wavefronts. On Llano, with 5 CUs, this means about 819 wavefronts per CU. Even if the limit is 40, the kernel still runs. How does the GPU handle execution of such a number of wavefronts? I suppose it uses batches: it runs 40 wavefronts, waits for them to finish, then schedules the next 40. In that case, I wonder what the performance impact is of having more wavefronts than the maximum allowed by SI.
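                    As a sanity check on those numbers (a quick sketch; the 5-CU count for the A8-3850's GPU is taken from the post above):

```python
# Check the wavefront arithmetic from the post above.
WAVE_SIZE = 64
global_size = 256 * 1024   # 256K work-items
num_cus = 5                # A8-3850 (Llano) GPU, per the post

total_waves = global_size // WAVE_SIZE   # wavefronts launched in total
waves_per_cu = total_waves / num_cus     # wavefronts per CU, on average
print(total_waves, waves_per_cu)  # 4096 819.2
```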

                    • Re: Kernel occupancy and workgroup size
                      cadorino

                      Well then, I wonder: what does it mean that "the wavefronts per CU are limited to X" (e.g. 40)? And what happens if I have more than 40 wavefronts per CU, given the global/work-group size I set up? Thank you again!

                        • Re: Kernel occupancy and workgroup size
                          settle

                          The limit of wavefronts per CU refers to the number of wavefronts that are considered active (or resident, I forget the exact term). Active wavefronts are those that are issuing fetch, ALU, or store instructions. If there are more wavefronts than the limit allows, the excess wavefronts wait in a queue until an active wavefront is retired, at which point a queued wavefront becomes active.
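                          A toy model of that queueing behavior might look like this (purely illustrative Python, not how the hardware scheduler actually works; the retirement order here is arbitrary):

```python
from collections import deque

def run_waves(total_waves, max_active):
    """Toy model: excess wavefronts queue until an active one retires."""
    pending = deque(range(total_waves))  # wavefronts waiting to become active
    active = set()                       # resident wavefronts (at most max_active)
    retired = []
    while pending or active:
        # Fill free slots on the CU from the queue.
        while pending and len(active) < max_active:
            active.add(pending.popleft())
        # Pretend one active wavefront issues its last instruction and retires,
        # freeing a slot for the next queued wavefront.
        retired.append(active.pop())
    return retired

retired = run_waves(total_waves=4096, max_active=40)
print(len(retired))  # 4096 -- every wavefront eventually runs
```

                          The point of the model is just that launching more wavefronts than the resident limit is fine; they simply wait their turn rather than being rejected.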