10 Replies Latest reply on Oct 10, 2012 2:45 AM by spectral

    Kernel occupancy and workgroup size

    cadorino

      Hi everybody.

      I'm developing a benchmark to estimate kernel completion time on integrated and discrete GPUs as a function of the number of operations executed per byte transferred.

      The kernel is very simple and "useless". Simply put, each thread reads the same constant argument and adds this value to an accumulator variable a certain number of times.

       

      What surprises me a little is how the kernel occupancy changes as I vary the global size and the work-group size.

      In particular, I set the global size to 256K and the work-group size to 64. On the 7970 the occupancy is 100%. On the A8-3850 (Llano) the occupancy is 25%. If I double the work-group size (to 128), the occupancy of the integrated GPU becomes 50%.

      Can you help me understand why this is?

       

      Thank you!

        • Re: Kernel occupancy and workgroup size
          LeeHowes

          The Evergreen GPUs, like the one in Llano, were I think limited to a small number of workgroups per core: 8, maybe. SI is limited by either the number of wavefronts (40 per core) or the number of barriers (8). Note that barriers == workgroups for groups with more than one wavefront, but a group with only one wavefront uses 0 barriers, so the barrier limit becomes irrelevant and you are wavefront limited. My guess here is that you use few resources per wave, and with 64 work-items per group you have one wave per group, giving only 8 waves per core, which doesn't cover latency. Double the group size and you double that to 16, while on SI you always had a lot more waves, so it didn't matter.
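          To make that arithmetic concrete, here is a rough sketch of the occupancy calculation (Python; the 8-group and 40-wave limits are from the description above, and the 32-wave cap for Llano is an assumption inferred from the 25% figure in the question, not an official spec):

```python
# Rough occupancy sketch based on the per-CU limits described above.
# Assumed limits (from this thread, not official documentation):
#   Evergreen (Llano): at most 8 workgroups resident per CU,
#                      ~32 wavefronts per CU (inferred from the 25% figure)
#   SI (7970):         at most 40 wavefronts resident per CU
WAVE_SIZE = 64

def resident_waves(workgroup_size, max_groups_per_cu, max_waves_per_cu):
    """Resident wavefronts per CU for a kernel that uses few other resources."""
    waves_per_group = max(1, workgroup_size // WAVE_SIZE)
    # Group-count limit vs wavefront-count limit: the smaller one wins.
    by_group_limit = max_groups_per_cu * waves_per_group
    return min(by_group_limit, max_waves_per_cu)

# Llano, workgroup size 64:  8 groups x 1 wave = 8 waves  -> 8/32  = 25%
llano_64 = resident_waves(64, max_groups_per_cu=8, max_waves_per_cu=32)
# Llano, workgroup size 128: 8 groups x 2 waves = 16 waves -> 16/32 = 50%
llano_128 = resident_waves(128, max_groups_per_cu=8, max_waves_per_cu=32)
print(llano_64 / 32, llano_128 / 32)  # 0.25 0.5
```

          With these assumed caps the model reproduces the 25% and 50% figures from the question.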

            • Re: Kernel occupancy and workgroup size
              cadorino

              The fact is that the kernel is really simple. There are actually no memory accesses and no barriers. Here it is:

              kernel void runOps(global float* input, global float* output, uint iterations, uint return_id, float value)
              {
                  uint gid = get_global_id(0);
                  float temp = 0;
                  float element = value;

                  for(uint i = 0; i < iterations; i++) {
                      temp += element;
                  }

                  if(gid == return_id)
                      output[0] = temp;
              }

               

              Moreover, I'm trying to understand why the APU's GPU takes 7 milliseconds to run it when iterations = 512. That seems like too much time.

                • Re: Kernel occupancy and workgroup size
                  LeeHowes

                  It'll be control-flow bound. Performance wasn't something you brought up in your original post, though. Looking at this purely from an occupancy point of view, you will always be bounded by something, and in this case likely by the number of workgroups, which will indirectly limit your number of wavefronts and hence reduce your occupancy.

                   

                  There is a chapter on this in the new edition of Heterogeneous Computing with OpenCL, btw, which should be released soon.

                • Re: Kernel occupancy and workgroup size
                  cadorino

                  I'm reasoning about what you say: "barriers == workgroups for > 1 wavefront in the group".

                  So, all the wavefronts of a workgroup must be executed before a wavefront of another group scheduled on the same CU can run. Is this what you mean?

                    • Re: Kernel occupancy and workgroup size
                      LeeHowes

                      No. I'm saying that when the occupancy calculation is performed what you want to do is maximise the number of wavefronts (because that gives you good use of registers, given that you don't use local memory). You are limited either by the number of wavefronts you can run, which will be limited on SI by the number 40, or by the number of registers each uses. Or alternatively you are limited by the number of workgroups you can run. On earlier architectures, like Llano, there was a limit on the number of workgroups. On SI the limit is on the number of barrier resources the runtime can allocate, and a barrier resource is needed for any workgroup with more than one wavefront (single wave groups optimise away barriers because there is no synchronization to perform). That change is the reason you will see a difference, I think. The workgroup limit no longer applies on SI so you hit 100%, but on Llano you can only have 8 groups, hence 8 waves, and that isn't using enough of the resources to give you a good occupancy ratio.

                  • Re: Kernel occupancy and workgroup size
                    cadorino

                    Thank you very much. This is very clear. What I'm still wondering is what happens if I run more wavefronts than the hardware can handle in flight. With 256K threads I have 4K wavefronts. On Llano, with 5 CUs, this means about 819 wavefronts per CU. Even if the limit is 40, the kernel still runs. How does the GPU handle execution of such a number of wavefronts? I suppose it uses batches: it runs 40 wavefronts, waits for them to finish, then schedules the next 40. In that case, I wonder what the performance impact is of having more wavefronts than the maximum allowed by SI.
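                    As a sanity check on those numbers (a quick sketch; the 5-CU count for the A8-3850's GPU is taken from the post above):

```python
# Check the wavefront arithmetic from the post above.
WAVE_SIZE = 64
global_size = 256 * 1024   # 256K work-items
num_cus = 5                # A8-3850 (Llano) GPU, per the post

total_waves = global_size // WAVE_SIZE   # wavefronts launched in total
waves_per_cu = total_waves / num_cus     # wavefronts per CU, on average
print(total_waves, waves_per_cu)  # 4096 819.2
```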

                    • Re: Kernel occupancy and workgroup size
                      cadorino

                      Well then, I wonder: what does it mean that "the wavefronts per CU are limited to X" (e.g. 40)? And what happens if I have more than 40 wavefronts per CU, given the global/work-group size I set up? Thank you again!

                        • Re: Kernel occupancy and workgroup size
                          settle

                          The limit of wavefronts per CU refers to the number of wavefronts that are considered active (or resident, I forget the exact term). Active wavefronts are those that are issuing fetch, ALU, or store instructions. If there are more wavefronts than the limit allows, the excess wavefronts wait in a queue until an active wavefront is retired, at which point a queued wavefront becomes active.
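                          A toy model of that queueing behavior might look like this (purely illustrative Python, not how the hardware scheduler actually works; the retirement order here is arbitrary):

```python
from collections import deque

def run_waves(total_waves, max_active):
    """Toy model: excess wavefronts queue until an active one retires."""
    pending = deque(range(total_waves))  # wavefronts waiting to become active
    active = set()                       # resident wavefronts (at most max_active)
    retired = []
    while pending or active:
        # Fill free slots on the CU from the queue.
        while pending and len(active) < max_active:
            active.add(pending.popleft())
        # Pretend one active wavefront issues its last instruction and retires,
        # freeing a slot for the next queued wavefront.
        retired.append(active.pop())
    return retired

retired = run_waves(total_waves=4096, max_active=40)
print(len(retired))  # 4096 -- every wavefront eventually runs
```

                          The point of the model is just that launching more wavefronts than the resident limit is fine; they simply wait their turn rather than being rejected.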