5 Replies. Latest reply on Jul 6, 2012 3:37 AM by Wenju

    Are wavefronts reused?

    cadorino

      Hi,

      I'm collecting counters with GPUPerfAPI 2.9 for a vector addition (Saxpy) kernel on the A8 integrated GPU.

      The kernel is very simple: y[i] = a * x[i] + y[i].
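
      For reference, the kernel has roughly the following shape (a minimal sketch; the identifiers are illustrative, not my exact source):

          /* Minimal OpenCL Saxpy kernel sketch: one work item per element. */
          __kernel void saxpy(const float a,
                              __global const float *x,
                              __global float *y)
          {
              size_t i = get_global_id(0);  /* one element per work item */
              y[i] = a * x[i] + y[i];
          }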

      As the input size increases from 64KB to 4MB, the number of wavefronts increases accordingly. However, when the input size reaches 8MB, the number of wavefronts drops.

      So I wonder whether this means that wavefronts are reused.

      In addition, other GPU counters also drop in the middle of the input-size range, as you can see in the file linked below.

       

      Can you help explain this behavior?

      Thank you very much!

       

      Saxpy counters: http://www.gabrielecocco.it/Saxpy.htm

        • Re: Are wavefronts reused?
          Wenju

          Hi cadorino,

          Each compute unit provides 16384 GP registers, and each register contains 4x32-bit values, so the total register file is 256 KB per compute unit. These registers are shared among all active wavefronts on the compute unit; each kernel allocates only the registers it needs from the shared pool. This is unlike a CPU, where each thread is assigned a fixed set of architectural registers. However, a kernel that uses many registers depletes the shared pool and eventually causes the hardware to throttle the maximum number of active wavefronts.
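
          As a back-of-the-envelope sketch of why this throttling happens (the 16384-register figure and 64-wide wavefronts are from above; the per-CU wavefront cap here is an illustrative assumption, not an exact hardware number):

              /* Rough occupancy estimate: 16384 GPRs per CU are shared by all
               * active wavefronts; a 64-wide wavefront whose kernel uses `gprs`
               * registers per work item consumes 64 * gprs registers.
               * The cap of 32 wavefronts per CU is an illustrative assumption. */
              #include <stdio.h>

              int main(void)
              {
                  const int total_gprs  = 16384;  /* registers per compute unit */
                  const int wave_size   = 64;     /* work items per wavefront   */
                  const int hw_wave_cap = 32;     /* assumed per-CU limit       */

                  for (int gprs = 4; gprs <= 64; gprs *= 2) {
                      int waves = total_gprs / (wave_size * gprs);
                      if (waves > hw_wave_cap)
                          waves = hw_wave_cap;
                      printf("%2d GPRs/work-item -> up to %2d wavefronts/CU\n",
                             gprs, waves);
                  }
                  return 0;
              }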

          • Re: Are wavefronts reused?
            plohrmann

            Hello cadorino,

             

            Wenju is correct that the GPU may be reducing the number of active wavefronts due to resource requirements and constraints.

            • Re: Are wavefronts reused?
              cadorino

              Thank you for your answers.

              I have a question that naturally follows from your answers. The number of wavefronts and workgroups per CU and per GPU is also limited by design: for example, 24.8 wavefronts per CU on the 5870.
              Now, suppose I do a vector addition with two 16M-float vectors (64 MB each). Each work item sums one pair of elements, so the global size is 16M work items. With a wavefront size of 64, this means 16M / 64 = 256K wavefronts.

              If we ignore the workgroup size, 256K wavefronts means 256K / 20 ≈ 13K wavefronts per CU, which is far beyond the 24.8 limit.
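
              The same arithmetic in a few lines of C, just to make the numbers explicit (nothing here is measured; it only restates the calculation above):

                  /* Restating the wavefront arithmetic above for an HD 5870. */
                  #include <stdio.h>

                  int main(void)
                  {
                      const long global_size = 16L * 1024 * 1024; /* 16M work items */
                      const int  wave_size   = 64;  /* work items per wavefront */
                      const int  num_cus     = 20;  /* compute units on a 5870  */

                      long total_waves  = global_size / wave_size;  /* 256K */
                      long waves_per_cu = total_waves / num_cus;    /* ~13K */
                      printf("total wavefronts: %ld, per CU: %ld\n",
                             total_waves, waves_per_cu);
                      return 0;
                  }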

               

              Summarizing, working on big data often leads to a high number of wavefronts per CU, far beyond the design limit. Nevertheless, the algorithms run correctly. So, how do OpenCL and the GPU infrastructure handle this? Do they split the kernel into identical sub-kernels operating on smaller (sequential) amounts of data, or something else?