Each compute unit provides 16384 general-purpose registers, and each register holds four 32-bit values, for a total of 256 KB of register storage per compute unit. These registers are shared among all active wavefronts on the compute unit; each kernel allocates only the registers it needs from this shared pool. This is unlike a CPU, where each thread is assigned a fixed set of architectural registers. However, a kernel that uses many registers depletes the shared pool and eventually forces the hardware to reduce the maximum number of active wavefronts.
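The trade-off above can be sketched as simple arithmetic: divide the per-CU register pool by each wavefront's register footprint. This is a hedged illustration using the numbers quoted in this thread; the per-work-item register counts are hypothetical, not taken from a real kernel.

```python
# Occupancy arithmetic sketch for a Cypress-class CU.
# Numbers come from the discussion above; register counts per
# work-item below are made up for illustration.
REG_POOL_BYTES = 256 * 1024   # 256 KB register file per CU
BYTES_PER_REG = 16            # one GP register = 4 x 32-bit values
WAVEFRONT_SIZE = 64           # work-items per wavefront

def max_wavefronts(regs_per_work_item):
    """Wavefronts whose registers fit in the shared per-CU pool."""
    bytes_per_wavefront = regs_per_work_item * WAVEFRONT_SIZE * BYTES_PER_REG
    return REG_POOL_BYTES // bytes_per_wavefront

print(max_wavefronts(8))    # light kernel: many wavefronts fit
print(max_wavefronts(64))   # register-hungry kernel: far fewer fit
```

The point is only the shape of the curve: quadrupling the per-work-item register use cuts the number of resident wavefronts by the same factor.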
Wenju is correct that the GPU may reduce the number of active wavefronts because of these resource constraints.
Thank you for your answers.
I have a question that naturally follows from your answers. The number of wavefronts and workgroups per CU and per GPU is also limited by design: for example, 24.8 wavefronts per CU (on average) on the 5870.
Now, suppose I do a vector addition with two 16M-element float vectors (64 MB each). Each work-item sums two elements, so the global size is 16M work-items. This means 16M / 64 = 256K wavefronts.
If we ignore workgroup size, 256K wavefronts spread over 20 CUs means 256K / 20 ≈ 13K wavefronts per CU, which is far above the 24.8 limit.
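The arithmetic in the last two posts can be checked directly. This is just the counting from the example above, using the 5870-class figures quoted in this thread:

```python
# Wavefront counting for the 16M-element vector-add example.
N = 16 * 1024 * 1024   # 16M work-items, one per output element
WAVEFRONT_SIZE = 64    # work-items per wavefront on this hardware
NUM_CUS = 20           # compute units on a 5870-class GPU

wavefronts = N // WAVEFRONT_SIZE   # total wavefronts for the launch
per_cu = wavefronts // NUM_CUS     # naive even split across CUs

print(wavefronts)   # 256K total wavefronts
print(per_cu)       # ~13K per CU, versus a ~24.8 residency limit
```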
To summarize: working on big data often produces a number of wavefronts per CU much bigger than the design limit, yet the algorithms still run correctly. So how do OpenCL and the GPU infrastructure handle this? Do they split the kernel into identical sub-kernels operating sequentially on smaller amounts of data, or something else?
If the job doesn't fit on the device, the runtime just executes it in batches until it's finished. On NVIDIA this shows up in the profiler; I'm not sure about AMD.
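Conceptually, the batching notzed describes looks like a scheduler draining a pool of pending wavefronts, launching at most the residency limit at a time. This is a purely illustrative model, not how any vendor's front end is actually implemented; the 496-wavefront whole-GPU limit is taken from the per-CU figure quoted above (24.8 × 20 CUs).

```python
# Illustrative model only: the GPU front end conceptually drains the
# pending wavefronts in batches no larger than the residency limit.
def run_in_batches(total_wavefronts, resident_limit):
    """Count how many scheduling rounds a launch needs."""
    batches = 0
    remaining = total_wavefronts
    while remaining > 0:
        batch = min(remaining, resident_limit)
        remaining -= batch   # this batch runs to completion
        batches += 1
    return batches

# 256K wavefronts against a 496-wavefront whole-GPU residency limit
print(run_in_batches(256 * 1024, 496))
```

The key property is the one that matters for portability: the number of batches adapts to the device, so the same launch works on any residency limit.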
This is pretty important: for example, it is what allows the same code to run on any device.
cadorino, I agree with notzed's explanation.