Each compute unit provides 16384 general-purpose registers, and each register holds four 32-bit values, for a total of 256 KB of register storage per compute unit. These registers are shared among all active wavefronts on the compute unit; each kernel allocates only the registers it needs from this shared pool. This is unlike a CPU, where each thread is assigned a fixed set of architectural registers. However, a kernel that uses many registers depletes the shared pool and eventually forces the hardware to reduce the maximum number of active wavefronts.
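The trade-off above can be sketched as simple arithmetic: divide the per-CU register pool by each wavefront's register footprint. This is a hedged illustration using the numbers quoted in this thread; the per-work-item register counts are hypothetical, not taken from a real kernel.

```python
# Occupancy arithmetic sketch for a Cypress-class CU.
# Numbers come from the discussion above; register counts per
# work-item below are made up for illustration.
REG_POOL_BYTES = 256 * 1024   # 256 KB register file per CU
BYTES_PER_REG = 16            # one GP register = 4 x 32-bit values
WAVEFRONT_SIZE = 64           # work-items per wavefront

def max_wavefronts(regs_per_work_item):
    """Wavefronts whose registers fit in the shared per-CU pool."""
    bytes_per_wavefront = regs_per_work_item * WAVEFRONT_SIZE * BYTES_PER_REG
    return REG_POOL_BYTES // bytes_per_wavefront

print(max_wavefronts(8))    # light kernel: many wavefronts fit
print(max_wavefronts(64))   # register-hungry kernel: far fewer fit
```

The point is only the shape of the curve: quadrupling the per-work-item register use cuts the number of resident wavefronts by the same factor.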
Wenju is correct that the GPU may reduce the number of active wavefronts because of these resource constraints.
Thank you for your answers.
I have a question that naturally follows from your answers. The number of wavefronts and workgroups per CU and per GPU is also limited by design: for example, 24.8 wavefronts per CU (on average) on the 5870.
Now, suppose I do a vector addition with two 16M-element float vectors (64 MB each). Each work-item sums two elements, so the global size is 16M work-items. This means 16M / 64 = 256K wavefronts.
If we ignore workgroup size, 256K wavefronts spread over 20 CUs means 256K / 20 ≈ 13K wavefronts per CU, which is far above the 24.8 limit.
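The arithmetic in the last two posts can be checked directly. This is just the counting from the example above, using the 5870-class figures quoted in this thread:

```python
# Wavefront counting for the 16M-element vector-add example.
N = 16 * 1024 * 1024   # 16M work-items, one per output element
WAVEFRONT_SIZE = 64    # work-items per wavefront on this hardware
NUM_CUS = 20           # compute units on a 5870-class GPU

wavefronts = N // WAVEFRONT_SIZE   # total wavefronts for the launch
per_cu = wavefronts // NUM_CUS     # naive even split across CUs

print(wavefronts)   # 256K total wavefronts
print(per_cu)       # ~13K per CU, versus a ~24.8 residency limit
```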
To summarize: working on big data often produces a number of wavefronts per CU much bigger than the design limit, yet the algorithms still run correctly. So how do OpenCL and the GPU infrastructure handle this? Do they split the kernel into identical sub-kernels operating sequentially on smaller amounts of data, or something else?
If the job doesn't fit on the device, the runtime just executes it in batches until it's finished. On NVIDIA this shows up in the profiler; I'm not sure about AMD.
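Conceptually, the batching notzed describes looks like a scheduler draining a pool of pending wavefronts, launching at most the residency limit at a time. This is a purely illustrative model, not how any vendor's front end is actually implemented; the 496-wavefront whole-GPU limit is taken from the per-CU figure quoted above (24.8 × 20 CUs).

```python
# Illustrative model only: the GPU front end conceptually drains the
# pending wavefronts in batches no larger than the residency limit.
def run_in_batches(total_wavefronts, resident_limit):
    """Count how many scheduling rounds a launch needs."""
    batches = 0
    remaining = total_wavefronts
    while remaining > 0:
        batch = min(remaining, resident_limit)
        remaining -= batch   # this batch runs to completion
        batches += 1
    return batches

# 256K wavefronts against a 496-wavefront whole-GPU residency limit
print(run_in_batches(256 * 1024, 496))
```

The key property is the one that matters for portability: the number of batches adapts to the device, so the same launch works on any residency limit.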
This is pretty important: for example, it is what allows the same code to run on any device.
cadorino, I agree with notzed's explanation.