OpenCL

fancyix · 11-16-2018

If I declare an array like uint[128], the driver spills it even when there are enough registers to hold the whole array.

Is there any way to make the compiler keep a big array in registers? Maybe some compile option?

dipak · 11-19-2018

Currently, there is no compiler option that directly controls register usage or register allocation. Generally, the compiler tries to optimize register usage so that more wavefronts can be in flight (which increases GPU occupancy). Also, without knowing the work-group size, the compiler must assume an upper-bound size to avoid allocating more registers per work-item than the hardware actually contains.

One way to hint the compiler is to specify a smaller work-group size at compile time (with the reqd_work_group_size kernel attribute). This allows the compiler to allocate more registers to the kernel, which can avoid spill code and improve performance. Please note that it is still a good idea to rewrite the algorithm to use fewer registers and to avoid allocating a large array in private memory.
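As a minimal sketch of that hint, a kernel could fix its work-group size as shown below. The kernel name `sum_private`, its arguments, and the size of 64 are made up for illustration; only the `reqd_work_group_size` attribute comes from the OpenCL C language itself. Whether the array actually ends up in registers is still up to the compiler.

```c
// OpenCL C kernel source (illustrative sketch, not a tuned implementation).
// Assumption: the kernel is always launched 1-D with exactly 64 work-items
// per work-group, so the compiler no longer has to assume a larger group.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void sum_private(__global const uint *in, __global uint *out)
{
    uint tmp[128];                 // private array from the question

    const size_t gid = get_global_id(0);

    // Fill the private array from global memory.
    for (int i = 0; i < 128; ++i)
        tmp[i] = in[gid * 128 + i];

    // Consume it again; the weights are arbitrary.
    uint acc = 0;
    for (int i = 0; i < 128; ++i)
        acc += tmp[i] * (uint)(i + 1);

    out[gid] = acc;
}
```

With this attribute in place, the host has to pass a matching local work size (here 64 x 1 x 1) to clEnqueueNDRangeKernel; a different local size is rejected with CL_INVALID_WORK_GROUP_SIZE.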

By the way, on GCN devices, the number of active wavefronts per SIMD = 256 / #VGPRs used by the kernel (assuming a 4-byte data type), rounded down and capped at 10 wavefronts per SIMD.

In the above case, if the array is allocated in registers, the kernel is likely to use more than 128 registers. That leaves only 1 wavefront per SIMD, i.e. an occupancy of only 10%.
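The arithmetic can be written down as a small host-side C helper. This is only a rough sketch of the formula quoted in this post: the function name is made up, the cap of 10 wavefronts per SIMD is inferred from "1 wavefront = 10%", and effects such as register allocation granularity, SGPR usage, or LDS usage are ignored.

```c
/* Constants from the formula above. MAX_WAVES_PER_SIMD is an assumption
   inferred from "1 wavefront per SIMD = 10% occupancy". */
enum { VGPR_BUDGET = 256, MAX_WAVES_PER_SIMD = 10 };

/* Wavefronts per SIMD as limited by VGPR usage alone (hypothetical helper). */
static int waves_per_simd(int vgprs_used)
{
    if (vgprs_used <= 0)
        return MAX_WAVES_PER_SIMD;   /* no VGPR pressure at all */
    if (vgprs_used > VGPR_BUDGET)
        return 0;                    /* does not fit; the rest must spill */

    int waves = VGPR_BUDGET / vgprs_used;   /* integer division rounds down */
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

For example, a kernel that needs 129 VGPRs gets waves_per_simd(129) = 1, while one that fits in 64 VGPRs gets 4.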

Thanks.

sp314 · 11-19-2018

I agree with dipak, and by the way, you did not specify what you're storing in this array. What exactly are you doing?

Still, here's an old trick - instead of storing the values in the array, try recalculating them every time.

Modern AMD GPUs are insanely fast, especially at integer ALU work, so my rationale is that instead of storing something in the array, it may be faster to recalculate the value whenever you need it.

Accessing memory is orders of magnitude slower than ALU ops. If calculating the value that you're storing doesn't take too many registers, calculating it again may let more waves run in parallel, resulting in better performance.
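To make the trick concrete, here is a small sketch in plain C (so it can be checked on the host; the same functions would work inside a kernel). Since I don't know what the array really holds, `value_at` is a made-up stand-in for the per-element computation, and the access pattern is arbitrary.

```c
#include <stdint.h>

/* Hypothetical per-index value: stands in for whatever the array would hold.
   It is cheap integer ALU work and needs only a couple of registers. */
static uint32_t value_at(uint32_t seed, uint32_t i)
{
    uint32_t x = seed + i * 0x9E3779B9u;
    x ^= x >> 16;
    x *= 0x85EBCA6Bu;
    x ^= x >> 13;
    return x;
}

/* Variant A: precompute all 128 values into a private array, then read
   them back in a scattered order. The array is what risks being spilled. */
static uint32_t sum_with_table(uint32_t seed)
{
    uint32_t table[128];
    for (uint32_t i = 0; i < 128; ++i)
        table[i] = value_at(seed, i);

    uint32_t acc = 0;
    for (uint32_t i = 0; i < 128; ++i)
        acc += table[(i * 37u) & 127u] ^ i;
    return acc;
}

/* Variant B: no array at all; recompute each value when it is needed. */
static uint32_t sum_recomputed(uint32_t seed)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < 128; ++i)
        acc += value_at(seed, (i * 37u) & 127u) ^ i;
    return acc;
}
```

Both variants return the same result for any seed; variant B trades a few extra ALU ops per element for not needing the 128-entry array. Whether that is actually faster depends on how expensive the real computation is, so it is worth measuring.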

Store array in regs?