I have a kernel that uses local memory for calculations.
In order to make things faster, I would like to use registers instead.
How do I go about doing this?
In OpenCL, local and private memory are selected with the address space qualifiers __local (or local) and __private (or private), respectively. An object declared without an address space qualifier is allocated in the generic address space; up to OpenCL 1.2, the generic address space for function arguments and for variables local to a function is __private.
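As a rough illustration of the qualifiers described above, the sketch below writes the same per-work-item computation twice: once with a __local staging array and once with an unqualified (and therefore __private) variable. The kernel names, WG_SIZE, and the arithmetic are hypothetical and not taken from the original kernel. The `#ifndef __OPENCL_VERSION__` shim is an assumption added only so the file also compiles as plain C for a serial host-side check; an OpenCL compiler skips it. The rewrite is valid here only because each work-item touches its own slot and never reads another work-item's data.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shim (not part of OpenCL): lets this file compile as plain C. */
#include <stddef.h>
#define __kernel
#define __global
#define __local
static size_t emu_gid, emu_lid;   /* set by the caller, one work-item at a time */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }
static size_t get_local_id(unsigned dim)  { (void)dim; return emu_lid; }
#endif

#define WG_SIZE 64  /* hypothetical work-group size */

/* Version 1: the intermediate value is staged in local memory (LDS). */
__kernel void scale_local(__global const float *in, __global float *out)
{
    __local float tmp[WG_SIZE];       /* one slot per work-item in the group */
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tmp[lid] = in[gid] * 2.0f;
    tmp[lid] += 1.0f;
    out[gid] = tmp[lid];
}

/* Version 2: same computation, but the intermediate value is a private
 * variable, which the compiler will try to keep in a register. */
__kernel void scale_private(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    float tmp = in[gid] * 2.0f;       /* no qualifier, so __private */

    tmp += 1.0f;
    out[gid] = tmp;
}
```

If work-items did exchange data through the array, the local version (with a barrier) would still be needed, so this substitution is not a general recipe.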
Because private memory (when held in registers) has higher bandwidth than local memory (held in the LDS), converting from local to private memory can improve performance. This works well as long as the kernel's register usage stays low, or at least below a certain limit. Beyond that, the conversion can hurt the performance of the kernel or of the whole program, for the following reasons:
1) During compilation, the OpenCL compiler tries to map private memory allocations to the GPU's pool of general-purpose registers (GPRs). If not enough GPRs are available, private memory is mapped to the “scratch” region, which has the same performance as global memory, so performance may degrade significantly.
2) The number of registers per compute unit (CU) and per SIMD is limited and depends on the GPU architecture (see Appendix D, "Device Parameters", in the AMD APP Programming Guide for device-specific limits). Excessive register usage can limit the number of active wavefronts per SIMD and/or CU (see the section "Resource Limits on Active Wavefronts" in Chapters 5 and 6 of the AMD APP Programming Guide) and so reduce overall GPU occupancy.
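To make the occupancy effect in point 2 concrete, here is a small host-side C sketch of the arithmetic. The figures used (256 vector registers available to each lane of a SIMD, and a cap of 10 resident wavefronts per SIMD) are assumptions modelled on GCN-class hardware, and the function name is hypothetical; check the device tables in the AMD APP Programming Guide for the actual limits of your GPU.

```c
/* Assumed GCN-like limits; real values depend on the device. */
#define VGPRS_PER_SIMD_LANE 256  /* vector registers each lane can hand out */
#define MAX_WAVES_PER_SIMD   10  /* assumed hardware cap on resident wavefronts */

/* Estimate how many wavefronts can be resident on one SIMD when each
 * work-item needs `vgprs_per_work_item` vector registers. Only the
 * register limit is modelled; LDS and other limits are ignored. */
static int max_waves_per_simd(int vgprs_per_work_item)
{
    if (vgprs_per_work_item <= 0)
        return MAX_WAVES_PER_SIMD;   /* registers are not the limiter */
    if (vgprs_per_work_item > VGPRS_PER_SIMD_LANE)
        return 0;                    /* does not fit; the compiler would have to spill */

    int waves = VGPRS_PER_SIMD_LANE / vgprs_per_work_item;
    return waves < MAX_WAVES_PER_SIMD ? waves : MAX_WAVES_PER_SIMD;
}
```

Under these assumptions, a kernel that needs 24 registers per work-item can keep 10 wavefronts resident per SIMD, while one that needs 128 can keep only 2, which leaves far less work available to hide memory latency.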
Thanks, Dipak. It is too bad that registers don't get spilled into local memory.
So, to clarify, any local variables are allocated in registers? Is it then good
to reduce the number of variables in a kernel?
Thank you, you helped the needy
As you know, the size of local memory is limited, so spilled registers are allocated in global memory.
As I mentioned, local variables are usually allocated in registers, but this is not true for variables such as large structures or arrays. Reducing the number of variables in a kernel (those most likely to be placed in registers) helps to minimize register usage, so higher GPU occupancy can be achieved.