AFAIK, although each thread processor has 256 registers, the maximum number of private GPRs that a single thread can use is 123. This is because the hardware ISA uses only 7 bits for GPR addressing, and (according to the document) at least 4 GPRs are used as cluster temporary registers.
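Therefore, unless LDS usage is the limiting factor, you will always get NumWaveFrontPerSIMD > 1: even at the 123-GPR maximum, two wavefronts fit within the 256 registers (2 × 123 = 246).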
To fully utilize the GPU, two wavefronts need to execute in parallel. The compiler is therefore limited to allocating half of the available registers to a single wavefront, so that at least two wavefronts can always be executed.
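
As a rough illustration of the arithmetic above, here is a minimal Python sketch. It assumes that the register-limited wavefront count is simply the 256 registers divided by the GPRs each thread uses, rounded down; that formula, the function name `wavefronts_per_simd`, and the constant names are my own assumptions for illustration, not something taken from the documentation. It also ignores LDS and any other limits.

```python
# Figures quoted in the post above.
REGISTERS_PER_THREAD_PROCESSOR = 256
MAX_GPRS_PER_THREAD = 123


def wavefronts_per_simd(gprs_per_thread: int) -> int:
    """Estimate how many wavefronts fit, looking at GPR usage only.

    Assumption: the register file is split evenly, so the count is
    floor(256 / GPRs used per thread). LDS usage is not modelled.
    """
    if not 1 <= gprs_per_thread <= MAX_GPRS_PER_THREAD:
        raise ValueError("GPR count must be between 1 and 123")
    return REGISTERS_PER_THREAD_PROCESSOR // gprs_per_thread


# Worst case: a kernel that uses the maximum number of GPRs per thread.
worst_case = wavefronts_per_simd(MAX_GPRS_PER_THREAD)
print(worst_case)
```

Under this assumption, the worst case still leaves room for two wavefronts, which matches the NumWaveFrontPerSIMD > 1 claim. A kernel that uses fewer GPRs per thread would leave room for more.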