With driver 13.1 and later, the AMD OpenCL compiler has become rather sparing in allocating registers for my code. The result is massive register spilling and roughly a 3x drop in performance.
When compiled with 12.11 or older, the code used 244 registers with no spilling. That was still few enough to achieve full utilization of the GPU: with no register spilling my code is almost entirely arithmetic-bound, and there are enough wavefronts to keep the GPU occupied.
When compiled with 13.1 or later, it uses only 131 registers and spills heavily.
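As a rough sanity check on those two register counts, here is a minimal C sketch of the occupancy arithmetic. It assumes GCN-class hardware limits (256 vector registers per SIMD lane, at most 10 resident wavefronts per SIMD, registers allocated in blocks of 4) and that the reported counts are vector registers (VGPRs). None of these figures were measured here, and the function name and constants are my own, purely for illustration.

```c
#include <assert.h>

/* Assumed GCN limits (e.g. for an HD 7850); not taken from any measurement. */
enum {
    VGPRS_PER_LANE     = 256, /* vector registers available per SIMD lane   */
    MAX_WAVES_PER_SIMD = 10,  /* hardware cap on resident wavefronts        */
    VGPR_GRANULARITY   = 4    /* registers are allocated in blocks of this  */
};

/* Hypothetical helper: how many wavefronts fit on one SIMD when each
 * work-item needs `vgprs` vector registers. */
int waves_per_simd(int vgprs)
{
    assert(vgprs > 0 && vgprs <= VGPRS_PER_LANE);

    /* Round the request up to the allocation granularity. */
    int allocated = (vgprs + VGPR_GRANULARITY - 1)
                    / VGPR_GRANULARITY * VGPR_GRANULARITY;

    /* The register file is split evenly between resident wavefronts. */
    int waves = VGPRS_PER_LANE / allocated;

    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

Under those assumptions, waves_per_simd(244) and waves_per_simd(131) both come out to 1, so dropping to 131 registers would not buy any extra occupancy; a second wavefront per SIMD would only fit at 128 registers or fewer.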
Are there any compiler flags I can pass to allow or force the compiler to be more liberal in allocating registers?
My apologies in advance if I've missed any documentation covering this, or if this question has already been answered (I searched but couldn't find any sufficiently similar questions).
The specific code in question can be found here:
Running it through CodeXL should give a complete .cl file.
The loop at lines 246-250 in gpu.cl and the whole of des.cl/sboxes.cl make up the relevant runtime-critical section.
Performance dropped from 125-130 million to ~40 million hashes/second with the new drivers. Edit: I forgot to mention that this is on a 7850.
(The code also broke completely and generated wrong hashes with driver version 13.1, but this has been fixed in the newest beta driver. Mentioning this in case it might be related.)
seems somewhat relevant in terms of lost performance, but does not seem to be caused by register spilling.