Hello!
Since driver 13.1, the AMD OpenCL compiler has been rather sparing in allocating registers for my code. The result is massive register spilling and a roughly 3x drop in performance.
When compiled with 12.11 or older, it would use 244 registers with no spilling. That still allowed full utilization of the GPU, since without register spilling my code is almost entirely arithmetic bound while having enough wavefronts to keep the GPU occupied.
But when compiled with 13.1 or later it uses only 131, spilling a lot of registers.
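For anyone who wants to sanity-check the occupancy side of this, here is a rough sketch of how the number of resident wavefronts per SIMD could be estimated from a register count. It assumes the 244 and 131 figures are vector register (VGPR) counts and uses the commonly quoted GCN limits of 256 VGPRs per work-item, 10 wavefronts per SIMD, and allocation in blocks of 4; none of those limits are stated in this thread, and `waves_per_simd` is just an illustrative name.

```c
#include <assert.h>

/* Assumed GCN limits (not stated in the thread): 256 VGPRs available per
   work-item on each SIMD, at most 10 resident wavefronts per SIMD, and
   VGPRs allocated in blocks of 4. */
enum {
    VGPRS_PER_SIMD_LANE = 256,
    MAX_WAVES_PER_SIMD  = 10,
    VGPR_GRANULARITY    = 4
};

/* Estimate how many wavefronts fit on one SIMD when the kernel uses the
   given number of VGPRs per work-item. Considers register pressure only;
   LDS and scalar register limits are ignored. */
static int waves_per_simd(int vgprs_per_work_item)
{
    assert(vgprs_per_work_item > 0 &&
           vgprs_per_work_item <= VGPRS_PER_SIMD_LANE);

    /* Round the register count up to the allocation granularity. */
    int allocated = (vgprs_per_work_item + VGPR_GRANULARITY - 1)
                    / VGPR_GRANULARITY * VGPR_GRANULARITY;

    int waves = VGPRS_PER_SIMD_LANE / allocated;
    return waves > MAX_WAVES_PER_SIMD ? MAX_WAVES_PER_SIMD : waves;
}
```

Under these assumptions, `waves_per_simd(244)` and `waves_per_simd(131)` give the same result, so limiting the kernel to 131 registers would not by itself raise occupancy; the count would have to drop to 128 or below before a second wavefront fits.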
Are there any compiler flags I can pass to allow or force the compiler to be more liberal in allocating registers?
My apologies in advance if I've missed any documents specifying this, or if this question has already been answered (I searched but couldn't find any closely similar questions).
The specific code in question can be found here:
https://github.com/madsbuvi/MTY_CL/blob/master/readme.md
Running it through CodeXL should give a complete .cl file.
The loop at lines 246-250 in gpu.cl and the whole of des.cl/sboxes.cl are the relevant runtime-critical sections.
Performance dropped from 125-130 million to ~40 million hashes/second with the new drivers. Edit: this is on a 7850; I forgot to mention that.
(The code also broke completely and generated wrong hashes with driver version 13.1, but this has been fixed in the newest beta driver. Mentioning this in case it might be related.)
Edit:
http://devgurus.amd.com/message/1286728
This seems somewhat relevant in terms of losing performance, but it does not seem to be caused by register spilling.
Try adding
__attribute__((work_group_size_hint(64, 1, 1))) or __attribute__((reqd_work_group_size(64, 1, 1)))
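In case the placement is unclear, here is a minimal sketch of where the attribute goes and how the host side might size the launch to match. The kernel name, its arguments, and the `round_up_global_size` helper are hypothetical placeholders, not the code from this thread; the kernel source is held in a string the way host programs usually pass it to the OpenCL runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel source: only the attribute placement matters here.
   The bounds check covers the padding work-items added by rounding the
   global size up. */
static const char *kernel_src =
    "__kernel __attribute__((reqd_work_group_size(64, 1, 1)))\n"
    "void example_kernel(__global uint *out, uint n)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    if (gid < n)\n"
    "        out[gid] = (uint)get_local_id(0);\n"
    "}\n";

/* Round the requested number of work-items up to a multiple of the
   work-group size, since OpenCL 1.x requires the global size to be
   evenly divisible by the local size. */
static size_t round_up_global_size(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = work_items % local_size;
    return remainder == 0 ? work_items
                          : work_items + (local_size - remainder);
}
```

The rounded value would then be passed as the global work size to `clEnqueueNDRangeKernel`, with 64 as the local work size, so that the launch matches the size declared in the attribute.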
Thanks for reporting it. I will try to reproduce it at our end. Is the testcase 32-bit or 64-bit? It contains DLLs, so I assume you are using Windows. Win7 or Win8?
Thank you, but it made no difference.
It is compiled as 32-bit and links to the 32-bit libraries. I am running a 64-bit version of Windows 7.
By the way, I'm sure you have done this already, but just to check: after adding the work-group hint, you should launch the kernel with 64 work-items per work-group. I hope you did that as well.
Yes, I have of course tried this.
Have you marked this as the correct answer by accident? I can un-mark it if you need.
Haha, yes, sorry about that.
Well, I see the same with my app too.
With 13.1 my app started to cause driver restarts. Comparing the ISA for the kernel that runs too long, I found that under 13.1 it uses only 5 registers, while under 12.8 (where there are no driver restarts) it uses 12 GPRs:
SQ_PGM_RESOURCES:NUM_GPRS = 5
vs
SQ_PGM_RESOURCES:NUM_GPRS = 12
So register spilling is inevitable under 13.1, and it slows the kernel down so much that it causes driver restarts.
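To make this kind of comparison repeatable across driver versions, the register count could be pulled out of a saved ISA dump with a few lines of C. This is only a sketch: it assumes the dump is available as text in memory and that the entry looks exactly like the `SQ_PGM_RESOURCES:NUM_GPRS = N` lines quoted above; `parse_num_gprs` is an illustrative name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value of the first "SQ_PGM_RESOURCES:NUM_GPRS = N" entry in
   an ISA dump held in memory, or -1 if no such entry is found. */
static int parse_num_gprs(const char *isa_text)
{
    static const char key[] = "SQ_PGM_RESOURCES:NUM_GPRS";
    assert(isa_text != NULL);

    const char *hit = strstr(isa_text, key);
    int value;

    if (hit == NULL)
        return -1;

    /* Skip optional whitespace around '=' and read the integer. */
    if (sscanf(hit + strlen(key), " = %d", &value) != 1)
        return -1;

    return value;
}
```

Running it on the dumps from two driver versions and comparing the returned values would flag a change like the 12 to 5 drop without reading the ISA by hand.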