Hello all,
I have a pretty heavy kernel in terms of GPR usage: it uses 948 scratch registers. I suspect that some overflow happens in this kernel, because while it executes the heavy section (where I have a high stack workload), flow control decisions appear to be shared across the whole wavefront. For example, if there is a flow control return, the whole wavefront returns with it as soon as any thread (thread #0?) in the wavefront hits this return. I don't use any local memory inside the kernel.
This kernel works correctly on both Nvidia and Intel architectures. My GPU is one of the HD7900 series, with the latest APP SDK and drivers.
Has anyone had similar problems with scratchpad size? I can attach the compiled assembly code from APP Profiler if necessary.
Anton
Just to confirm: when I move my workload from GPRs to global memory (which is of course slower on many architectures), the bug disappears and the kernel works correctly.
It also works correctly (but even slower) if I reduce the local workgroup size to 1x1x1.
So it seems like there is some overflow in the scratchpad with ATI GPUs.
Anton
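For readers following along, the global-memory workaround described above can be sketched roughly as follows. This is plain C emulating work-items on the host, not the actual kernel (which was not posted); `STACK_DEPTH`, `work_item`, and `run_work_items` are hypothetical names chosen for illustration. The idea is that each work-item takes its own slice of one large buffer, indexed by its global ID, instead of relying on a large private array that the compiler may spill to scratch registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define STACK_DEPTH 64  /* hypothetical per-work-item stack depth */

/* Emulates one work-item: pushes 1..n onto an explicit stack that lives in
 * this work-item's slice of a shared buffer, then pops and sums the values.
 * Returns -1 if the stack would overflow; the early return affects only
 * this work-item, which is the behavior the kernel is expected to have. */
int work_item(size_t gid, int *global_stack, int n)
{
    assert(n >= 0);
    int *stack = global_stack + gid * STACK_DEPTH;  /* slice owned by gid */
    int sp = 0;

    for (int i = 1; i <= n; ++i) {
        if (sp == STACK_DEPTH)
            return -1;          /* bounds check instead of silent overflow */
        stack[sp++] = i;
    }

    int sum = 0;
    while (sp > 0)
        sum += stack[--sp];
    return sum;
}

/* Runs num_items emulated work-items one after another and stores each
 * result in results[gid]. Returns 0 on success, -1 if allocation fails. */
int run_work_items(size_t num_items, int n, int *results)
{
    int *global_stack = malloc(num_items * STACK_DEPTH * sizeof *global_stack);
    if (global_stack == NULL)
        return -1;

    for (size_t gid = 0; gid < num_items; ++gid)
        results[gid] = work_item(gid, global_stack, n);

    free(global_stack);
    return 0;
}
```

In a real OpenCL kernel the slice base would presumably be computed from `get_global_id(0)`, with the buffer passed as a `__global` pointer argument sized as the number of work-items times `STACK_DEPTH`. The explicit bounds check is an addition for illustration; it makes an overflow visible instead of letting it corrupt a neighbouring slice.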
Hi Anton,
I think that's right. You should avoid using scratch registers, since they have a bad effect on performance (so I have been told). I think Nvidia and Intel would also have this problem if the workload were big enough.
Hi dear Wenju,
Thank you for your answer. I still believe this behavior is wrong, though. As I mentioned, both Nvidia and Intel handle this situation (when the scratchpad is huge) gracefully. I also believe the trade-off between performance and the maintenance burden of the code should be the developer's decision.
I have heard that AMD is working on this, but the results are not good. Maybe it's just a rumour!