I have a pretty fat (GPR-wise) kernel. It uses 948 scratch registers. I have a feeling that some overflow happens in this kernel, as during the execution of the fat place (where I have high stack workload) it shares flow control decisions per whole wavefront (e.g. if I have some flow control return, then the whole wavefront returns with it as soon as any thread (thread#0?) in the wavefront hits this return). I don't use any local memory inside the kernel.
This kernel works correctly on both Nvidia and Intel archs. My GPU is one of HD7900 series, latest APP SDK and drivers.
Did anyone have similar problems with scratchpad size? I can attach the compiled assembler code from APP Profiler if necessary.