I've analyzed an OpenCL kernel using CodeXL and I am quite happy with the register usage: on GCN 1.0/1.1 devices the maximum of 10 wavefronts per SIMD can be queued, so hopefully memory latencies can be hidden efficiently.
However, on GCN 1.2 devices (Tonga), SGPR usage exploded: while the same kernel consumes 32 SGPRs on Cape Verde, it requires 94 SGPRs on Tonga, which limits the kernel to 5 parallel waves per SIMD (screenshots attached).
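As a rough sanity check of those occupancy figures, here is a minimal sketch of the arithmetic. The function name `waves_per_simd` and the budget of 512 SGPRs per SIMD are my assumptions, chosen because they reproduce the numbers CodeXL reports; the real per-generation limits and the allocation granularity may differ and should be checked against AMD's documentation.

```python
# Rough occupancy estimate from SGPR usage alone.
# ASSUMPTION: 512 SGPRs available per SIMD and a hardware cap of
# 10 wavefronts per SIMD; allocation granularity is ignored.

def waves_per_simd(sgprs_per_wave, sgpr_budget=512, max_waves=10):
    """Return how many wavefronts fit on one SIMD given SGPR usage."""
    if sgprs_per_wave <= 0:
        raise ValueError("sgprs_per_wave must be positive")
    # Each resident wavefront needs its own set of SGPRs, so the
    # budget divided by the per-wave usage bounds the wave count.
    return min(max_waves, sgpr_budget // sgprs_per_wave)

print(waves_per_simd(32))  # Cape Verde: 32 SGPRs per wave
print(waves_per_simd(94))  # Tonga: 94 SGPRs per wave
```

Under these assumptions, 32 SGPRs leaves the kernel at the 10-wave cap, while 94 SGPRs drops it to 5 waves, which matches what the profiler shows.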
Any idea why the same code running on Tonga requires almost 3 times the SGPRs?
Have there been architectural changes in Tonga, or are there pitfalls when it comes to SGPR usage?
Thank you in advance, Clemens