I have a kernel that uses 720 scratch registers. Performance is low, even though occupancy is 60%.
The kernel is very serial: no LDS usage or interaction between work items, just each work item processing
a serial algorithm. Each work item uses a private array of size 3K, so I think this is what is causing such
high scratch usage.
Besides reducing the size of this array, what else can I do to eliminate scratch registers?
I solved this: I moved the large buffer and another buffer into local memory. No more scratch registers, and I got a huge boost in performance.
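
For anyone who finds this later, here is a rough sketch of what that kind of change can look like in OpenCL C. It is not my actual kernel: the names serial_work, WG_SIZE and SCRATCH_LEN, the sizes, and the toy algorithm are all made up for illustration. The idea is that the per-work-item array becomes a slice of one __local array, interleaved by local id so that neighbouring work items touch neighbouring addresses. The small shim block is only there so the file also compiles as plain C for a host-side check; an OpenCL compiler skips it.

```c
/* Hypothetical sizes: 16 work items per group, 256 floats of scratch each. */
#define WG_SIZE     16
#define SCRATCH_LEN 256

/* Host-only shim (an assumption for testing, not part of OpenCL):
   lets the kernel below build and run as ordinary C. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
#define __local
static size_t sim_local_id, sim_group_id;   /* set by the host-side check */
static size_t get_local_id(unsigned dim)  { (void)dim; return sim_local_id; }
static size_t get_global_id(unsigned dim) { (void)dim; return sim_group_id * WG_SIZE + sim_local_id; }
#endif

/* Must be launched with a local work size of WG_SIZE. */
__kernel void serial_work(__global const float *in, __global float *out)
{
    /* Before: float scratch[SCRATCH_LEN];  (private, may spill to scratch)
       After:  one __local array shared by the group, one slice per work item. */
    __local float scratch[SCRATCH_LEN * WG_SIZE];

    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    /* Phase 1: fill this work item's slice. Element i of work item lid
       lives at index i * WG_SIZE + lid (interleaved layout). */
    const float x = in[gid];
    for (int i = 0; i < SCRATCH_LEN; ++i) {
        scratch[i * WG_SIZE + lid] = x + (float)i;
    }

    /* Phase 2: serial pass reading the same slice back in reverse order.
       No barrier is needed because no work item reads another's slice. */
    float acc = 0.0f;
    for (int i = SCRATCH_LEN - 1; i >= 0; --i) {
        acc += scratch[i * WG_SIZE + lid];
    }
    out[gid] = acc;
}
```

The local memory cost in this sketch is WG_SIZE * SCRATCH_LEN * sizeof(float) = 16 KiB per work-group, and it has to stay under what the device reports for CL_DEVICE_LOCAL_MEM_SIZE. A larger per-work-item array only fits with a small work-group, so check the sizes against your device before trying this.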