I have a kernel that uses 720 scratch registers. Performance is low, even though occupancy is 60%.
The kernel is very serial: there is no LDS usage and no interaction between work items; each work item
just runs a serial algorithm on its own. Each work item uses a private array of size 3K, so I think this
array is what is causing such high scratch usage.
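
As a rough sanity check on that guess, the scratch count can be compared against the array size. This is only a sketch under two assumptions that are not confirmed above: that one scratch register is a 32-bit dword (4 bytes), and that "3K" means 3072 bytes. The helper names are made up for illustration.

```c
#include <stddef.h>

/* Assumption: one scratch register is a 32-bit dword (4 bytes). */
enum { BYTES_PER_SCRATCH_REG = 4 };

/* Bytes of scratch memory per work item implied by a scratch register count. */
size_t scratch_bytes(size_t scratch_regs) {
    return scratch_regs * BYTES_PER_SCRATCH_REG;
}

/* Ratio of scratch bytes to the size of a private array of array_bytes. */
double scratch_fraction(size_t scratch_regs, size_t array_bytes) {
    return (double)scratch_bytes(scratch_regs) / (double)array_bytes;
}
```

Under those assumptions, scratch_bytes(720) is 2880 bytes and scratch_fraction(720, 3072) is about 0.94, which would be consistent with nearly the whole private array living in scratch rather than in registers.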
Besides reducing the size of this array, what else can I do to eliminate scratch registers?