I use a fixed-size array held in registers to reduce the amount of data the kernel has to fetch.
At a certain size (11 elements), kernel performance drops considerably (a 3x slowdown) and 22 scratch registers are used.
Kernel occupancy is 25%, which corresponds to 8 waves per CU.
That is, instead of running a single workgroup of 4 waves with no scratch registers, the compiler decided to keep 8 waves (2 workgroups) per CU and introduce 22 scratch registers.
Since performance dropped so much, this is obviously a bad choice.
With an array size of 10 there are no scratch registers at all, 8 waves, and 31 VGPRs used (I am profiling the kernel on a Loveland GPU).
With an array size of 11 there are 22 scratch registers, 31 VGPRs, and again 8 waves.
Is it possible to tell the compiler not to use scratch registers and to decrease the number of waves in flight instead?
I expect much better performance with more register space per workitem, even if the number of waves in flight drops to only 4.
Here is the ISA for a length of 10:
And here it is for a length of 11: