According to the docs, I should be able to have:
mcall(0), (r32), (r32)
mcall(0), (r33), (r33)
mcall(0), (r32+n), (r32+n)
Indeed, this compiles just fine. But following that as an example: why, oh why, does it generate n case blocks when n/4 covers it? A jump table with n/4 entries is just as easy to build as one with n entries.
I'm not sure whether this is the reason for my super poor performance, but it is certainly an easy code-size win, which should also improve instruction locality. For now, I suppose I can add a restriction that n mod 4 == 0, or at least pad n so that the extra entries affect nothing, but this is pretty stupid. I just wonder: if I leave the empty cases out, will it still construct a jump table? Off to find out!
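To make the n versus n/4 point concrete, here is a minimal CPU-side sketch in C. It is only an analogy, not what the compiler emits, and it assumes that each register holds four components, so that one case per register can reach all four. The names `vec4`, `pad_to_4`, `load_per_element`, and `load_per_register` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for one four-component register (x, y, z, w). */
typedef struct { float c[4]; } vec4;

/* Padding workaround: round an element count up to the next multiple of 4. */
static size_t pad_to_4(size_t n) { return (n + 3) / 4 * 4; }

/* One case per element: 8 case blocks to reach 8 elements. */
static float load_per_element(const vec4 r[2], unsigned i)
{
    switch (i) {
    case 0: return r[0].c[0];
    case 1: return r[0].c[1];
    case 2: return r[0].c[2];
    case 3: return r[0].c[3];
    case 4: return r[1].c[0];
    case 5: return r[1].c[1];
    case 6: return r[1].c[2];
    case 7: return r[1].c[3];
    default: return 0.0f; /* out of range */
    }
}

/* One case per register: 2 case blocks (n/4), then select the component.
 * The switch stands in for register selection, which (by assumption)
 * cannot be done with a plain index on the GPU. */
static float load_per_register(const vec4 r[2], unsigned i)
{
    const vec4 *reg;
    switch (i / 4) {
    case 0: reg = &r[0]; break;
    case 1: reg = &r[1]; break;
    default: return 0.0f; /* out of range */
    }
    return reg->c[i % 4];
}
```

Both functions return the same value for every valid index, but the second needs a quarter of the case blocks. `pad_to_4` shows the padding restriction mentioned above: a count of 5 would be rounded up to 8 so that it fills whole registers.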
Yup, that was the cause of the problem. With all the intermediate cases, I saw a 33% decrease in performance. Without the intermediate cases, on the surface, I saw a 10% increase over when I had the scratch registers, but now I'm using so few registers that I think I can cram another wavefront onto the GPU.
Honestly, that should NOT have that level of impact. 40% is huge for something that should have zero impact, and it should be an easy one to fix!