According to the docs, I should be able to write:
mcall(0), (r32), (r32)
mcall(0), (r33), (r33)
mcall(0), (r32+n), (r32+n)
Indeed, this compiles just fine. But following that as an example, why, oh why, does it generate n case blocks when n/4 would cover it? A jump table/4 is just as easy to build as a jump table/1.
I'm not sure whether this is the reason for my very poor performance, but it is certainly an easy code-size win, which should also improve instruction locality. For now, I suppose I can add a restriction that n mod 4 == 0, or at least that n is padded so the padding affects nothing, but this is pretty stupid. I just wonder: if I leave the empty cases out, will it still construct a jump table? Off to find out!
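
For illustration only, here is a minimal C sketch of the size difference being described. It assumes each handler covers four consecutive selector values, which is just one reading of the n/4 remark above. The function names, the choice of n = 12, and the return codes are all made up for the example; none of them come from the docs or from the generated code. Whether a given compiler turns either switch into a jump table is its own decision.

```c
#include <assert.h>

/* Hypothetical dispatcher with one case label per selector value:
 * n = 12 labels, even though only 3 distinct bodies exist. */
int dispatch_per_value(unsigned sel)
{
    switch (sel) {
    case 0: case 1: case 2:  case 3:  return 100;
    case 4: case 5: case 6:  case 7:  return 200;
    case 8: case 9: case 10: case 11: return 300;
    default:                          return -1;
    }
}

/* Same behaviour with one case per group of four: n/4 = 3 labels.
 * The selector is divided by 4 (shift right by 2) before the switch,
 * so n is effectively padded up to a multiple of 4. */
int dispatch_per_group(unsigned sel)
{
    switch (sel >> 2) {
    case 0:  return 100;
    case 1:  return 200;
    case 2:  return 300;
    default: return -1;
    }
}

/* Checks that both dispatchers agree for every selector below limit. */
void check_dispatchers_agree(unsigned limit)
{
    for (unsigned sel = 0; sel < limit; ++sel) {
        assert(dispatch_per_value(sel) == dispatch_per_group(sel));
    }
}
```

Calling check_dispatchers_agree(16) exercises all twelve valid selectors plus four out-of-range ones. The grouped version only works because the groups are aligned on multiples of 4, which is the same restriction as requiring n mod 4 == 0 or padding n.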