Hi again.
I've just tried a few more cases with GROUP_SEQ_BEGIN/END and it seems to work, but not in the way I was hoping. The documentation has some conflicting statements. One says that each work item will run in sequence, the other that each wavefront in a workgroup will run sequentially. The second one seems more likely to be correct, though I can't be sure because, as you said, some hardware setup is needed.
As for the instructions I modify: the compiler never seems to generate ADD_PREV, MUL_PREV or MULADD_PREV. There's nothing magic about these instructions, but they help with ALU packing. There are also some float->int and int->float conversion instructions that can go into the xyzw slots, allowing 4 conversions per VLIW instruction instead of 1.
Also, the compiler doesn't seem to generate code that uses destination register modifiers such as ADD_SAT, MUL_SAT, MULADD/2, MULADD*2, etc.
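To make it clearer what I mean by destination modifiers, here is a rough C model of what I'd expect those forms to compute. This is only a sketch: the function names are mine, and it assumes that _SAT clamps the result to [0, 1] and that /2 and *2 scale the result before it is written. The real hardware may differ in rounding and NaN handling.

```c
/* Hypothetical model of destination modifiers; names are illustrative only. */

/* Assumed _SAT behaviour: clamp the result to the range [0, 1]. */
static float saturate(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* ADD_SAT: add, then saturate, in a single instruction. */
static float add_sat(float a, float b)
{
    return saturate(a + b);
}

/* MUL_SAT: multiply, then saturate, in a single instruction. */
static float mul_sat(float a, float b)
{
    return saturate(a * b);
}

/* MULADD/2: multiply-add with the result halved on output. */
static float muladd_div2(float a, float b, float c)
{
    return (a * b + c) * 0.5f;
}

/* MULADD*2: multiply-add with the result doubled on output. */
static float muladd_mul2(float a, float b, float c)
{
    return (a * b + c) * 2.0f;
}
```

Without the modifiers each of these needs extra instructions for the clamp or the scale, which is why I patch them in by hand.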
It also seems that the compiler always generates a SET_?? and PREDE_INT/PREDNE_INT pair of instructions instead of using a PRED_SET?? instruction directly.
Some time ago I wrote about a problem with the read_image function using the SAMPLE instruction and doing some math with the coordinates, instead of using LD when it's possible.
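As a rough illustration of why LD should be enough in that case, here is a small C model of a 1D image. It is only a sketch under my own assumptions: the function names are made up, the "sample" path is assumed to use nearest filtering with unnormalized coordinates and clamp-to-edge addressing, and the half-texel offset stands in for the coordinate math I mentioned.

```c
/* Hypothetical 1D image model; names and behaviour are assumptions. */

/* LD-style fetch: read the texel at an integer coordinate directly. */
static int load_texel(const int *texels, int x)
{
    return texels[x];
}

/* SAMPLE-style fetch: nearest filtering on an unnormalized float
 * coordinate, with clamp-to-edge addressing. */
static int sample_nearest(const int *texels, int width, float u)
{
    int i = (u <= 0.0f) ? 0 : (int)u;  /* truncation equals floor for u > 0 */
    if (i > width - 1)
        i = width - 1;
    return texels[i];
}

/* Integer-coordinate read routed through the sample path: the coordinate
 * is converted to float and moved to the texel centre first. */
static int read_via_sample(const int *texels, int width, int x)
{
    return sample_nearest(texels, width, (float)x + 0.5f);
}
```

For in-range integer coordinates both paths return the same texel, so the conversion and the extra add are wasted work.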