We finally went through and characterized one of the more elusive bugs that has been causing many problems in our code.
The conditions:
A bytealign/bitalign (by byte values only) inside a macro that is called from a function with literal values, using the same value for src0 and src1 to implement a rotation. (Convoluted conditions?)
The optimizer sees the constants and decides to pre-rotate for you. bitalign/bytealign rotate *RIGHT*. When the optimizer does the rotation for you, it *INCORRECTLY* rotates to the left. Unfortunately, in our code this same call is not always made with constant values. The initial inputs to the loop are literals, but as we get data off the network stream, it is combined with the constant data, which makes the data non-constant later on.
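To make the direction issue concrete, here is a small host-side model in Python. It is only a sketch: it assumes bytealign behaves like the documented amd_bytealign media op (treat src0:src1 as a 64-bit value, shift right by src2 & 3 bytes, keep the low 32 bits). The helper names bytealign, rotr32 and rotl32 are ours for illustration, not part of IL.

```python
MASK32 = 0xFFFFFFFF

def bytealign(src0, src1, src2):
    # Assumed semantics: shift the 64-bit value src0:src1 right by
    # (src2 & 3) bytes and keep the low 32 bits.
    wide = ((src0 & MASK32) << 32) | (src1 & MASK32)
    return (wide >> (8 * (src2 & 3))) & MASK32

def rotr32(x, bits):
    # Plain 32-bit rotate right.
    bits %= 32
    return ((x >> bits) | (x << (32 - bits))) & MASK32

def rotl32(x, bits):
    # Plain 32-bit rotate left (what the folded constant appears to get).
    bits %= 32
    return ((x << bits) | (x >> (32 - bits))) & MASK32

x = 0x00010203
right = bytealign(x, x, 3)   # src0 == src1, so this is a rotate right by 3 bytes
left = rotl32(x, 24)         # the same byte count rotated the wrong way
print(hex(right), hex(left))
```

With src0 == src1 the model matches rotr32 for every byte count, so a left-rotated result can only agree with it when the shift is 0 or 2 bytes.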
A colleague made a pretty simple test case showing this:
mdef(0)_out(1)_in(2)
bytealign out0,in0,in0,in1.x
mend
il_cs_2_0
dcl_num_thread_per_group 64
dcl_raw_uav_id(7)
dcl_literal l5, 0, 8, 0x30, 4
imul r42.z, l5.z, vAbsTidFlat.x
dcl_literal l16, 0, 16, 0, 0
dcl_literal l17, 1, 2, 3, 4
dcl_literal l19, 0x00010203, 0x04050607, 0x08090a0b, 0x0c0d0e0f
call 5
endmain
func 5
mov r43.x, r42.z
iadd r43.y, r43.x, l16.y
mcall(0) (r0),(l19,l17.zzzz)
uav_raw_store_id(7) mem, r43.x,l19
uav_raw_store_id(7) mem, r43.y,r0
ret
endfunc
end
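For checking the buffer read back from UAV 7, the Python sketch below computes what r0 should hold for this test case (l19 rotated right by l17.z = 3 bytes) and what a left rotation by the same amount would give instead. It assumes the same right-shift semantics for bytealign as described above; the names L19, SHIFT_BYTES, rot_right_bytes and rot_left_bytes are ours for illustration.

```python
MASK32 = 0xFFFFFFFF

# Literals taken from the test case: l19 and l17.z.
L19 = [0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F]
SHIFT_BYTES = 3

def rot_right_bytes(x, nbytes):
    # Expected behaviour of bytealign with src0 == src1 (assumed semantics).
    bits = 8 * (nbytes & 3)
    return ((x >> bits) | (x << (32 - bits))) & MASK32

def rot_left_bytes(x, nbytes):
    # The wrong direction, as observed when the constants are folded.
    bits = 8 * (nbytes & 3)
    return ((x << bits) | (x >> (32 - bits))) & MASK32

expected = [rot_right_bytes(v, SHIFT_BYTES) for v in L19]
wrong = [rot_left_bytes(v, SHIFT_BYTES) for v in L19]

for src, good, bad in zip(L19, expected, wrong):
    print(f"{src:08x}  right: {good:08x}  left: {bad:08x}")
```

Each thread stores l19 at byte offset 0x30 * tid and r0 at 16 bytes past that, so the second 16-byte group of each thread's slot is the one to compare against these values.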