Maybe this is a known issue, but it's a devious one. I've got a very well-optimized radix sort. It works fine when sorting keys only, but there is corruption when it pushes the values around too. The HLSL IL looks fine. My guess is that the Radeon driver is 'optimizing' out the LDS re-order operation because the values never take part in any ALU ops: they are only the inputs/outputs of global memory or LDS loads/stores.
I've prepared two shaders. (Edit: can the moderators contact me for the shaders if they want to investigate this?)
They are identical except for three lines (lines 696-698 in withhack.h):
ieq r2.xyzw, r4.xyzw, l(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff)
movc r0.xyzw, r2.xyzw, r3.xyzw, r0.xyzw
movc r1.xyzw, r2.xyzw, r6.xyzw, r1.xyzw
r0 and r1 hold the values in question. The condition is always false, so the two movc instructions leave r0 and r1 unmodified. Without this hack, the shader gets compiled into bad code. Every other instruction is exactly the same (I've checked with diff).
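For anyone who wants to sanity-check that the hack really is a no-op, here is a minimal Python sketch of the per-component ieq/movc semantics. It is only a model, not the shader itself: the register contents below are made up for illustration, and it assumes (as stated above) that r4 never holds 0x0000ffff.

```python
MASK32 = 0xFFFFFFFF

def ieq(a, b):
    """Per-component integer equality: 0xFFFFFFFF where equal, else 0."""
    return [MASK32 if (x & MASK32) == (y & MASK32) else 0 for x, y in zip(a, b)]

def movc(cond, if_true, if_false):
    """Per-component select: if_true where cond is non-zero, else if_false."""
    return [t if c != 0 else f for c, t, f in zip(cond, if_true, if_false)]

# Hypothetical register contents, chosen only for illustration.
r0 = [10, 11, 12, 13]      # values
r1 = [14, 15, 16, 17]      # values
r3 = [0xDEAD] * 4          # whatever happens to be in r3
r6 = [0xBEEF] * 4          # whatever happens to be in r6
r4 = [0, 1, 2, 3]          # assumed never to equal 0x0000ffff

# The three extra lines from withhack.h, modelled on the CPU.
r2 = ieq(r4, [0x0000FFFF] * 4)
new_r0 = movc(r2, r3, r0)
new_r1 = movc(r2, r6, r1)
```

With a condition that never matches, new_r0 and new_r1 come out identical to r0 and r1, so the only effect of the hack is that the values pass through ALU instructions.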
It appears that the r0 and r1 registers are being polluted with data pulled in from the global fetch that comes right after:
ld_structured_indexable(structured_buffer, stride=4)(mixed,mixed,mixed,mixed) r2.x, r2.x, l(0), t0.xxxx
store_structured g0.x, vThreadIDInGroupFlattened.x, l(0), r2.x
The r0 and r1 registers are not overwritten in the IL, but the data stored to shared memory (g0) still seems to be leaking into the UAV u1 writes at the bottom of the code.
The keys are sorted in lines 660-676, and the values are sorted the exact same way in 678-694.
I think I've been very good about using sync_g and sync_g_t when required. I've tried filling up the code with additional sync_g_ts and it doesn't make the problem go away. Only using ALU ops on the values makes it go away. I'm not posting the source because it's huge and full of macros.
The values are fetched on lines 650-657, and they are being fetched correctly. It's the scatter/gather on lines 678-694 that fails. If I comment that phase out, I get the expected (but unsorted) values in my UAV. The bug only happens when scatter/gathering through LDS and going straight to the UAV with no ALU ops in between.
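To make the failure easy to detect on the CPU side, a check along these lines can be run on the read-back buffers. This is a rough sketch, not my actual test harness: scatter and pairs_intact are hypothetical helper names, the scatter function only models "apply the same offsets to keys and values", and the check assumes the final output should be in ascending key order.

```python
from collections import Counter

def scatter(items, offsets):
    """Model of the LDS re-order: item i is written to slot offsets[i]."""
    out = [None] * len(items)
    for item, dst in zip(items, offsets):
        out[dst] = item
    return out

def pairs_intact(in_keys, in_vals, out_keys, out_vals):
    """True if the output is in key order and holds exactly the input (key, value) pairs."""
    in_order = all(a <= b for a, b in zip(out_keys, out_keys[1:]))
    same_pairs = Counter(zip(in_keys, in_vals)) == Counter(zip(out_keys, out_vals))
    return in_order and same_pairs

# Made-up data: the same offsets are applied to the keys and to the values.
keys = [3, 1, 2, 0]
vals = [30, 10, 20, 0]
offsets = [3, 1, 2, 0]

sorted_keys = scatter(keys, offsets)
sorted_vals = scatter(vals, offsets)
ok = pairs_intact(keys, vals, sorted_keys, sorted_vals)

# Simulate the corruption: one value replaced by unrelated data.
bad_vals = list(sorted_vals)
bad_vals[2] = 999
corrupted_ok = pairs_intact(keys, vals, sorted_keys, bad_vals)
```

Comparing (key, value) pairs rather than keys alone is the point here, since the keys-only sort passes while the values come back polluted.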
Can I get my hands on a tool that takes DX IL binary and gives a CAL ISA file?