1 Reply Latest reply on Jan 22, 2010 11:20 PM by BarnacleJunior

    Bug in D3D11 driver.  Doesn't scatter correctly.

    BarnacleJunior

      This must be a bug in the 5800 driver for cs_5_0, not the actual fxc compiler (for a change).  I've got a radix sort that works perfectly under REF and, with a kludge, works on hardware.  The problem comes when scattering values during each bit of the radix.  I'm not going to paste all the code, because it's rather long, but I'll put the relevant parts and all the D3D IL, so you can generate ISA from it.  Is there a tool to let me generate ISA from D3D IL yet?

      VALUES_PER_THREAD = 8.  That means I have typedef uint4 Counter[2]; Counter keys; Counter slots[1];  slots[0] holds the .y component of the keys array, and it's macro'd to hold additional values from other arrays.  I have groupshared uint sharedArray[VALUES_PER_THREAD * NUM_THREADS];  I scatter the keys last for each bit, so the key values remain in shared memory for the bucket count.

      The bug is that the values that are supposed to be scattered are not, and are in fact being overwritten with their corresponding keys.  When the shader completes, the .x and .y components are the same: both are sorted.  The .y component is lost or corrupted.  With some builds all 8 .y values are set to their corresponding key values.  With others, values 0 and 1 of each thread are retained but 2-7 are set to the corresponding keys.  You can see that here:
      http://www.earthrse.com/screenie/20100115-003737.png

      .second of [0] and [1] are good, but 2-7 are the same as .first!
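One way to quantify the symptom after reading the output buffer back to the CPU is sketched below. It is only an illustration: the function name `clobbered_slots` and the flat, thread-major readback layout (element n belongs to thread n // 8, slot n % 8) are assumptions, not part of the shader above. It also assumes the test input uses values that never equal their keys, so that a match indicates a lost value.

```python
VALUES_PER_THREAD = 8  # same constant as in the post


def clobbered_slots(pairs, values_per_thread=VALUES_PER_THREAD):
    """Count, per slot index, how many threads came back with value == key.

    pairs is a flat list of (key, value) tuples in the assumed
    thread-major readback order.
    """
    counts = [0] * values_per_thread
    for n, (key, value) in enumerate(pairs):
        if value == key:
            counts[n % values_per_thread] += 1
    return counts


if __name__ == "__main__":
    # Synthetic data imitating the screenshot, not real driver output:
    # two threads, slots 0 and 1 intact, slots 2-7 overwritten with the key.
    pairs = []
    for tid in range(2):
        for i in range(VALUES_PER_THREAD):
            key = 100 * tid + i
            value = key + 1000 if i < 2 else key
            pairs.append((key, value))
    print(clobbered_slots(pairs))
```

A non-zero count in slots 2-7 with zeros in slots 0 and 1 would match the pattern in the screenshot; non-zero counts in every slot would match the builds where all 8 values are lost.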

      If in the shader I perturb the .y component by adding 1 before scatter and subtract 1 after gather, I get the right values back in .y.  It's as if an ALU instruction is required to prevent the driver from optimizing out this scatter/gather.

      Here's the fxc IL:

      http://www.earthrse.com/screenie/RadixSortBlock_Pass1_ATI_KeyUint2x_ValueNone_512_4.h

       

       

      [unroll]
      for(uint bit = 0; bit < NUM_BITS; ++bit) {
          Counter scatter;
          ComputeScatter(tid, bit, keys, scatter);
          KeyToSharedIndex(scatter);

      #if (NUM_SLOTS > 0)
          [unroll]
          for(uint slot = 0; slot < NUM_SLOTS; ++slot) {
              // scatter the values
              // slots[0][0] += 1;
              // slots[0][1] += 1;
              [unroll]
              for(i = 0; i < VALUES_PER_THREAD; ++i)
                  sharedArray[scatter[i / 4][3 & i]] = slots[slot][i / 4][3 & i];
              barrier();

              // gather the values
              [unroll]
              for(i = 0; i < VALUES_PER_THREAD; ++i)
                  slots[slot][i / 4][3 & i] = sharedArray[SharedIndexToKey(tid, i)];
              // slots[0][0] -= 1;
              // slots[0][1] -= 1;
              barrier();
          }
      #endif

          // scatter the keys
          [unroll]
          for(i = 0; i < VALUES_PER_THREAD; ++i)
              sharedArray[scatter[i / 4][3 & i]] = keys[i / 4][3 & i];
          barrier();

          // gather the keys
          [unroll]
          for(i = 0; i < VALUES_PER_THREAD; ++i)
              keys[i / 4][3 & i] = sharedArray[SharedIndexToKey(tid, i)];
          barrier();
      }

      It's the scattering from/to slots[slot] (just slots[0] in what I'm building) that is failing.  SharedIndexToKey just returns VALUES_PER_THREAD * tid + i.
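For comparing against REF or hardware output, the scatter/gather round in the loop above can be modelled on the CPU roughly as follows. This is a sketch under stated assumptions, not the shader itself: the scatter indices are passed in as plain per-thread lists (standing in for whatever ComputeScatter and KeyToSharedIndex produce), the barrier is modelled by finishing all stores before any loads, and `scatter_gather` and `sort_round` are hypothetical names.

```python
VALUES_PER_THREAD = 8  # same constant as in the post


def shared_index_to_key(tid, i):
    # As described in the post: each thread gathers its own contiguous run.
    return VALUES_PER_THREAD * tid + i


def scatter_gather(regs, scatter):
    """One scatter + barrier + gather round.

    regs[tid][i]    : per-thread register contents before the round
    scatter[tid][i] : shared-array index that thread tid stores element i to
    Returns the per-thread register contents after the gather.
    """
    num_threads = len(regs)
    shared = [None] * (VALUES_PER_THREAD * num_threads)
    # Scatter phase: all stores complete before any load (the barrier).
    for tid in range(num_threads):
        for i in range(VALUES_PER_THREAD):
            shared[scatter[tid][i]] = regs[tid][i]
    # Gather phase: each thread reads back its contiguous run.
    return [
        [shared[shared_index_to_key(tid, i)] for i in range(VALUES_PER_THREAD)]
        for tid in range(num_threads)
    ]


def sort_round(keys, values, scatter):
    # Same order as the shader loop: values first, then keys,
    # both moved with the same scatter indices.
    new_values = scatter_gather(values, scatter)
    new_keys = scatter_gather(keys, scatter)
    return new_keys, new_values
```

Because keys and values are moved with the same indices, every key should still sit next to its original value after the round. That pairing is the property the hardware result breaks, so checking it on the readback is a quick pass/fail test.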
      I'm looping over just 1 bit in this build:

          store_structured g0.x, r7.x, l(0), r0.z
          store_structured g0.x, r7.y, l(0), r0.w
          store_structured g0.x, r7.z, l(0), r3.x
          store_structured g0.x, r7.w, l(0), r3.y
          store_structured g0.x, r8.x, l(0), r3.z
          store_structured g0.x, r8.y, l(0), r3.w
          store_structured g0.x, r8.z, l(0), r5.w
          store_structured g0.x, r8.w, l(0), r6.x
          sync_g_t
          ishl r0.x, vThreadIDInGroupFlattened.x, l(3)
          ld_structured r0.z, r0.x, l(0), g0.xxxx
          ld_structured r0.w, r12.y, l(0), g0.xxxx
          ld_structured r3.x, r12.z, l(0), g0.xxxx
          ld_structured r3.y, r12.w, l(0), g0.xxxx
          ld_structured r3.z, r10.x, l(0), g0.xxxx
          ld_structured r3.w, r10.y, l(0), g0.xxxx
          ld_structured r5.w, r10.z, l(0), g0.xxxx
          ld_structured r6.x, r10.w, l(0), g0.xxxx
          sync_g_t
          store_structured g0.x, r7.x, l(0), r1.x
          store_structured g0.x, r7.y, l(0), r1.y
          store_structured g0.x, r7.z, l(0), r1.z
          store_structured g0.x, r7.w, l(0), r1.w
          store_structured g0.x, r8.x, l(0), r4.x
          store_structured g0.x, r8.y, l(0), r4.y
          store_structured g0.x, r8.z, l(0), r4.z
          store_structured g0.x, r8.w, l(0), r4.w
          sync_g_t
          ld_structured r1.x, r0.x, l(0), g0.xxxx
          ld_structured r1.y, r12.y, l(0), g0.xxxx
          ld_structured r1.z, r12.z, l(0), g0.xxxx
          ld_structured r1.w, r12.w, l(0), g0.xxxx
          ld_structured r4.x, r10.x, l(0), g0.xxxx
          ld_structured r4.y, r10.y, l(0), g0.xxxx
          ld_structured r4.z, r10.z, l(0), g0.xxxx
          ld_structured r4.w, r10.w, l(0), g0.xxxx
          sync_g_t

      A bit confusing because it's not gathering the values back into the same registers they were scattered from, but you get the idea.  Adding 1 before scatter and subtracting 1 after gather prevents this bug.
      For example, if there is an ALU operation between the sets of LDS stores/reads, it works:

          iadd r13.xyzw, r10.xyzw, -r9.xyzw
          iadd r9.xyzw, r9.xyzw, r6.yyyy
          movc r8.xyzw, r8.xyzw, r9.xyzw, r13.xyzw
          store_structured g0.x, r7.x, l(0), r0.z
          store_structured g0.x, r7.y, l(0), r0.w
          store_structured g0.x, r7.z, l(0), r3.x
          store_structured g0.x, r7.w, l(0), r3.y
          store_structured g0.x, r8.x, l(0), r3.z
          store_structured g0.x, r8.y, l(0), r3.w
          store_structured g0.x, r8.z, l(0), r5.w
          store_structured g0.x, r8.w, l(0), r6.x
          sync_g_t
          ishl r0.x, vThreadIDInGroupFlattened.x, l(3)
          ld_structured r0.z, r0.x, l(0), g0.xxxx
          ld_structured r0.w, r12.y, l(0), g0.xxxx
          ld_structured r3.x, r12.z, l(0), g0.xxxx
          ld_structured r3.y, r12.w, l(0), g0.xxxx
          ld_structured r3.z, r10.x, l(0), g0.xxxx
          ld_structured r3.w, r10.y, l(0), g0.xxxx
          ld_structured r5.w, r10.z, l(0), g0.xxxx
          ld_structured r6.x, r10.w, l(0), g0.xxxx
          sync_g_t
          store_structured g0.x, r7.x, l(0), r1.x
          store_structured g0.x, r7.y, l(0), r1.y
          store_structured g0.x, r7.z, l(0), r1.z
          store_structured g0.x, r7.w, l(0), r1.w
          store_structured g0.x, r8.x, l(0), r4.x
          store_structured g0.x, r8.y, l(0), r4.y
          store_structured g0.x, r8.z, l(0), r4.z
          store_structured g0.x, r8.w, l(0), r4.w
          sync_g_t
          ld_structured r1.x, r0.x, l(0), g0.xxxx
          ld_structured r1.y, r12.y, l(0), g0.xxxx
          ld_structured r1.z, r12.z, l(0), g0.xxxx
          ld_structured r1.w, r12.w, l(0), g0.xxxx
          ld_structured r4.x, r10.x, l(0), g0.xxxx
          ld_structured r4.y, r10.y, l(0), g0.xxxx
          ld_structured r4.z, r10.z, l(0), g0.xxxx
          ld_structured r4.w, r10.w, l(0), g0.xxxx
          sync_g_t