4 Replies Latest reply on Jan 8, 2010 5:23 PM by BarnacleJunior

    D3D11 cs_5_0 compiler bug - probably

    BarnacleJunior

      My simple little scan routine has been exhibiting a problem under D3D11, which it doesn't under OpenCL.  I do a very basic scan over an array of 1s.  Values 0-47 are ok (they read 0, 1, 2, 3, ec) , but 48-63 all read 47, instead of continuing this increment.  The inclusive scan is ok.  It's only when I subtract the thread's value from the inclusive scan to get the exclusive scan that this problem happens.  It's a bug in the D3D compiler's optimizer, because if I compile with /Od, it works fine.

       

      #define WAVEFRONT 64 groupshared uint sharedSum[NUM_THREADS]; RWStructuredBuffer<uint> values : register(u0); void ThreadSum(uint tid, uint scansize) { uint lane = (WAVEFRONT - 1) & tid; uint laneMask = ~(WAVEFRONT - 1) & tid; [unroll] for(uint offset = 1; offset < scansize; offset<<= 1) { uint tid2 = ((lane - offset) & (WAVEFRONT - 1)) | laneMask; uint target = sharedSum[tid]; uint source = sharedSum[tid2]; bool cond = lane >= offset; target += cond ? source : 0; sharedSum[tid] = target; } } [numthreads(NUM_THREADS, 1, 1)] void Foo(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID) { uint gid = groupID.x; uint target = NUM_THREADS * gid + tid; uint val = values[target]; sharedSum[tid] = val; ThreadSum(tid, WAVEFRONT); uint inc = sharedSum[tid]; uint exc = inc - val; values[target] = exc; } fxc /T cs_5_0 /E Foo /D NUM_THREADS=64 /Fh Foo.h foo.hlsl Note that values[target] = inc; works. It's only when I subtract val from inc that it gets weird.

        • D3D11 cs_5_0 compiler bug - probably
          BarnacleJunior

          The following code also fails on CS5, with or without optimizations.  BTW this is the scan routine used in CUDPP.  If I uncomment the barrier it works.

          #define WAVEFRONT 64 groupshared uint sharedSum[2 * NUM_THREADS]; RWStructuredBuffer<uint> values : register(u0); #define barrier GroupMemoryBarrierWithGroupSync uint ThreadSumExc(uint val, uint tid, uint scansize) { uint index = 2 * tid - ((WAVEFRONT - 1) & tid); sharedSum[index] = 0; index += WAVEFRONT; sharedSum[index] = val; [unroll] for(uint offset = 1; offset < scansize; offset<<= 1) { // barrier(); sharedSum[index] += sharedSum[index - offset]; } return sharedSum[index - 1]; } [numthreads(NUM_THREADS, 1, 1)] void Foo(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID) { uint gid = groupID.x; uint target = NUM_THREADS * gid + tid; uint val = values[target]; uint exc = ThreadSumExc(val, tid, WAVEFRONT); values[target] = exc; }

            • D3D11 cs_5_0 compiler bug - probably
              BarnacleJunior

              Yes I think this second example is also a D3D problem. The following equivalent OpenCL shader works fine:

               

              #define WAVEFRONT 64 uint ThreadSum(uint val, uint tid, uint scansize, __local uint sharedSum[2 * NUM_THREADS]) { uint index = 2 * tid - ((WAVEFRONT - 1) & tid); sharedSum[index] = 0; index += WAVEFRONT; sharedSum[index] = val; for(uint offset = 1; offset < scansize; offset<<= 1) sharedSum[index] += sharedSum[index - offset]; return sharedSum[index - 1]; } __kernel __attribute__((reqd_work_group_size(NUM_THREADS, 1, 1))) void Foo(__global uint* values) { __local uint sharedSum[2 * NUM_THREADS]; uint tid = get_local_id(0); uint gid = get_group_id(0); uint target = NUM_THREADS * gid + tid; uint val = values[target]; uint exc = ThreadSum(val, tid, WAVEFRONT, sharedSum); values[target] = exc; }

                • D3D11 cs_5_0 compiler bug - probably
                  BarnacleJunior

                  The D3D compiler is indeed generating bad code.  I posted the disassembly here:

                  http://forums.xna.com/forums/p/45399/271377.aspx#271377

                  I hope there is a new compiler out soon, as the August SDK is a mess.

                    • D3D11 cs_5_0 compiler bug - probably
                      BarnacleJunior

                      The problem described in the XNA thread clears up if I zero those sharedSum elements, then barrier, then run my scan.  No idea why.  But it introduces a rather insidious problem.  When I try to scan the last inclusive element in each wavefront (as CUDPP CTA scan does) it fails..  It seems as if the register holding the original value is being overwriten or recycled somehow.

                      I'm building with NUM_THREADS=512 in D3D11.  The values array is filled with 1s so the scan should return 0, 1, 2, 3, etc.  When subtracting the value of the element to produce the exclusive scan value from the inclusive scan value, the results are messed up.  However, since I know val=1 always, I can subtract that as a hard coded term.. and it works in this case!  I've inspected the HLSL and it appears to be correct in both cases.  Maybe the ATI driver itself is getting confused with this?

                      So many problems just adding up numbers..

                      #define WAVEFRONT 64 groupshared uint sharedSum[2 * NUM_THREADS]; RWStructuredBuffer<uint> values : register(u0); #define barrier GroupMemoryBarrierWithGroupSync void PrepareThreadSum(uint tid) { uint index = 2 * tid - ((WAVEFRONT - 1) & tid); sharedSum[index] = 0; } void ThreadSumInternal(uint index, uint scansize) { [unroll] for(uint offset = 1; offset < scansize; offset<<= 1) { volatile uint target = sharedSum[index]; target += sharedSum[index - offset]; sharedSum[index] = target; } } uint ThreadSumExc(uint val, uint tid, uint scansize) { uint index = 2 * tid - ((WAVEFRONT - 1) & tid) + WAVEFRONT; sharedSum[index] = val; ThreadSumInternal(index, WAVEFRONT); #if NUM_THREADS > WAVEFRONT // perform a secondary scan barrier(); uint wavefront = tid / WAVEFRONT; uint inc = sharedSum[index]; barrier(); if((WAVEFRONT - 1) == ((WAVEFRONT - 1) & tid)) sharedSum[WAVEFRONT + wavefront] = inc; barrier(); if(tid < WAVEFRONT) ThreadSumInternal(index, NUM_THREADS / WAVEFRONT); barrier(); // this line FAILS return sharedSum[WAVEFRONT + wavefront - 1] + inc - val; // this line WORKS.. yet val is always 1! wtf return sharedSum[WAVEFRONT + wavefront - 1] + inc - 1; #else return sharedSum[index] - val; #endif } [numthreads(NUM_THREADS, 1, 1)] void Foo(uint tid : SV_GroupIndex, uint3 groupID : SV_GroupID) { uint gid = groupID.x; uint target = NUM_THREADS * gid + tid; PrepareThreadSum(tid); barrier(); uint val = values[target]; uint exc = ThreadSumExc(val, tid, WAVEFRONT); values[target] = exc; }