8 Replies Latest reply on Sep 13, 2018 2:24 AM by optimiz3

    Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

    gfmatt

      I've been writing a number of compute shaders in DirectX/HLSL assembly (assembled into DirectX bytecode).  Many of these shaders perform 32-bit rotates.  While studying their corresponding .isa files, I've noticed that I can generate 32-bit rotates that use either two 32-bit shifts and an or/xor (generated from one ISHL, one USHR, and one OR/XOR), or one 64-bit shift (generated from one USHR and one BFI).  According to RGA/Instruction.cpp at master · GPUOpen-Tools/RGA · GitHub, v_alignbit_b32 should be superior to two shifts and an xor (4 cycles vs 12).  I'm not sure how v_lshlrev_b64 compares, as it is inexplicably missing from that file, but in the tests I've been running it hasn't been as much of an improvement as I'd hoped.
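
For reference, the two rotate forms being compared can be sketched in C. This is only a software model: it assumes v_alignbit_b32 acts as a funnel shift (concatenate two 32-bit inputs, shift right by the low 5 bits of the amount, keep the low 32 bits). The names `alignbit32`, `rotr32_shifts` and `rotr32_alignbit` are made up for illustration and are not driver or HLSL functions, and the model says nothing about cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of v_alignbit_b32: treat hi:lo as one 64-bit value,
   shift right by the low 5 bits of `shift`, keep the low 32 bits. */
uint32_t alignbit32(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (shift & 31u));
}

/* Rotate right as currently emitted: two shifts and an OR.
   n must be 1..31, since shifting a 32-bit value by 32 is undefined in C. */
uint32_t rotr32_shifts(uint32_t x, uint32_t n)
{
    assert(n >= 1u && n <= 31u);
    return (x >> n) | (x << (32u - n));
}

/* The same rotate as a single funnel shift with both inputs equal. */
uint32_t rotr32_alignbit(uint32_t x, uint32_t n)
{
    return alignbit32(x, x, n);
}
```

For example, `rotr32_shifts(0x12345678, 5)` and `rotr32_alignbit(0x12345678, 5)` both give 0xC091A2B3.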

       

      With that in mind, is there any way to structure my DirectX/HLSL assembly so that it uses v_alignbit_b32 for 32bit rotates?  If not, is that likely to change in future updates to the driver?

       

      I'm running a Radeon RX 580 with up to date drivers.

        • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
          xhuang

          Hello dipak, could you help to contact the DX/compiler team?

          • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
            optimiz3

            Thanks for posting this!  We've run into this problem too, on all GCN devices from Pitcairn to Vega.

             

            It's a huge pain point because a rotate implemented as a 64-bit shift costs way more than a bitalign, which is a 4-cycle (lowest-cost) op on GCN.

             

            Repros on everything from DX10 ShaderModel 4 to DX12 ShaderModel 5 DXBC shaders.  Also, in many cases it's valid to reuse an SM4 DXBC shader across DX11/DX12 drivers to save binary space.  It would be wonderful if this could be fixed at all levels.

             

            This would hugely help our customers: a large number of them have various generations of GCN hardware, and frankly NVidia's drivers do a lot better here (their equivalent instruction is called a funnel shift and has been around since Kepler).

             

             

            Affected platforms:

            Southern Islands

            Sea Islands

            Volcanic Islands

            Arctic Islands

             

            Tested scenarios:

            DX11 w/ Shader Model 4.0 DXBC

            DX11 w/ Shader Model 5.0 DXBC

            DX12 w/ Shader Model 4.0 DXBC

            DX12 w/ Shader Model 5.0 DXBC

             

            Justification:

            n-way 32-bit shifts are dramatically slower when implemented as a 64-bit shift instead of V_ALIGNBIT.

             

            Business cases:

            BigInteger multiplication/division by power of 2 (chained shifts)

            AES encryption/decryption (S-box lookup followed by 32-bit rotate)

            64-bit integer shift/rotate (chained rotates)

            SHA-256 32-bit rotate (ex: A >> 13 | A << 19)
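
To make the chained-shift case concrete, here is a minimal C sketch of a BigInteger divide by a power of 2, where every output word is one funnel shift of two neighbouring words. The helper names `alignbit32` and `bigint_shr`, the little-endian limb layout, and the 0..31 shift limit are assumptions for illustration only; `alignbit32` is a software model of what v_alignbit_b32 is assumed to compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed model of v_alignbit_b32: low 32 bits of (hi:lo >> (shift & 31)). */
uint32_t alignbit32(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (shift & 31u));
}

/* Shift a little-endian array of 32-bit limbs right by `shift` bits in place,
   i.e. divide the big integer by 2^shift. One funnel shift per limb. */
void bigint_shr(uint32_t *limbs, size_t count, uint32_t shift)
{
    assert(shift <= 31u);
    for (size_t i = 0; i < count; ++i) {
        /* The limb above supplies the bits shifted in; zero past the top. */
        uint32_t hi = (i + 1 < count) ? limbs[i + 1] : 0u;
        limbs[i] = alignbit32(hi, limbs[i], shift);
    }
}
```

With two shifts and an OR per limb the same loop needs three ALU ops per word instead of one, which is where the cost difference adds up.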

             

            Scenario 1: 64-bit rotate (should generalize to n*32-bit rotate)

              Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.

             

              C-pseudo-code:

                uint64_t rotate_right64(uint64_t r0, uint8_t shift = 5)

                {   

                   return r0 >> shift | r0 << (64-shift);

                }

             

              HLSL-pseudo-code:

                uint2 rotate_right64(uint2 r0, uint shift = 5)

                {

                   uint2 r1 = r0.xy >> shift;

                   return r1.xy | r0.yx << (32-shift);

                }

             

              HLSL ShaderModel 4 and 5:

                DXBC:

                  ushr r0.zw, r0.yyyx, l(5)

                  ishl r0.xy, r0.xyxx, l(27)

                  iadd r0.xy, r0.xyxx, r0.zwzz // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD
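
The reason OR, XOR, and ADD are interchangeable here is that the two shifted halves never have a set bit in the same position, so an ADD cannot produce a carry. A small C sketch of that argument follows; the function names are hypothetical and the shift amount is assumed to be 1..31.

```c
#include <assert.h>
#include <stdint.h>

/* (lo >> n) can only set bits [31-n:0]; (hi << (32-n)) can only set
   bits [31:32-n]. The ranges are disjoint, so the three combines agree. */
uint32_t funnel_or(uint32_t hi, uint32_t lo, uint32_t n)
{
    assert(n >= 1u && n <= 31u);
    return (lo >> n) | (hi << (32u - n));
}

uint32_t funnel_xor(uint32_t hi, uint32_t lo, uint32_t n)
{
    assert(n >= 1u && n <= 31u);
    return (lo >> n) ^ (hi << (32u - n));
}

uint32_t funnel_add(uint32_t hi, uint32_t lo, uint32_t n)
{
    assert(n >= 1u && n <= 31u);
    return (lo >> n) + (hi << (32u - n));
}
```

A pattern matcher that wants to emit a single bitalign would therefore have to recognize all three combining ops, not just OR.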

             

                amdil (expected):

                  dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  bitalign r0.z, r0.x, r0.y, l1

                  bitalign r0.w, r0.y, r0.x, l1

             

                amdil (actual):

                  dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  ushr r0.__zw, r0.yyyx, l1

                  dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

                  ishl r0.xy__, r0.xyxx, l2

                  iadd r0.xy__, r0.xyxx, r0.zwzz

             

                GCN ISA (expected):

                  v_alignbit_b32 v2, v0, v1, 5

                  v_alignbit_b32 v3, v1, v0, 5
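
As a sanity check on the expected sequence, the following C sketch models a 64-bit rotate as two funnel shifts with the operands swapped. `alignbit32` is an assumed software model of v_alignbit_b32, and `rotr64_reference` / `rotr64_two_alignbits` are illustrative names; the shift is restricted to 1..31 to match the 32-bit ops in the DXBC.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of v_alignbit_b32: low 32 bits of (hi:lo >> (shift & 31)). */
uint32_t alignbit32(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t pair = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(pair >> (shift & 31u));
}

/* Plain 64-bit rotate right, used as the reference result. */
uint64_t rotr64_reference(uint64_t v, uint32_t n)
{
    assert(n >= 1u && n <= 31u);
    return (v >> n) | (v << (64u - n));
}

/* The same rotate built from two 32-bit funnel shifts. */
uint64_t rotr64_two_alignbits(uint64_t v, uint32_t n)
{
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    uint32_t out_lo = alignbit32(hi, lo, n); /* high word feeds the low result */
    uint32_t out_hi = alignbit32(lo, hi, n); /* low word wraps into the high result */
    return ((uint64_t)out_hi << 32) | out_lo;
}
```

For example, rotating 0x0123456789ABCDEF right by 5 gives 0x78091A2B3C4D5E6F with either function.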

             

                GCN ISA (actual):

                  v_lshrrev_b32  v3, 5, v2

                  v_lshrrev_b32  v4, 5, v1

                  v_lshlrev_b32  v1, 27, v1

                  v_lshlrev_b32  v2, 27, v2

                  v_add_u32     v1, vcc, v3, v1

                  v_add_u32     v2, vcc, v4, v2

             

              HLSL ShaderModel 5:

                DXBC:

                  ushr r0.zw, r0.yyyx, l(0, 0, 5, 5)

                  bfi r0.xy, l(5, 5, 0, 0), l(27, 27, 0, 0), r0.xyxx, r0.zwzz

             

                amdil (expected):

                  dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  bitalign r0.z, r0.x, r0.y, l1

                  bitalign r0.w, r0.y, r0.x, l1

             

                amdil (actual):

                  dcl_literal l1, 0x00000000, 0x00000000, 0x00000005, 0x00000005

                  ushr r0.__zw, r0.yyyx, l1

                  dcl_literal l2, 0x00000005, 0x00000005, 0x00000000, 0x00000000

                  dcl_literal l3, 0x0000001B, 0x0000001B, 0x00000000, 0x00000000

                  ubit_insert r0.xy__, l2, l3, r0.xyxx, r0.zwzz

             

                GCN ISA (expected):

                  v_alignbit_b32 v2, v0, v1, 5

                  v_alignbit_b32 v3, v1, v0, 5

             

                GCN ISA (actual):

                  v_mov_b32     v3, v1

                  v_lshrrev_b64  v[3:4], 5, v[2:3]

                  v_lshrrev_b64  v[4:5], 5, v[1:2]

             

             

            Scenario 2: 32-bit rotate

              Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.

             

              C-pseudo-code:

                uint32_t rotate_right32(uint32_t r0, uint8_t shift = 5)

                {   

                   return r0 >> shift | r0 << (32-shift);

                }

             

              HLSL-pseudo-code:

                uint rotate_right32(uint r0, uint shift = 5)

                {

                   return r0.x >> shift | r0.x << (32-shift);

                }

             

              HLSL ShaderModel 4 and 5:

                DXBC:

                  ushr r1.x, r0.x, l(5)

                  ishl r0.x, r0.x, l(27)

                  or r0.x, r0.x, r1.x     // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD

             

                amdil (expected):

                  dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  mov r0.y, r0.x

                  bitalign r0.x, r0.x, r0.y, l1

             

                amdil (actual):

                  dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  ushr r1.x___, r0.x, l1

                  dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

                  ishl r0.x___, r0.x, l2

                  ior r0.x___, r0.x, r1.x

             

                GCN ISA (expected):

                  v_mov_b32 v1, v0

                  v_alignbit_b32 v0, v0, v1, 5

             

                GCN ISA (actual):

                  v_lshrrev_b32  v4, 5, v1

                  v_lshlrev_b32  v5, 27, v1

                  v_or_b32      v4, v4, v5

             

              HLSL ShaderModel 5:

                DXBC:

                  ushr r1.x, r0.x, l(5)

                  bfi r0.x, l(5), l(27), r0.x, r1.x
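
To see why this ushr + bfi pair is also a rotate, here is a small C model of the Shader Model 5 `bfi` instruction as I understand its documented semantics (width and offset taken from the low 5 bits, `src` shifted into the field, everything else taken from `base`). `dxbc_bfi` and `rotr32_via_bfi` are illustrative names, not part of any API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of DXBC bfi: insert the low `width` bits of `src`
   into `base` starting at bit `offset`. */
uint32_t dxbc_bfi(uint32_t width, uint32_t offset, uint32_t src, uint32_t base)
{
    width &= 31u;
    offset &= 31u;
    uint32_t mask = ((1u << width) - 1u) << offset;
    return ((src << offset) & mask) | (base & ~mask);
}

/* Rotate right by n (1..31): the ushr result is the base, and the low n
   bits of x are inserted at the top, as in bfi l(5), l(27), r0.x, r1.x. */
uint32_t rotr32_via_bfi(uint32_t x, uint32_t n)
{
    assert(n >= 1u && n <= 31u);
    return dxbc_bfi(n, 32u - n, x, x >> n);
}
```

With n = 5 this matches the listing: width 5, offset 27, and `rotr32_via_bfi(0x12345678, 5)` returns 0xC091A2B3, the same value a single bitalign of the register with itself would produce.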

             

                amdil (expected):

                  dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  mov r0.y, r0.x

                  bitalign r0.x, r0.x, r0.y, l1

             

                amdil (actual):

                  dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  ushr r1.x___, r0.x, l1

                  dcl_literal l2, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                  dcl_literal l3, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

                  ubit_insert r0.x___, l2, l3, r0.x, r1.x

             

                GCN ISA (expected):

                  v_mov_b32 v1, v0

                  v_alignbit_b32 v0, v0, v1, 5

             

                GCN ISA (actual):

                  v_mov_b32     v5, v6

                  v_lshrrev_b64  v[4:5], 5, v[5:6]