17 Replies Latest reply on Nov 2, 2018 6:04 AM by dipak

    Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

    gfmatt

      I've been writing a number of compute shaders in DirectX/HLSL assembly (assembled into DirectX bytecode).  Many of these shaders perform 32-bit rotates.  While studying their corresponding .isa files, I've noticed that I can generate 32-bit rotates that use either two 32-bit shifts and an or/xor (generated from one ISHL, one USHR, and one OR/XOR), or one 64-bit shift (generated from one USHR and one BFI).  According to RGA/Instruction.cpp at master · GPUOpen-Tools/RGA · GitHub, v_alignbit_b32 would be superior to using two shifts and an xor (4 cycles vs 12).  I'm not sure how v_lshlrev_b64 compares, as it is missing from that file, but in the tests I've been running it doesn't seem to be as much of an improvement as I'd hoped.

       

      With that in mind, is there any way to structure my DirectX/HLSL assembly so that it uses v_alignbit_b32 for 32bit rotates?  If not, is that likely to change in future updates to the driver?
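For readers following along, here is a minimal C sketch of why a 32-bit rotate maps onto a single funnel shift. The helper `alignbit_b32` is a hypothetical software model of v_alignbit_b32 (concatenate two 32-bit sources, shift right by the low 5 bits of the third, keep the low 32 bits), based on my reading of the GCN ISA documentation; the function names are mine and not part of any driver or compiler API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of v_alignbit_b32: D = ({hi, lo} >> s[4:0]) & 0xffffffff. */
static uint32_t alignbit_b32(uint32_t hi, uint32_t lo, uint32_t s)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (s & 31u));
}

/* Rotate right written as two shifts and an OR (USHR + ISHL + OR). */
static uint32_t rotr32_shifts(uint32_t x, uint32_t s)
{
    assert(s >= 1 && s <= 31);  /* x << 32 would be undefined in C */
    return (x >> s) | (x << (32u - s));
}

/* The same rotate as one funnel shift with both sources equal. */
static uint32_t rotr32_alignbit(uint32_t x, uint32_t s)
{
    return alignbit_b32(x, x, s);
}
```

Both functions return the same value for any shift from 1 to 31, which is the equivalence the compiler would have to recognize in the ISHL/USHR/OR sequence.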

       

      I'm running a Radeon RX 580 with up to date drivers.

        • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
          xhuang

          Hello dipak, could you help to contact the DX/compiler team?

            • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
              dipak

              I have already forwarded this query to the DX/HLSL compiler team. Once I get any feedback, I will post.

                • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
                  gfmatt

                  I'd appreciate that.  As an addendum, could you also forward this note about 64bit rotates?

                   

                  I believe that the most efficient way to perform a 64bit rotate is with one v_mov_b32 and two v_alignbit_b32 instructions.  However, I've found that when I write a 64bit rotate that uses (HLSL) two USHR and two BFI instructions, it compiles to two v_lshrrev_b64 and two v_mov_b32 instructions, which (assuming that v_lshrrev_b64 requires >= 4 cycles... a fair assumption considering v_lshrrev_b32 requires 4 cycles) must be inferior to the v_alignbit_b32 method.  If it would help, I can start putting together a bitbucket repository for a more concrete bug report.
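As an illustration of the two-alignbit approach described above, here is a rough C sketch of a 64-bit rotate built from two funnel shifts over the 32-bit halves. `alignbit_b32` is again a hypothetical software model of the instruction (my assumption of its semantics), and the sketch only covers shift amounts from 0 to 31; larger amounts would additionally need the halves swapped.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of v_alignbit_b32: D = ({hi, lo} >> s[4:0]) & 0xffffffff. */
static uint32_t alignbit_b32(uint32_t hi, uint32_t lo, uint32_t s)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (s & 31u));
}

/* Rotate a 64-bit value right by s (0..31) using two funnel shifts. */
static uint64_t rotr64_alignbit(uint64_t x, uint32_t s)
{
    assert(s <= 31);
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo = alignbit_b32(hi, lo, s);  /* (lo >> s) | (hi << (32 - s)) */
    uint32_t new_hi = alignbit_b32(lo, hi, s);  /* (hi >> s) | (lo << (32 - s)) */
    return ((uint64_t)new_hi << 32) | new_lo;
}

/* Plain 64-bit reference rotate for comparison (1..63). */
static uint64_t rotr64_reference(uint64_t x, uint32_t s)
{
    assert(s >= 1 && s <= 63);
    return (x >> s) | (x << (64u - s));
}
```

The extra v_mov_b32 mentioned above has no counterpart in the C version; it would only be needed on hardware when one of the halves has to be copied before it is overwritten.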

                  • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
                    optimiz3

                    Thanks dipak - I wrote up a more formal report of all the scenarios being affected below!

                    • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
                      gfmatt

                      Hi Dipak, with the latest driver updates, I've found that I can now generate the v_alignbit_b32 instruction from a 32-bit rotate, and my shaders which use it have benefited significantly.  Thanks for your communication and thanks to the development team for their work.

                       

                      However, I haven't been able to generate the v_alignbit_b32 instruction for 64bit rotates.  I've found the following DXBC

                       

                      (1)

                           ushr r1.y, r0.x, l(19)

                           ishl r1.x, r0.x, l(13)

                           or r2.x, r1.x, r1.y

                       

                      generates

                       

                           v_alignbit_b32  v5, v1, v1, 19

                       

                      ...but the following DXBC

                       

                      (2)

                           ushr r1.y, r0.x, l(19)

                           ishl r1.x, r0.y, l(13)

                           or r2.x, r1.x, r1.y

                       

                      generates

                       

                           v_lshrrev_b32     v5, 19, v1

                           v_lshlrev_b32     v6, 13, v2

                           v_or_b32            v5, v5, v6

                       

                      Note that the only difference between the code that generates v_alignbit_b32 and the code that doesn't is whether the same register is both right-shifted and left-shifted.  I've tried other orderings, as well as using iadds and xors instead of ors, but no luck there.  Are there any plans for a future driver update that might allow a pattern like (2) to generate the v_alignbit_b32 instruction?  It would be enormously helpful to us here for 64-bit rotates and the like.
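To make the request concrete, here is a small C sketch of pattern (2) with two different source registers. It assumes the same hypothetical `alignbit_b32` model of the instruction as a 64-bit concatenate-and-shift; the `funnel_*` names are mine. Because the two shifted values occupy disjoint bit ranges (bits 0..12 and bits 13..31), combining them with OR, XOR, or ADD gives the same result, so all three forms could in principle map to one funnel shift.

```c
#include <stdint.h>

/* Assumed model of v_alignbit_b32: D = ({hi, lo} >> s[4:0]) & 0xffffffff. */
static uint32_t alignbit_b32(uint32_t hi, uint32_t lo, uint32_t s)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (s & 31u));
}

/* Pattern (2): ushr a by 19, ishl b by 13, then combine. */
static uint32_t funnel_or(uint32_t a, uint32_t b)  { return (a >> 19) | (b << 13); }
static uint32_t funnel_xor(uint32_t a, uint32_t b) { return (a >> 19) ^ (b << 13); }
static uint32_t funnel_add(uint32_t a, uint32_t b) { return (a >> 19) + (b << 13); }

/* The same value from one funnel shift: b supplies the high word. */
static uint32_t funnel_alignbit(uint32_t a, uint32_t b)
{
    return alignbit_b32(b, a, 19);
}
```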

                       

                      On a somewhat related note, I've also found that I'm unable to generate the v_mul_u32_u24 instruction.  For instance

                       

                      (3)

                           and r1.x, r0.x, l(0x00ffffff)

                           and r1.y, r0.y, l(0x00ffffff)

                           umul r2.x, r2.y, r1.x, r1.y

                       

                      generates

                       

                           v_and_b32 v5, 0x00ffffff, v1

                           v_and_b32 v6, 0x00ffffff, v2

                           v_mul_hi_u32 v7, v5, v6

                           v_mul_lo_u32 v5, v5, v6

                       

                      (4)

                       

                           and r1.x, r0.z, l(0x00ffffff)

                           umul r2.x, r2.y, r1.x, l(3)

                       

                      generates

                       

                           v_and_b32 v5, 0x00ffffff, v3

                           v_mul_hi_u32 v6, v5, 3

                           v_mul_lo_u32 v5, v5, 3

                       

                      (5)

                       

                           and r1.x, r0.w, l(0x00ffffff)

                           umul r2.x, r2.y, l(7), r1.x

                       

                      generates

                       

                           v_and_b32 v5, 0x00ffffff, v4

                           v_mul_hi_u32 v6, 7, v5

                           v_mul_lo_u32 v5, 7, v5

                       

                      In all of these cases, generating v_mul_u32_u24/v_mul_hi_u32_u24 would be preferable.  I would really appreciate it if you could bring this to the DX/HLSL compiler team.
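For reference, a minimal C sketch of why patterns (3) to (5) are candidates for the 24-bit multiplies. The two `mul_*_u24` functions are hypothetical software models based on my reading of the GCN ISA documentation (only bits [23:0] of each source are used, and the product is at most 48 bits wide); `and_and_umul` mirrors the AND/AND/UMUL sequence in (3). None of these names come from the driver.

```c
#include <stdint.h>

/* Assumed model of v_mul_u32_u24: low 32 bits of the 24x24-bit product. */
static uint32_t mul_u32_u24(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)(a & 0x00ffffffu) * (b & 0x00ffffffu);
    return (uint32_t)p;
}

/* Assumed model of v_mul_hi_u32_u24: bits [47:32] of the same product. */
static uint32_t mul_hi_u32_u24(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)(a & 0x00ffffffu) * (b & 0x00ffffffu);
    return (uint32_t)(p >> 32);
}

/* DXBC pattern (3): mask both inputs to 24 bits, then a full 32x32 UMUL. */
static void and_and_umul(uint32_t x, uint32_t y, uint32_t *hi, uint32_t *lo)
{
    uint64_t p = (uint64_t)(x & 0x00ffffffu) * (uint64_t)(y & 0x00ffffffu);
    *hi = (uint32_t)(p >> 32);
    *lo = (uint32_t)p;
}
```

Since the masking is already implied by the 24-bit instructions under this model, the explicit v_and_b32 instructions would become redundant as well.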

                       

                      Thank you for all your work, it's really making a difference for us.

                  • Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?
                    optimiz3

                    Thanks for posting this!  We've also run into this problem on all GCN devices from Pitcairn to Vega.

                     

                    It's a huge pain point because a rotate implemented as a 64 bit shift costs way more than a bitalign which is a 4 cycle (lowest cost) op on GCN.

                     

                    Repros on everything from DX10 ShaderModel 4 to DX12 ShaderModel 5 DXBC shaders.  Also, in many cases it's valid to reuse an SM4 DXBC shader across DX11/DX12 drivers to save binary space.  It would be wonderful if this could be fixed at all levels.

                     

                    This would hugely help our customers, a large number have various generations of GCN hardware and frankly NVidia's drivers do a lot better here (their equivalent instruction is called a funnel shift and has been around since Kepler).

                     

                     

                    Affected platforms:

                    Southern Islands

                    Sea Islands

                    Volcanic Islands

                    Arctic Islands

                     

                    Tested scenarios:

                    DX11 w/ Shader Model 4.0 DXBC

                    DX11 w/ Shader Model 5.0 DXBC

                    DX12 w/ Shader Model 4.0 DXBC

                    DX12 w/ Shader Model 5.0 DXBC

                     

                    Justification:

                    n-way 32bit shifts are dramatically slower when implemented as a 64-bit shift instead of V_ALIGNBIT.

                     

                    Business cases:

                    BigInteger multiplication/division by power of 2 (chained shifts)

                    AES encryption/decryption (S-box lookup followed by 32-bit rotate)

                    64-bit integer shift/rotate (chained rotates)

                    SHA-256 32-bit rotate (ex: A >> 13 | A << 19)
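The chained-shift case (BigInteger division by a power of 2) can be sketched in C as follows. This is only an illustration under assumptions: `alignbit_b32` is a hypothetical software model of V_ALIGNBIT, the integer is stored as little-endian 32-bit words, and the shift amount is limited to 0..31 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed model of v_alignbit_b32: D = ({hi, lo} >> s[4:0]) & 0xffffffff. */
static uint32_t alignbit_b32(uint32_t hi, uint32_t lo, uint32_t s)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (s & 31u));
}

/* Shift an n-word little-endian big integer right by s bits, in place.
   Each output word needs one funnel shift over word i and word i + 1. */
static void bigint_shr(uint32_t *w, size_t n, uint32_t s)
{
    assert(s <= 31);
    for (size_t i = 0; i < n; ++i) {
        uint32_t upper = (i + 1 < n) ? w[i + 1] : 0u;  /* zero-fill at the top */
        w[i] = alignbit_b32(upper, w[i], s);
    }
}
```

With a funnel shift this is one operation per word; written as separate shifts it is two shifts plus an OR (or ADD) per word, which is the cost difference the justification above refers to.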

                     

                    Scenario 1: 64-bit rotate (should generalize to n*32-bit rotate)

                      Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.

                     

                      C-pseudo-code:

                        uint64_t rotate_right64(uint64_t r0, uint8_t shift = 5)

                        {   

                           return r0 >> shift | r0 << (64-shift);

                        }

                     

                      HLSL-pseudo-code:

                        uint2 rotate_right64(uint2 r0, uint shift = 5)

                        {

                           uint2 r1 = r0.xy >> shift;

                           return r1.xy | r0.yx << (32-shift);  // each 32-bit half takes its top bits from the other half

                        }

                     

                      HLSL ShaderModel 4 and 5:

                        DXBC:

                          ushr r0.zw, r0.yyyx, l(5)

                          ishl r0.xy, r0.xyxx, l(27)

                          iadd r0.xy, r0.xyxx, r0.zwzz // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD

                     

                        amdil (expected):

                          dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          bitalign r0.z, r0.x, r0.y, l1

                          bitalign r0.w, r0.y, r0.x, l1

                     

                        amdil (actual):

                          dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          ushr r0.__zw, r0.yyyx, l1

                          dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

                          ishl r0.xy__, r0.xyxx, l2

                          iadd r0.xy__, r0.xyxx, r0.zwzz

                     

                        GCN ISA (expected):

                          v_alignbit_b32 v2, v0, v1, 5

                          v_alignbit_b32 v3, v1, v0, 5

                     

                        GCN ISA (actual):

                          v_lshrrev_b32  v3, 5, v2

                          v_lshrrev_b32  v4, 5, v1

                          v_lshlrev_b32  v1, 27, v1

                          v_lshlrev_b32  v2, 27, v2

                          v_add_u32     v1, vcc, v3, v1

                          v_add_u32     v2, vcc, v4, v2

                     

                       HLSL ShaderModel 5:

                        DXBC:

                          ushr r0.zw, r0.yyyx, l(0, 0, 5, 5)

                          bfi r0.xy, l(5, 5, 0, 0), l(27, 27, 0, 0), r0.xyxx, r0.zwzz

                     

                        amdil (expected):

                          dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          bitalign r0.z, r0.x, r0.y, l1

                          bitalign r0.w, r0.y, r0.x, l1

                     

                        amdil (actual):

                          dcl_literal l1, 0x00000000, 0x00000000, 0x00000005, 0x00000005

                          ushr r0.__zw, r0.yyyx, l1

                          dcl_literal l2, 0x00000005, 0x00000005, 0x00000000, 0x00000000

                          dcl_literal l3, 0x0000001B, 0x0000001B, 0x00000000, 0x00000000

                          ubit_insert r0.xy__, l2, l3, r0.xyxx, r0.zwzz

                     

                        GCN ISA (expected):

                          v_alignbit_b32 v2, v0, v1, 5

                          v_alignbit_b32 v3, v1, v0, 5

                     

                        GCN ISA (actual):

                          v_mov_b32     v3, v1

                          v_lshrrev_b64  v[3:4], 5, v[2:3]

                          v_lshrrev_b64  v[4:5], 5, v[1:2]

                     

                     

                    Scenario 2: 32-bit rotate

                      Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.

                     

                      C-pseudo-code:

                        uint32_t rotate_right32(uint32_t r0, uint8_t shift = 5)

                        {   

                           return r0 >> shift | r0 << (32-shift);

                        }

                     

                      HLSL-pseudo-code:

                        uint rotate_right32(uint r0, uint shift = 5)

                        {

                           return r0.x >> shift | r0.x << (32-shift);

                        }

                     

                      HLSL ShaderModel 4 and 5:

                        DXBC:

                          ushr r1.x, r0.x, l(5)

                          ishl r0.x, r0.x, l(27)

                          or r0.x, r0.x, r1.x     // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD

                     

                        amdil (expected):

                          dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          mov r0.y, r0.x

                          bitalign r0.x, r0.x, r0.y, l1

                     

                        amdil (actual):

                          dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          ushr r1.x___, r0.x, l1

                          dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

                          ishl r0.x___, r0.x, l2

                          ior r0.x___, r0.x, r1.x

                     

                        GCN ISA (expected):

                          v_mov_b32 v1, v0

                          v_alignbit_b32 v0, v0, v1, 5

                     

                        GCN ISA (actual):

                          v_lshrrev_b32  v4, 5, v1

                          v_lshlrev_b32  v5, 27, v1

                          v_or_b32      v4, v4, v5

                     

                       HLSL ShaderModel 5:

                        DXBC:

                          ushr r1.x, r0.x, l(5)

                          bfi r0.x, l(5), l(27), r0.x, r1.x

                     

                        amdil (expected):

                          dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          mov r0.y, r0.x

                          bitalign r0.x, r0.x, r0.y, l1

                     

                        amdil (actual):

                          dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          ushr r1.x___, r0.x, l1

                          dcl_literal l2, 0x00000005, 0x00000005, 0x00000005, 0x00000005

                          dcl_literal l3, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

                          ubit_insert r0.x___, l2, l3, r0.x, r1.x

                     

                        GCN ISA (expected):

                          v_mov_b32 v1, v0

                          v_alignbit_b32 v0, v0, v1, 5

                     

                        GCN ISA (actual):

                          v_mov_b32     v5, v6

                          v_lshrrev_b64  v[4:5], 5, v[5:6]