
OpenGL & Vulkan

gfmatt
Journeyman III

Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

I've been writing a number of compute shaders in DirectX/HLSL assembly (assembled into DirectX bytecode).  Many of these shaders perform 32bit rotates.  While studying their corresponding .isa files, I've noticed that I can generate 32bit rotates that use either two 32bit shifts and an or/xor (generated from one ISHL, one USHR, and one OR/XOR), or one 64bit shift (generated from one USHR and one BFI).  According to RGA/Instruction.cpp at master · GPUOpen-Tools/RGA · GitHub, v_alignbit_b32 appears to be superior to two shifts and an xor (4 cycles vs 12).  I'm not sure how v_lshlrev_b64 compares, as it is inexplicably missing from that file, but in the tests I've been running it doesn't seem to be as much of an improvement as I'd hoped.
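
For readers following along, the equivalence behind this question can be sketched in C. This is only a software model, assuming v_alignbit_b32 behaves as a 32bit funnel shift (concatenate two registers, shift right, keep the low 32 bits). The names alignbit32 and rotr32_shifts are hypothetical helpers for illustration, not part of any AMD tool.

```c
#include <stdint.h>

/* Assumed model of a 32-bit funnel shift: treat hi:lo as one 64-bit value,
   shift it right by the low 5 bits of n, and keep the low 32 bits. */
uint32_t alignbit32(uint32_t hi, uint32_t lo, uint32_t n)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (n & 31u));
}

/* Rotate right written the way the shaders do it: two shifts and an OR.
   Only valid for n in 1..31 (a shift by 32 is undefined in C). */
uint32_t rotr32_shifts(uint32_t x, uint32_t n)
{
    return (x >> n) | (x << (32u - n));
}
```

Under this model, feeding the same value into both inputs, as in alignbit32(x, x, n), gives the same result as rotr32_shifts(x, n) for n from 1 to 31, which is why a single instruction could replace the three-instruction sequence.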

With that in mind, is there any way to structure my DirectX/HLSL assembly so that it uses v_alignbit_b32 for 32bit rotates?  If not, is that likely to change in future updates to the driver?

I'm running a Radeon RX 580 with up to date drivers.

xhuang
Staff

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

Hello dipak, could you help contact the DX/compiler team?

dipak
Staff

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

I have already forwarded this query to the DX/HLSL compiler team. Once I get any feedback, I will post.

gfmatt
Journeyman III

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

I'd appreciate that.  As an addendum, could you also forward this note about 64bit rotates?

I believe that the most efficient way to perform a 64bit rotate is with one v_mov_b32 and two v_alignbit_b32 instructions.  However, I've found that when I write a 64bit rotate in HLSL assembly using two USHR and two BFI instructions, it compiles to two v_lshrrev_b64 and two v_mov_b32 instructions, which (assuming that v_lshrrev_b64 requires >= 4 cycles, a fair assumption considering v_lshrrev_b32 requires 4 cycles) must be inferior to the v_alignbit_b32 method.  If it would help, I can start putting together a Bitbucket repository for a more concrete bug report.
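
To make the two-alignbit idea concrete, here is a rough C sketch of a 64bit rotate built from two 32bit funnel shifts on the register halves. It assumes the same funnel-shift model of v_alignbit_b32 (concatenate, shift right, keep the low 32 bits). The helper names alignbit32 and rotr64_alignbit are hypothetical.

```c
#include <stdint.h>

/* Assumed model of a 32-bit funnel shift: (hi:lo) >> n, low 32 bits kept. */
uint32_t alignbit32(uint32_t hi, uint32_t lo, uint32_t n)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (n & 31u));
}

/* 64-bit rotate right by n (0..31), operating only on the 32-bit halves.
   Each output half is one funnel shift; the inputs just swap roles. */
uint64_t rotr64_alignbit(uint64_t x, uint32_t n)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo = alignbit32(hi, lo, n); /* low bits of hi slide into lo */
    uint32_t new_hi = alignbit32(lo, hi, n); /* low bits of lo wrap into hi  */
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

For rotate amounts of 32 or more, the same sketch would apply after swapping the two halves and reducing the amount by 32.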

optimiz3
Adept II

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

Thanks for posting this!  We've also run into this problem on all GCN devices from Pitcairn to Vega.

It's a huge pain point because a rotate implemented as a 64-bit shift costs far more than a bitalign, which is a 4-cycle (lowest-cost) op on GCN.

Repros on everything from DX10 ShaderModel 4 to DX12 ShaderModel 5 DXBC shaders.  Also, in many cases it's valid to reuse an SM4 DXBC shader across DX11/DX12 drivers to save binary space.  It would be wonderful if this could be fixed at all levels.

This would hugely help our customers; a large number have various generations of GCN hardware, and frankly NVIDIA's drivers do a lot better here (their equivalent instruction is called a funnel shift and has been around since Kepler).

Affected platforms:

Southern Islands

Sea Islands

Volcanic Islands

Arctic Islands

Tested scenarios:

DX11 w/ Shader Model 4.0 DXBC

DX11 w/ Shader Model 5.0 DXBC

DX12 w/ Shader Model 4.0 DXBC

DX12 w/ Shader Model 5.0 DXBC

Justification:

n-way 32bit shifts are dramatically slower when implemented as a 64-bit shift instead of V_ALIGNBIT.

Business cases:

BigInteger multiplication/division by power of 2 (chained shifts)

AES encryption/decryption (S-box lookup followed by 32-bit rotate)

64-bit integer shift/rotate (chained rotates)

SHA-256 32-bit rotate (ex: A >> 13 | A << 19)

Scenario 1: 64-bit rotate (should generalize to n*32-bit rotate)

  Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.

  C-pseudo-code:

    uint64_t rotate_right64(uint64_t r0, uint8_t shift = 5)

    {   

       return r0 >> shift | r0 << (64-shift);

    }

  HLSL-pseudo-code:

    uint2 rotate_right64(uint2 r0, uint shift = 5)

    {

       uint2 r1 = r0.xy >> shift;

       return r1.xy | r0.yx << (32-shift);   // each component is 32 bits wide, so the complementary shift is 32-shift (27 in the DXBC below)

    }

  HLSL ShaderModel 4 and 5:

    DXBC:

      ushr r0.zw, r0.yyyx, l(5)

      ishl r0.xy, r0.xyxx, l(27)

      iadd r0.xy, r0.xyxx, r0.zwzz // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD

 

    amdil (expected):

      dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      bitalign r0.z, r0.x, r0.y, l1

      bitalign r0.w, r0.y, r0.x, l1

    amdil (actual):

      dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      ushr r0.__zw, r0.yyyx, l1

      dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

      ishl r0.xy__, r0.xyxx, l2

      iadd r0.xy__, r0.xyxx, r0.zwzz

    GCN ISA (expected):

      v_alignbit_b32 v2, v0, v1, 5

      v_alignbit_b32 v3, v1, v0, 5

    GCN ISA (actual):

      v_lshrrev_b32  v3, 5, v2

      v_lshrrev_b32  v4, 5, v1

      v_lshlrev_b32  v1, 27, v1

      v_lshlrev_b32  v2, 27, v2

      v_add_u32     v1, vcc, v3, v1

      v_add_u32     v2, vcc, v4, v2

  HLSL ShaderModel 5:

    DXBC:

      ushr r0.zw, r0.yyyx, l(0, 0, 5, 5)

      bfi r0.xy, l(5, 5, 0, 0), l(27, 27, 0, 0), r0.xyxx, r0.zwzz

 

    amdil (expected):

      dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      bitalign r0.z, r0.x, r0.y, l1

      bitalign r0.w, r0.y, r0.x, l1

    amdil (actual):

      dcl_literal l1, 0x00000000, 0x00000000, 0x00000005, 0x00000005

      ushr r0.__zw, r0.yyyx, l1

      dcl_literal l2, 0x00000005, 0x00000005, 0x00000000, 0x00000000

      dcl_literal l3, 0x0000001B, 0x0000001B, 0x00000000, 0x00000000

      ubit_insert r0.xy__, l2, l3, r0.xyxx, r0.zwzz

    GCN ISA (expected):

      v_alignbit_b32 v2, v0, v1, 5

      v_alignbit_b32 v3, v1, v0, 5

    GCN ISA (actual):

      v_mov_b32     v3, v1

      v_lshrrev_b64  v[3:4], 5, v[2:3]

      v_lshrrev_b64  v[4:5], 5, v[1:2]

Scenario 2: 32-bit rotate

  Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.

  C-pseudo-code:

    uint32_t rotate_right32(uint32_t r0, uint8_t shift = 5)

    {   

       return r0 >> shift | r0 << (32-shift);

    }

  HLSL-pseudo-code:

    uint rotate_right32(uint r0, uint shift = 5)

    {

       return r0.x >> shift | r0.x << (32-shift);

    }

  HLSL ShaderModel 4 and 5:

    DXBC:

      ushr r1.x, r0.x, l(5)

      ishl r0.x, r0.x, l(27)

      or r0.x, r0.x, r1.x     // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD

 

    amdil (expected):

      dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      mov r0.y, r0.x

      bitalign r0.x, r0.x, r0.y, l1

    amdil (actual):

      dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      ushr r1.x___, r0.x, l1

      dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

      ishl r0.x___, r0.x, l2

      ior r0.x___, r0.x, r1.x

    GCN ISA (expected):

      v_mov_b32 v1, v0

      v_alignbit_b32 v0, v0, v1, 5

    GCN ISA (actual):

      v_lshrrev_b32  v4, 5, v1

      v_lshlrev_b32  v5, 27, v1

      v_or_b32      v4, v4, v5

  HLSL ShaderModel 5:

    DXBC:

      ushr r1.x, r0.x, l(5)

      bfi r0.x, l(5), l(27), r0.x, r1.x

 

    amdil (expected):

      dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      mov r0.y, r0.x

      bitalign r0.x, r0.x, r0.y, l1

    amdil (actual):

      dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      ushr r1.x___, r0.x, l1

      dcl_literal l2, 0x00000005, 0x00000005, 0x00000005, 0x00000005

      dcl_literal l3, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B

      ubit_insert r0.x___, l2, l3, r0.x, r1.x

    GCN ISA (expected):

      v_mov_b32 v1, v0

      v_alignbit_b32 v0, v0, v1, 5

    GCN ISA (actual):

      v_mov_b32     v5, v6

      v_lshrrev_b64  v[4:5], 5, v[5:6]
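
The NOTE in the DXBC listings above (that the pattern should be matched for OR, XOR, and ADD) rests on the two shifted values having no set bits in common when the shift amounts sum to 32. A small C sketch of that reasoning follows; combine_ops_agree is a hypothetical helper written for this illustration, not anything from the driver or FXC.

```c
#include <stdint.h>

/* Returns 1 when OR, XOR and ADD all give the same result for
   (a >> rshift) combined with (b << lshift); both shifts must be 0..31.
   When rshift + lshift == 32 the two operands occupy disjoint bit ranges,
   so no bit is set twice and ADD never carries. */
int combine_ops_agree(uint32_t a, uint32_t b, uint32_t rshift, uint32_t lshift)
{
    uint32_t right = a >> rshift;  /* top rshift bits are zero  */
    uint32_t left  = b << lshift;  /* low lshift bits are zero  */
    uint32_t via_or = right | left;
    return via_or == (right ^ left) && via_or == (right + left);
}
```

With shift amounts 5 and 27 the three operations agree for any inputs, whereas amounts such as 5 and 26 leave one overlapping bit position, so the operations can differ and the sequence is no longer a plain rotate.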

optimiz3
Adept II

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

Thanks dipak - I wrote up a more formal report of all the scenarios being affected below!

dipak
Staff

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

Thanks for providing all these details. I have already shared this complete thread with the compiler team. I'll get back to you once I have their reply. Meanwhile, if you have any other findings or suggestions, please share them.

Thanks.

dipak
Staff

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

Thank you for your patience.

As I understand it, the compiler team has already added a few optimizations to handle some cases of converting shifts and ORs to a faster alignbit. Hopefully these optimizations will be available in the upcoming driver (some may already be part of the latest one). They are also adding more optimizations, which may take a while to propagate to a released driver.

Thanks.

optimiz3
Adept II

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

Hi dipak, that's fantastic news - these optimizations should have a huge impact on datacenter cryptographic performance.  Thanks for your help in driving this!

gfmatt
Journeyman III

Re: Is it possible to force my driver to generate the v_alignbit_b32 from dxbc?

Hi Dipak, with the latest driver updates, I've found that I can now generate the v_alignbit_b32 instruction from a 32bit rotate, and my shaders which use it have benefited significantly.  Thanks for your communication and thanks to the development team for their work.

However, I haven't been able to generate the v_alignbit_b32 instruction for 64bit rotates.  I've found the following DXBC

(1)

     ushr r1.y, r0.x, l(19)

     ishl r1.x, r0.x, l(13)

     or r2.x, r1.x, r1.y

generates

     v_alignbit_b32  v5, v1, v1, 19

...but the following DXBC

(2)

     ushr r1.y, r0.x, l(19)

     ishl r1.x, r0.y, l(13)

     or r2.x, r1.x, r1.y

generates

     v_lshrrev_b32     v5, 19, v1

     v_lshlrev_b32     v6, 13, v2

     v_or_b32            v5, v5, v6

Note that the only difference between the code that generates the v_alignbit_b32 and the code that doesn't is whether or not the same register is being rshifted as lshifted.  I've tried other orderings as well as using iadds and xors instead of ors, but no luck there.  Are there any plans for a future driver update that might allow for a pattern like (2) to generate the v_alignbit_b32 instruction?  It would be enormously helpful to us here for 64bit rotates and the like.

On a somewhat related note, I've also found that I'm unable to generate the v_mul_u32_u24 instruction.  For instance

(3)

     and r1.x, r0.x, l(0x00ffffff)

     and r1.y, r0.y, l(0x00ffffff)

     umul r2.x, r2.y, r1.x, r1.y

generates

     v_and_b32 v5, 0x00ffffff, v1

     v_and_b32 v6, 0x00ffffff, v2

     v_mul_hi_u32 v7, v5, v6

     v_mul_lo_u32 v5, v5, v6

(4)

     and r1.x, r0.z, l(0x00ffffff)

     umul r2.x, r2.y, r1.x, l(3)

generates

     v_and_b32 v5, 0x00ffffff, v3

     v_mul_hi_u32 v6, v5, 3

     v_mul_lo_u32 v5, v5, 3

(5)

     and r1.x, r0.w, l(0x00ffffff)

     umul r2.x, r2.y, l(7), r1.x

generates

     v_and_b32 v5, 0x00ffffff, v4

     v_mul_hi_u32 v6, 7, v5

     v_mul_lo_u32 v5, 7, v5

In all of these cases, generating v_mul_u32_u24/v_mul_hi_u32_u24 would be preferable.  I would really appreciate it if you could bring this to the DX/HLSL compiler team.
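
As a rough illustration of why the substitution looks safe in cases (3) to (5), here is a C sketch. It assumes v_mul_u32_u24 returns the low 32 bits, and v_mul_hi_u32_u24 the bits above them, of the product of the low 24 bits of each operand; that reading of the instructions is my assumption and should be checked against the ISA manual. The function names are hypothetical models, not real intrinsics.

```c
#include <stdint.h>

#define MASK24 0x00FFFFFFu

/* What the DXBC asks for: mask both operands to 24 bits, then take the
   full 64-bit product (umul yields its high and low 32-bit halves). */
uint64_t masked_umul(uint32_t a, uint32_t b)
{
    return (uint64_t)(a & MASK24) * (uint64_t)(b & MASK24);
}

/* Assumed model of v_mul_u32_u24: low 32 bits of the 24x24-bit product. */
uint32_t mul_u32_u24(uint32_t a, uint32_t b)
{
    return (uint32_t)masked_umul(a, b);
}

/* Assumed model of v_mul_hi_u32_u24: bits above the low 32 of the product.
   The product fits in 48 bits, so only 16 bits can be nonzero. */
uint32_t mul_hi_u32_u24(uint32_t a, uint32_t b)
{
    return (uint32_t)(masked_umul(a, b) >> 32);
}
```

Because the masking is part of the assumed instruction semantics, the explicit v_and_b32 instructions would also become redundant under this model.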

Thank you for all your work, it's really making a difference for us.
