I've been writing a number of compute shaders in DirectX/HLSL assembly (assembled into DirectX bytecode). Many of these shaders perform 32-bit rotates. While studying their corresponding .isa files, I've noticed that I can generate 32-bit rotates that use either two 32-bit shifts and an or/xor (generated from one ISHL, one USHR, and one OR/XOR), or one 64-bit shift (generated from one USHR and one BFI). According to RGA/Instruction.cpp at master · GPUOpen-Tools/RGA · GitHub, it seems that v_alignbit_b32 would be superior to using two shifts and an xor (4 cycles vs 12). I'm not totally sure how v_lshlrev_b64 compares, as it is inexplicably missing from that file, but at least in the tests I've been running it doesn't seem to be as much of an improvement as I'd hoped.
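For reference, both encodings compute the equivalent of this minimal HLSL sketch of a 32-bit rotate (the function name is purely illustrative):
// Rotate a 32-bit value right by n bits, with 0 < n < 32. This is what both
// DXBC patterns above (shift/shift/or and ushr+bfi) ultimately compute.
uint rotr32(uint x, uint n)
{
    return (x >> n) | (x << (32u - n));
}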
With that in mind, is there any way to structure my DirectX/HLSL assembly so that it uses v_alignbit_b32 for 32bit rotates? If not, is that likely to change in future updates to the driver?
I'm running a Radeon RX 580 with up to date drivers.
Hello dipak, could you help contact the DX/compiler team?
I have already forwarded this query to the DX/HLSL compiler team. Once I get any feedback, I will post.
I'd appreciate that. As an addendum, could you also forward this note about 64bit rotates?
I believe that the most efficient way to perform a 64-bit rotate is with one v_mov_b32 and two v_alignbit_b32 instructions. However, I've found that when I write a 64-bit rotate using (in HLSL assembly) two USHR and two BFI instructions, it compiles to two v_lshrrev_b64 and two v_mov_b32 instructions, which (assuming that v_lshrrev_b64 requires >= 4 cycles, a fair assumption considering v_lshrrev_b32 requires 4 cycles) must be inferior to the v_alignbit_b32 method. If it would help, I can start putting together a Bitbucket repository for a more concrete bug report.
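For concreteness, the 64-bit rotate being discussed is the equivalent of this HLSL sketch, with the 64-bit value packed into a uint2 (x = low word, y = high word); the names and packing convention are just for illustration:
// Rotate a 64-bit value, packed as uint2 (x = low 32 bits, y = high 32 bits),
// right by n bits, with 0 < n < 32. Each output word is a candidate for one
// v_alignbit_b32.
uint2 rotr64(uint2 v, uint n)
{
    uint2 r;
    r.x = (v.x >> n) | (v.y << (32u - n));
    r.y = (v.y >> n) | (v.x << (32u - n));
    return r;
}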
Thanks dipak - I wrote up a more formal report of all the scenarios being affected below!
Hi Dipak, with the latest driver updates, I've found that I can now generate the v_alignbit_b32 instruction from a 32-bit rotate, and my shaders which use it have benefited significantly. Thanks for your communication, and thanks to the development team for their work.
However, I haven't been able to generate the v_alignbit_b32 instruction for 64bit rotates. I've found the following DXBC
(1)
ushr r1.y, r0.x, l(19)
ishl r1.x, r0.x, l(13)
or r2.x, r1.x, r1.y
generates
v_alignbit_b32 v5, v1, v1, 19
...but the following DXBC
(2)
ushr r1.y, r0.x, l(19)
ishl r1.x, r0.y, l(13)
or r2.x, r1.x, r1.y
generates
v_lshrrev_b32 v5, 19, v1
v_lshlrev_b32 v6, 13, v2
v_or_b32 v5, v5, v6
Note that the only difference between the code that generates the v_alignbit_b32 and the code that doesn't is whether or not the same register is being right-shifted as is being left-shifted. I've tried other orderings, as well as using iadds and xors instead of ors, but no luck there. Are there any plans for a future driver update that might allow a pattern like (2) to generate the v_alignbit_b32 instruction? It would be enormously helpful to us here for 64-bit rotates and the like.
On a somewhat related note, I've also found that I'm unable to generate the v_mul_u32_u24 instruction. For instance
(3)
and r1.x, r0.x, l(0x00ffffff)
and r1.y, r0.y, l(0x00ffffff)
umul r2.x, r2.y, r1.x, r1.y
generates
v_and_b32 v5, 0x00ffffff, v1
v_and_b32 v6, 0x00ffffff, v2
v_mul_hi_u32 v7, v5, v6
v_mul_lo_u32 v5, v5, v6
(4)
and r1.x, r0.z, l(0x00ffffff)
umul r2.x, r2.y, r1.x, l(3)
generates
v_and_b32 v5, 0x00ffffff, v3
v_mul_hi_u32 v6, v5, 3
v_mul_lo_u32 v5, v5, 3
(5)
and r1.x, r0.w, l(0x00ffffff)
umul r2.x, r2.y, l(7), r1.x
generates
v_and_b32 v5, 0x00ffffff, v4
v_mul_hi_u32 v6, 7, v5
v_mul_lo_u32 v5, 7, v5
In all of these cases, generating v_mul_u32_u24/v_mul_hi_u32_u24 would be preferable. I would really appreciate it if you could bring this to the DX/HLSL compiler team.
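For reference, cases (3)-(5) are all instances of a masked 24-bit multiply along the lines of this HLSL sketch (names are illustrative; the high half of the product corresponds to the umul high destination in the DXBC above):
// Multiply two values known to fit in 24 bits. The explicit masks are what
// should let the compiler select v_mul_u32_u24 (and v_mul_hi_u32_u24 when the
// high part of the product is needed) instead of a full 32x32 multiply.
uint mul24_lo(uint a, uint b)
{
    return (a & 0x00ffffffu) * (b & 0x00ffffffu);
}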
Thank you for all your work, it's really making a difference for us.
Hi Matthew,
Thanks for your suggestions and sharing the above case details. We really appreciate your valuable inputs and feedback. I'll surely forward these details to the compiler team.
Regarding the 64-bit rotate, the compiler team previously mentioned that it would take some time to implement because there might be more work needed to optimize the 64-bit rotate cases. So it is expected that you don't see the v_alignbit_b32 instruction for 64-bit rotates with the latest public driver.
As far as I know, the compiler team has already added a few more optimization cases. However, it may take a while for them to propagate to a released driver.
Anyway, if I get any information or updates about this, I will share them with you.
Thanks once again.
Just got an update from the compiler team that they have added a few optimizations to generate the v_mul_u32_u24 instruction for the above cases. We really appreciate your suggestions and feedback regarding this.
Thanks.
BTW - dipak, that v_mul_u32_u24 optimization is fantastic. It means a 4x speedup for all kinds of advanced math like Montgomery multiplication, and complementary operations like the Chinese Remainder Theorem.
Do you know if they were also able to optimize the fused multiply-adds? I.e. there are the V_MAD_U32_U24 and V_MAD_I32_I24 operations, which could similarly be used for a speedup - one would expect the following:
and r1.x, r0.x, l(0x00ffffff)
and r1.y, r0.y, l(0x00ffffff)
umul r2.x, r2.y, r1.x, r1.y
iadd r2.y, r2.y, r1.z
To produce:
v_mul_hi_u32_u24 v7, v5, v4
v_mad_u32_u24 v6, v5, v4, v3
Where the v_mul_hi_u32_u24 is only emitted if the upper half of the result is used.
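In source terms, that's the 24-bit multiply-accumulate of this rough HLSL sketch (names are illustrative):
// With a and b masked to 24 bits, the low 32 bits of a*b + c should map to a
// single v_mad_u32_u24; v_mul_hi_u32_u24 is only needed if the high half of
// the product is also consumed.
uint mad24(uint a, uint b, uint c)
{
    return (a & 0x00ffffffu) * (b & 0x00ffffffu) + c;
}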
Thank you for sharing the above detail. I'll check with the compiler team about this case.
Hi optimiz3,
As I've been informed, the compiler team expects that the above optimization should work now; however, they haven't verified it.
Sorry for this delayed reply.
Thanks.
Thanks dipak, do you know if there would be any benefit if the driver were to special-case v_alignbyte in situations where v_alignbit could be used but the shift amount is a multiple of 8? AFAIK both instructions have the same timings, but we were wondering if v_alignbyte would result in better thermals/power efficiency in situations where you'd be rotating by 8, 16, or 24 bits.
Sorry, I don't know many details about these optimizations, so I can't say whether there would be any benefit for the above special case.
If you have any suggestions or thoughts, please feel free to share them. I will forward those to the compiler team.
On a side note, it would be really helpful for us (mainly for the compiler team) if a consolidated case report (like the one you provided earlier) were shared, rather than individual cases one by one.
Thanks.
Thanks for posting this! We've run into this problem too, on all GCN devices from Pitcairn to Vega.
It's a huge pain point because a rotate implemented as a 64-bit shift costs far more than a bitalign, which is a 4-cycle (lowest-cost) op on GCN.
It reproduces on everything from DX10 Shader Model 4 to DX12 Shader Model 5 DXBC shaders. Also, in many cases it's valid to reuse an SM4 DXBC shader across DX11/DX12 drivers to save binary space. It would be wonderful if this could be fixed at all levels.
This would hugely help our customers, a large number of whom have various generations of GCN hardware, and frankly NVIDIA's drivers do a lot better here (their equivalent instruction is called a funnel shift and has been around since Kepler).
Affected platforms:
Southern Islands
Sea Islands
Volcanic Islands
Arctic Islands
Tested scenarios:
DX11 w/ Shader Model 4.0 DXBC
DX11 w/ Shader Model 5.0 DXBC
DX12 w/ Shader Model 4.0 DXBC
DX12 w/ Shader Model 5.0 DXBC
Justification:
n-way 32bit shifts are dramatically slower when implemented as a 64-bit shift instead of V_ALIGNBIT.
Business cases:
BigInteger multiplication/division by power of 2 (chained shifts)
AES encryption/decryption (S-box lookup followed by 32-bit rotate)
64-bit integer shift/rotate (chained rotates)
SHA-256 32-bit rotate (ex: A >> 13 | A << 19)
Scenario 1: 64-bit rotate (should generalize to n*32-bit rotate)
Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.
C-pseudo-code:
uint64_t rotate_right64(uint64_t r0, uint8_t shift = 5)
{
return r0 >> shift | r0 << (64-shift);
}
HLSL-pseudo-code:
uint2 rotate_right64(uint2 r0, uint shift = 5)
{
uint2 r1 = r0.xy >> shift;
return r1.xy | r0.yx << (32-shift);
}
HLSL ShaderModel 4 and 5:
DXBC:
ushr r0.zw, r0.yyyx, l(5)
ishl r0.xy, r0.xyxx, l(27)
iadd r0.xy, r0.xyxx, r0.zwzz // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD
amdil (expected):
dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005
bitalign r0.z, r0.x, r0.y, l1
bitalign r0.w, r0.y, r0.x, l1
amdil (actual):
dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005
ushr r0.__zw, r0.yyyx, l1
dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B
ishl r0.xy__, r0.xyxx, l2
iadd r0.xy__, r0.xyxx, r0.zwzz
GCN ISA (expected):
v_alignbit_b32 v2, v0, v1, 5
v_alignbit_b32 v3, v1, v0, 5
GCN ISA (actual):
v_lshrrev_b32 v3, 5, v2
v_lshrrev_b32 v4, 5, v1
v_lshlrev_b32 v1, 27, v1
v_lshlrev_b32 v2, 27, v2
v_add_u32 v1, vcc, v3, v1
v_add_u32 v2, vcc, v4, v2
HLSL ShaderModel 5:
DXBC:
ushr r0.zw, r0.yyyx, l(0, 0, 5, 5)
bfi r0.xy, l(5, 5, 0, 0), l(27, 27, 0, 0), r0.xyxx, r0.zwzz
amdil (expected):
dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005
bitalign r0.z, r0.x, r0.y, l1
bitalign r0.w, r0.y, r0.x, l1
amdil (actual):
dcl_literal l1, 0x00000000, 0x00000000, 0x00000005, 0x00000005
ushr r0.__zw, r0.yyyx, l1
dcl_literal l2, 0x00000005, 0x00000005, 0x00000000, 0x00000000
dcl_literal l3, 0x0000001B, 0x0000001B, 0x00000000, 0x00000000
ubit_insert r0.xy__, l2, l3, r0.xyxx, r0.zwzz
GCN ISA (expected):
v_alignbit_b32 v2, v0, v1, 5
v_alignbit_b32 v3, v1, v0, 5
GCN ISA (actual):
v_mov_b32 v3, v1
v_lshrrev_b64 v[3:4], 5, v[2:3]
v_lshrrev_b64 v[4:5], 5, v[1:2]
Scenario 2: 32-bit rotate
Note: The AMD Driver only has to deal with the underlying 32-bit ops, per the DXBC below.
C-pseudo-code:
uint32_t rotate_right32(uint32_t r0, uint8_t shift = 5)
{
return r0 >> shift | r0 << (32-shift);
}
HLSL-pseudo-code:
uint rotate_right32(uint r0, uint shift = 5)
{
return r0.x >> shift | r0.x << (32-shift);
}
HLSL ShaderModel 4 and 5:
DXBC:
ushr r1.x, r0.x, l(5)
ishl r0.x, r0.x, l(27)
or r0.x, r0.x, r1.x // NOTE this should work for OR, XOR, and ADD; Microsoft's FXC compiler sometimes will substitute an OR with an ADD
amdil (expected):
dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005
mov r0.y, r0.x
bitalign r0.x, r0.x, r0.y, l1
amdil (actual):
dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005
ushr r1.x___, r0.x, l1
dcl_literal l2, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B
ishl r0.x___, r0.x, l2
ior r0.x___, r0.x, r1.x
GCN ISA (expected):
v_mov_b32 v1, v0
v_alignbit_b32 v0, v0, v1, 5
GCN ISA (actual):
v_lshrrev_b32 v4, 5, v1
v_lshlrev_b32 v5, 27, v1
v_or_b32 v4, v4, v5
HLSL ShaderModel 5:
DXBC:
ushr r1.x, r0.x, l(5)
bfi r0.x, l(5), l(27), r0.x, r1.x
amdil (expected):
dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005
mov r0.y, r0.x
bitalign r0.x, r0.x, r0.y, l1
amdil (actual):
dcl_literal l1, 0x00000005, 0x00000005, 0x00000005, 0x00000005
ushr r1.x___, r0.x, l1
dcl_literal l2, 0x00000005, 0x00000005, 0x00000005, 0x00000005
dcl_literal l3, 0x0000001B, 0x0000001B, 0x0000001B, 0x0000001B
ubit_insert r0.x___, l2, l3, r0.x, r1.x
GCN ISA (expected):
v_mov_b32 v1, v0
v_alignbit_b32 v0, v0, v1, 5
GCN ISA (actual):
v_mov_b32 v5, v6
v_lshrrev_b64 v[4:5], 5, v[5:6]
Thanks for providing all these details. I have already shared this complete discussion (thread) with the compiler team. I'll get back to you once I have their reply. Meanwhile, if you have any other findings or suggestions, please share them.
Thanks.
Thank you for your patience.
As I've come to know, the compiler team has already added a few optimizations to handle some cases of converting shifts and ORs into the faster alignbit. Hopefully these optimizations will be available with the upcoming driver (some may already be part of the latest one). They are also adding some more optimizations, which may take a while to propagate to a released driver.
Thanks.
Hi dipak, that's fantastic news - these optimizations should have a huge impact on datacenter cryptographic performance. Thanks for your help in driving this!
Hi dipak, in reviewing the ISA, there's also the v_alignbyte operation. AFAICT, it's identical to v_alignbit if one were to use a shift/rotate that is a multiple of 8. What I'm wondering is: does v_alignbyte confer any power savings over v_alignbit? If so, it might be a candidate optimization to use v_alignbyte over v_alignbit in any scenario where the shift/rotate amount is evenly divisible by 8.
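For example, the sort of case we have in mind is a rotate by a whole number of bytes, as in this HLSL sketch (illustrative only):
// Rotate right by 8, 16, or 24 bits (bytes in [1,3]). Either v_alignbit_b32
// with a bit count of 8/16/24 or v_alignbyte_b32 with a byte count of 1/2/3
// could implement this; the question is whether the byte variant saves power.
uint rotr32_bytes(uint x, uint bytes)
{
    uint n = bytes * 8u;
    return (x >> n) | (x << (32u - n));
}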