Good morning. During a shader optimization pass, I noticed that the AMD shader compiler was not generating FMA instructions.
The case was really simple, given this source code:
myValue.xy = (anotherValue.xy + 1.0f) * 0.5f;
I was not able to get an FMA instruction generated unless myValue (which is a float4) was fully initialized, e.g. by adding myValue.zw = 1.0f; beforehand.
I have written a blog post about it at length, with all the different disassemblies.
What I would like to know is whether this is a compiler bug, or whether there is a specific reason behind that behavior that I might not be aware of.
PS: Is there a dedicated HLSL/DirectX forum, or is this the closest match?
Hello, thanks for your report; we will investigate this.
Hello dipak, do you know where DX/D3D topics should be discussed? This thread concerns neither OpenGL nor Vulkan.
Currently, we don't have a dedicated forum for DX/D3D topics. I guess someone from the Direct3D compiler team can help with this; I'll check.
Thank you for getting back to us; please let me know if you manage to contact anyone from the D3D compiler team. I managed to get hold of the Microsoft engineers working on the new DX12 compiler for Shader Model 6. That compiler flat out refused to generate FMA in any case; they replied that, as of now, it only supports FMA for doubles, with float support coming later.
After checking with the compiler team, it looks like neither the HLSL compiler nor our shader compiler has the optimization to transform an ADD/MUL pair, each with one literal operand, into a MAD/FMA. So it is expected that no MAD/FMA instruction is generated for the code below:
myValue.xy = (anotherValue.xy + 1.0f) * 0.5f;
In the general case, when the operands aren't literals, only MUL/ADD can be transformed into a MAD, not ADD/MUL.
Based on this observation, however, they are planning to add special optimizations to our compiler to handle cases like this.
Now, coming to FMA vs. MAD: on GFX6-GFX8 devices, FMA is a quarter-rate instruction, so it is not used as an optimization. On newer targets, FMA may be generated instead of MAD; it is just as fast, has higher precision for its intermediate result, and supports the hardware denorm mode (MAD always flushes denorms).