Is there a way to get the compiler to emit dmad instructions without calling mad() or fma()? I looked at the fma macro and it does the following:
    mdef(358)_out(1)_in(3)
    mov r0, in0
    mov r1, in1
    mov r2, in2
    dmad r0.xy__, r0.xy, r1.xy, r2.xy
    mov out0, r0
    mend
It seems the 4 mov instructions are extraneous and the in0, in1, in2, and out0 registers can be directly fed into the dmad instruction.
The code above comes from the program binary, which I retrieved with clGetProgramInfo and wrote to a file. Is there a way to view the device-specific assembly code instead? I suspect most of the movs are being optimized away in the final code, since I'm getting over 50% of double-precision peak in a DGEMM using fma().