Hello,
I'm running some tests with a kernel that makes intensive use of double-precision multiply-add operations on a GCN GPU (an R9 380X). The operations are not translated into multiply-add instructions but into separate multiplication and addition instructions. When the kernel is built with the -cl-mad-enable option, the generated instructions are multiply-adds, as intended in the first place. Why doesn't the compiler use multiply-add instructions without that option? Isn't the multiply-add instruction compliant with the IEEE 754 standard?
Thanks.
Oh, I see what you mean. There are two kinds of multiply-add operations: fma and mad/mac. Mad/mac is only available as f32 and yields exactly the same result as a separate multiply and add. Fma, in turn, is available as both f32 and f64, and has higher accuracy than the separate operations because it rounds only once. If you use fma you may not get exactly the same result as when you do not. On some parts fma is also slower. Therefore mad/mac instructions are generated by the compiler automatically for single precision, but fma is used for double precision only upon request.
The FMA instruction is IEEE 754 compliant. Even though -cl-mad-enable does not mandate that the compiler use fma, in practice it should do so whenever it sees an opportunity, given the option.
Could you please share the driver version you are using (clinfo command output) and the full list of options supplied to the compiler?
Please also check that the software does not cache the compiled binary, and that recompilation actually happens with the new option supplied.
Also, how did you verify that no fma instructions are generated?
Please note that if you want to guarantee an fma instruction for a particular operation, rather than allowing the compiler to use one when it chooses to, you can use the OpenCL fma() builtin function.
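For reference, forcing the fused operation per call site looks like this in kernel code (a hypothetical kernel, not the one from this thread; double-precision support via cl_khr_fp64 is assumed):

```c
/* Hypothetical example kernel: fma() is guaranteed to be fused regardless
   of -cl-mad-enable, while a * a + c may or may not be contracted. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void square_accumulate(__global double *data, double c) {
    size_t i = get_global_id(0);
    double a = data[i];
    /* Explicit fused multiply-add: a * a + c with a single rounding. */
    data[i] = fma(a, a, c);
}
```

The builtin gives per-operation control, whereas -cl-mad-enable is a whole-program hint the compiler may apply wherever it finds a multiply followed by an add.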
When I use "-cl-mad-enable", FMA instructions are generated as expected. The problem is when I omit this option: in that case only separate add and multiply instructions are generated, and performance drops by ~33% (it is a very compute-intensive kernel). More specifically, here is a part of the ISA code, captured with the "-save-temps" option, showing the types of instructions used:
By using "-cl-mad-enable":
...
v_fma_f64 v[16:17], v[16:17], v[16:17], s[0:1] // 000000000320: D1CC0010 00022110
v_fma_f64 v[2:3], v[2:3], v[2:3], s[0:1] // 000000000328: D1CC0002 00020502
v_fma_f64 v[12:13], v[12:13], v[12:13], s[0:1] // 000000000330: D1CC000C 0002190C
v_fma_f64 v[4:5], v[4:5], v[4:5], s[0:1] // 000000000338: D1CC0004 00020904
v_fma_f64 v[10:11], v[10:11], v[10:11], s[0:1] // 000000000340: D1CC000A 0002150A
v_fma_f64 v[8:9], v[8:9], v[8:9], s[0:1] // 000000000348: D1CC0008 00021108
v_fma_f64 v[14:15], v[14:15], v[14:15], s[0:1] // 000000000350: D1CC000E 00021D0E
v_fma_f64 v[0:1], v[0:1], v[0:1], s[0:1] // 000000000358: D1CC0000 00020100
v_fma_f64 v[16:17], v[16:17], v[16:17], s[0:1] // 000000000360: D1CC0010 00022110
v_fma_f64 v[2:3], v[2:3], v[2:3], s[0:1] // 000000000368: D1CC0002 00020502
v_fma_f64 v[12:13], v[12:13], v[12:13], s[0:1] // 000000000370: D1CC000C 0002190C
v_fma_f64 v[4:5], v[4:5], v[4:5], s[0:1] // 000000000378: D1CC0004 00020904
v_fma_f64 v[10:11], v[10:11], v[10:11], s[0:1] // 000000000380: D1CC000A 0002150A
v_fma_f64 v[8:9], v[8:9], v[8:9], s[0:1] // 000000000388: D1CC0008 00021108
...
Without using "-cl-mad-enable":
...
v_mul_f64 v[10:11], v[10:11], v[10:11] // 000000000280: D281000A 0002150A
v_mul_f64 v[8:9], v[8:9], v[8:9] // 000000000288: D2810008 00021108
v_add_f64 v[14:15], v[15:16], s[0:1] // 000000000290: D280000E 0000010F
v_add_f64 v[0:1], v[0:1], s[0:1] // 000000000298: D2800000 00000100
v_add_f64 v[16:17], v[17:18], s[0:1] // 0000000002A0: D2800010 00000111
v_add_f64 v[2:3], v[2:3], s[0:1] // 0000000002A8: D2800002 00000102
v_mul_f64 v[12:13], v[12:13], v[12:13] // 0000000002B0: D281000C 0002190C
v_mul_f64 v[4:5], v[4:5], v[4:5] // 0000000002B8: D2810004 00020904
v_add_f64 v[10:11], v[10:11], s[0:1] // 0000000002C0: D280000A 0000010A
v_add_f64 v[8:9], v[8:9], s[0:1] // 0000000002C8: D2800008 00000108
v_mul_f64 v[14:15], v[14:15], v[14:15] // 0000000002D0: D281000E 00021D0E
v_mul_f64 v[0:1], v[0:1], v[0:1] // 0000000002D8: D2810000 00020100
v_mul_f64 v[16:17], v[16:17], v[16:17] // 0000000002E0: D2810010 00022110
v_mul_f64 v[2:3], v[2:3], v[2:3] // 0000000002E8: D2810002 00020502
...
The only extra compilation option used was "-cl-std=CL1.1".
Driver version used: 1912.5 (VM)
Platform: 64bit Linux (Ubuntu 14.04)
In contrast, when I run the same kernel on an NVIDIA GPU, FMA instructions are produced in both cases.