
ekondis
Adept II

Maximum DP floating point throughput without -cl-mad-enable option

Hello,

I'm doing some tests with a kernel that makes intensive use of double-precision multiply-add operations on a GCN GPU (an R9 380X). The operations are not translated into multiply-add instructions but into separate multiplication and addition instructions. When the kernel is built with the -cl-mad-enable option, the generated instructions are multiply-adds, as intended in the first place. Why doesn't the compiler use multiply-add instructions without this option? Isn't the multiply-add instruction compliant with the IEEE-754 standard?
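
For context, the kernel body is essentially a long chain of operations of this shape (a simplified sketch, not the actual kernel; names are illustrative):

  __kernel void dp_mad_test(__global double *data, const double c, const int iters) {
      double x = data[get_global_id(0)];
      for (int i = 0; i < iters; i++)
          x = x * x + c;   /* candidate for a fused multiply-add */
      data[get_global_id(0)] = x;
  }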

Thanks.

3 Replies
rampitec
Staff

The FMA instruction is IEEE compliant. Even though -cl-mad-enable does not mandate that the compiler use fma, in practice it should do so when it sees an opportunity and the option is given.

Could you please share the driver version you are using (clinfo command output) and the full list of options supplied to the compiler?

Please also check that the software does not cache the compiled binary and that recompilation actually happens with the new option supplied.

Also, how did you verify that no fma instructions were generated?

Please note that if you want to guarantee an fma instruction for a particular operation, rather than allowing the compiler to use one when it chooses to, you can use the OpenCL fma() built-in function.
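
For example (a minimal sketch):

  // Left to the compiler: may become separate v_mul_f64 + v_add_f64, or a fused multiply-add
  x = x * x + c;

  // Explicit: the OpenCL fma() built-in computes x*x + c with a single rounding, regardless of build options
  x = fma(x, x, c);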


When I use "-cl-mad-enable", FMA instructions are generated as expected. The problem is when I omit this option: in that case only separate add and multiply instructions are generated and performance drops by ~33% (the kernel is very compute-intensive). More specifically, here is part of the ISA code showing the types of instructions used, captured with the "-save-temps" option:

By using "-cl-mad-enable":

...
  v_fma_f64     v[16:17], v[16:17], v[16:17], s[0:1]    // 000000000320: D1CC0010 00022110
  v_fma_f64     v[2:3], v[2:3], v[2:3], s[0:1]          // 000000000328: D1CC0002 00020502
  v_fma_f64     v[12:13], v[12:13], v[12:13], s[0:1]    // 000000000330: D1CC000C 0002190C
  v_fma_f64     v[4:5], v[4:5], v[4:5], s[0:1]          // 000000000338: D1CC0004 00020904
  v_fma_f64     v[10:11], v[10:11], v[10:11], s[0:1]    // 000000000340: D1CC000A 0002150A
  v_fma_f64     v[8:9], v[8:9], v[8:9], s[0:1]          // 000000000348: D1CC0008 00021108
  v_fma_f64     v[14:15], v[14:15], v[14:15], s[0:1]    // 000000000350: D1CC000E 00021D0E
  v_fma_f64     v[0:1], v[0:1], v[0:1], s[0:1]          // 000000000358: D1CC0000 00020100
  v_fma_f64     v[16:17], v[16:17], v[16:17], s[0:1]    // 000000000360: D1CC0010 00022110
  v_fma_f64     v[2:3], v[2:3], v[2:3], s[0:1]          // 000000000368: D1CC0002 00020502
  v_fma_f64     v[12:13], v[12:13], v[12:13], s[0:1]    // 000000000370: D1CC000C 0002190C
  v_fma_f64     v[4:5], v[4:5], v[4:5], s[0:1]          // 000000000378: D1CC0004 00020904
  v_fma_f64     v[10:11], v[10:11], v[10:11], s[0:1]    // 000000000380: D1CC000A 0002150A
  v_fma_f64     v[8:9], v[8:9], v[8:9], s[0:1]          // 000000000388: D1CC0008 00021108
...

Without using "-cl-mad-enable":

...
  v_mul_f64     v[10:11], v[10:11], v[10:11]            // 000000000280: D281000A 0002150A
  v_mul_f64     v[8:9], v[8:9], v[8:9]                  // 000000000288: D2810008 00021108
  v_add_f64     v[14:15], v[15:16], s[0:1]              // 000000000290: D280000E 0000010F
  v_add_f64     v[0:1], v[0:1], s[0:1]                  // 000000000298: D2800000 00000100
  v_add_f64     v[16:17], v[17:18], s[0:1]              // 0000000002A0: D2800010 00000111
  v_add_f64     v[2:3], v[2:3], s[0:1]                  // 0000000002A8: D2800002 00000102
  v_mul_f64     v[12:13], v[12:13], v[12:13]            // 0000000002B0: D281000C 0002190C
  v_mul_f64     v[4:5], v[4:5], v[4:5]                  // 0000000002B8: D2810004 00020904
  v_add_f64     v[10:11], v[10:11], s[0:1]              // 0000000002C0: D280000A 0000010A
  v_add_f64     v[8:9], v[8:9], s[0:1]                  // 0000000002C8: D2800008 00000108
  v_mul_f64     v[14:15], v[14:15], v[14:15]            // 0000000002D0: D281000E 00021D0E
  v_mul_f64     v[0:1], v[0:1], v[0:1]                  // 0000000002D8: D2810000 00020100
  v_mul_f64     v[16:17], v[16:17], v[16:17]            // 0000000002E0: D2810010 00022110
  v_mul_f64     v[2:3], v[2:3], v[2:3]                  // 0000000002E8: D2810002 00020502
...

The only other compiler option supplied was "-cl-std=CL1.1".
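
The build call on the host side is roughly the following (a sketch; "program" and "device" stand for the usual cl_program and cl_device_id handles):

  /* without -cl-mad-enable: separate v_mul_f64 / v_add_f64 are emitted */
  clBuildProgram(program, 1, &device, "-cl-std=CL1.1 -save-temps", NULL, NULL);

  /* with -cl-mad-enable: v_fma_f64 is emitted */
  clBuildProgram(program, 1, &device, "-cl-std=CL1.1 -cl-mad-enable -save-temps", NULL, NULL);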

Driver version used: 1912.5 (VM)

Platform: 64bit Linux (Ubuntu 14.04)

In contrast, when I run the same kernel on an NVIDIA GPU, FMA instructions are produced in both cases.


Oh, I see what you mean. There are two kinds of fused multiply-add operations: fma and mad/mac. Mad/mac are only available as f32 and yield exactly the same result as a separate multiply and add. Fma, on the other hand, can be f32 or f64 and has higher accuracy than the separate operations, so if you use fma you may not get exactly the same result as when you do not. On some parts fma is also slower. Therefore mad/mac are generated by the compiler automatically for single precision, but fma for double precision only upon request.
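
For example, a quick host-side C check illustrates the accuracy difference (values chosen so that the exact product needs more than 53 significand bits; compile with -lm):

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      double x = 1.0 + ldexp(1.0, -27);   /* 1 + 2^-27, exactly representable        */
      double separate = x * x - 1.0;      /* product rounded first, result is 2^-26  */
      double fused    = fma(x, x, -1.0);  /* single rounding, result is 2^-26 + 2^-54 */
      printf("separate: %a\nfused:    %a\n", separate, fused);
      return 0;
  }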
