3 Replies Latest reply on Mar 19, 2016 12:21 PM by rampitec

    Maximum DP floating point throughput without -cl-mad-enable option

    ekondis

      Hello,

      I'm running some tests with a kernel that makes intensive use of double-precision multiply-add operations on a GCN GPU (an R9 380X). The operations are not translated to multiply-add instructions but to separate multiplication and addition instructions. When the kernel is built with the -cl-mad-enable option, the generated instructions are multiply-adds, as intended in the first place. Why doesn't the compiler use multiply-add instructions without this option? Isn't the multiply-add instruction compliant with the IEEE-754 standard?

      Thanks.

        • Re: Maximum DP floating point throughput without -cl-mad-enable option
          rampitec

          The FMA instruction is IEEE compliant. Although -cl-mad-enable does not require the compiler to use fma, in practice it should do so wherever it sees an opportunity when the option is supplied.

          Could you please share the driver version you are using (clinfo command output) and the full list of options supplied to the compiler?

          Please also check that the software does not cache the compiled binary, and that recompilation actually happens with the new option supplied.

          Also, how did you check that no fma instructions were generated?

          Please note: if you want to guarantee that an fma instruction is used for a particular operation, rather than leaving the choice to the compiler, you can use the OpenCL fma() built-in function.

            • Re: Maximum DP floating point throughput without -cl-mad-enable option
              ekondis

              When I use "-cl-mad-enable", FMA instructions are generated as expected. The problem arises when I omit this option. In that case only separate add and mul instructions are generated, and performance drops by ~33% (the kernel is very compute intensive). Below is part of the ISA code, captured with the "-save-temps" option, showing the types of instructions used:

              By using "-cl-mad-enable":

              ...
                v_fma_f64     v[16:17], v[16:17], v[16:17], s[0:1]    // 000000000320: D1CC0010 00022110
                v_fma_f64     v[2:3], v[2:3], v[2:3], s[0:1]          // 000000000328: D1CC0002 00020502
                v_fma_f64     v[12:13], v[12:13], v[12:13], s[0:1]    // 000000000330: D1CC000C 0002190C
                v_fma_f64     v[4:5], v[4:5], v[4:5], s[0:1]          // 000000000338: D1CC0004 00020904
                v_fma_f64     v[10:11], v[10:11], v[10:11], s[0:1]    // 000000000340: D1CC000A 0002150A
                v_fma_f64     v[8:9], v[8:9], v[8:9], s[0:1]          // 000000000348: D1CC0008 00021108
                v_fma_f64     v[14:15], v[14:15], v[14:15], s[0:1]    // 000000000350: D1CC000E 00021D0E
                v_fma_f64     v[0:1], v[0:1], v[0:1], s[0:1]          // 000000000358: D1CC0000 00020100
                v_fma_f64     v[16:17], v[16:17], v[16:17], s[0:1]    // 000000000360: D1CC0010 00022110
                v_fma_f64     v[2:3], v[2:3], v[2:3], s[0:1]          // 000000000368: D1CC0002 00020502
                v_fma_f64     v[12:13], v[12:13], v[12:13], s[0:1]    // 000000000370: D1CC000C 0002190C
                v_fma_f64     v[4:5], v[4:5], v[4:5], s[0:1]          // 000000000378: D1CC0004 00020904
                v_fma_f64     v[10:11], v[10:11], v[10:11], s[0:1]    // 000000000380: D1CC000A 0002150A
                v_fma_f64     v[8:9], v[8:9], v[8:9], s[0:1]          // 000000000388: D1CC0008 00021108
              ...
              

              Without using "-cl-mad-enable":

              ...
                v_mul_f64     v[10:11], v[10:11], v[10:11]            // 000000000280: D281000A 0002150A
                v_mul_f64     v[8:9], v[8:9], v[8:9]                  // 000000000288: D2810008 00021108
                v_add_f64     v[14:15], v[15:16], s[0:1]              // 000000000290: D280000E 0000010F
                v_add_f64     v[0:1], v[0:1], s[0:1]                  // 000000000298: D2800000 00000100
                v_add_f64     v[16:17], v[17:18], s[0:1]              // 0000000002A0: D2800010 00000111
                v_add_f64     v[2:3], v[2:3], s[0:1]                  // 0000000002A8: D2800002 00000102
                v_mul_f64     v[12:13], v[12:13], v[12:13]            // 0000000002B0: D281000C 0002190C
                v_mul_f64     v[4:5], v[4:5], v[4:5]                  // 0000000002B8: D2810004 00020904
                v_add_f64     v[10:11], v[10:11], s[0:1]              // 0000000002C0: D280000A 0000010A
                v_add_f64     v[8:9], v[8:9], s[0:1]                  // 0000000002C8: D2800008 00000108
                v_mul_f64     v[14:15], v[14:15], v[14:15]            // 0000000002D0: D281000E 00021D0E
                v_mul_f64     v[0:1], v[0:1], v[0:1]                  // 0000000002D8: D2810000 00020100
                v_mul_f64     v[16:17], v[16:17], v[16:17]            // 0000000002E0: D2810010 00022110
                v_mul_f64     v[2:3], v[2:3], v[2:3]                  // 0000000002E8: D2810002 00020502
              ...
              

              The only extra compiler option used was "-cl-std=CL1.1".

              Driver version used: 1912.5 (VM)

              Platform: 64bit Linux (Ubuntu 14.04)

              In contrast, when I run the same kernel on an NVIDIA GPU, FMA instructions are produced in both cases.

                • Re: Maximum DP floating point throughput without -cl-mad-enable option
                  rampitec

                  Oh, I see what you mean. There are two kinds of multiply-add operations: fma and mad/mac. Mad/mac are only available as f32 and yield exactly the same result as a separate multiply and add. Fma, in turn, is available as both f32 and f64 and has higher accuracy than the separate operations, so if you use fma you may not get exactly the same result as if you do not. Also, on some parts fma is slower. Therefore the compiler generates mad/mac automatically for single precision, but fma for double precision only upon request.
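
The accuracy difference mentioned above can be reproduced on the host. The following is a minimal sketch in Python, not OpenCL, and not what the GPU compiler does internally. The function names `mul_then_add` and `fused_multiply_add` and the input values are assumptions chosen for illustration. The fused version is emulated by computing `a*b + c` exactly with `fractions.Fraction` and rounding once to double, which is the rounding behaviour IEEE-754 specifies for fusedMultiplyAdd.

```python
from fractions import Fraction


def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded to double,
    then the sum is rounded again (two roundings)."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fma (illustrative helper, not a hardware instruction):
    a*b + c is computed exactly, then rounded to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # int/int true division is correctly rounded


# Same shape as the kernel's instruction: x = x * x + c
x = 1.0 + 2.0 ** -30   # hypothetical input that exposes the difference
c = -1.0

separate = mul_then_add(x, x, c)
fused = fused_multiply_add(x, x, c)

print("separate mul+add:", separate.hex())
print("emulated fma    :", fused.hex())
print("results differ  :", separate != fused)
```

With these inputs the exact product is 1 + 2^-29 + 2^-60. The separate multiply rounds away the 2^-60 term before the add, while the single rounding of the fused form keeps it. This is why a compiler that must preserve bit-exact results cannot freely swap one form for the other.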