4 Replies Latest reply on Apr 12, 2013 11:13 AM by stuart.rogers

    ACML FFT Reproducible?

    stuart.rogers

      Is the ACML FFT using mode 100, which finds an optimal plan for a specific problem size, reproducible across different processors? That is, will the results of the FFT (using mode 100) agree between different types of processors? Is mode 0 FFT, which uses a default plan, reproducible across processors?

        • Re: ACML FFT Reproducible?
          chipf

          For mode 100 it might be reproducible, but that can't be guaranteed. Depending on the problem size, different radix plans could be chosen for different processors, which could lead to slightly different results.

           

          For mode 0, we use the same default radix plans on every machine. When comparing machines that use the same instruction set, mode 0 results should be reproducible. But there could be differences in results between older SSE-only processors and the newer FMA-capable processors.

            • Re: ACML FFT Reproducible?
              stuart.rogers

              I'm not familiar with radix plans. If the FFT sizes are a power of 2 with mode 100, will they be reproducible across AMD processors? Power-of-2 FFTs (using either mode 100 or 0) should not be expected to be reproducible between AMD and Intel processors, right?

                • Re: ACML FFT Reproducible?
                  chipf

                  Assuming that the problem size is a product of small prime factors, a radix plan is the set of small-radix FFT passes that combine to run the entire FFT. The radix plan includes the order in which the passes are made. For instance, 32 is 2^5, so we might use radix 4 then 8, or we might use 8 then 4. The order of operations would be different, and that might cause small differences in the results. Mode 100 will look at all (or most) of the possible radix combinations, time them, and choose the fastest. The fastest radix plan can vary between different machines.

                  ACML includes code for radices 2, 3, 4, 5, 7, and 8, and even some Fortran implementations for radices such as 11 and 13.

                   

                  If both AMD and Intel machines are using SSE instructions, and if they both use the same radix plan, then they should produce the same results. When and if Intel makes FMA available, then both machines using FMA for the same radix plan should produce the same results. We don't provide visibility into which plan is chosen for mode 100, but for mode 0, it would be the same for both machines.

                   

                  FMA doesn't always help to improve FFT performance, so if it's causing differences in results between processors, you can try turning it off by setting the environment variable ACML_FMA=0. There may or may not be a performance penalty in doing this on the AMD machine.

                    • Re: ACML FFT Reproducible?
                      stuart.rogers

                      I'm not convinced the Intel and AMD FFT results should be expected to be the same, even assuming the same instruction set and radix plan. Below is an excerpt from an email from Matteo Frigo, in response to a question about FFTW reproducibility. Do Intel Sandy Bridge and AMD Phenom II use the same instruction set?

                       

                      In your quest for reproducibility you should be warned that Intel and AMD processors are not exactly equivalent to begin with.  In particular, transcendental functions such as sin() and cos() are hard to compute to within machine precision, and different processors may compute a different approximation to sin(x) and cos(x).

                       

                      For example, the program below prints 11b2b1c6227d838e on Sandy Bridge and 9ab1006d9fea70ab on a Phenom II processor (x86_64-linux-gnu-gcc-4.7 -O3).

                       

                      Even more alarmingly, the answers from the two processors agree (in this particular case) if I compute only sin() or only cos().  It turns out that gcc replaces the sin/cos calls by the single hardware instruction fsincos, which behaves differently on the two machines, whereas individual sin() and cos() agree (in this case).

                       

                      In the past I was able to produce an explicit value of x such that cos(x) was different on the two machines, but I haven't checked recently.  The value of x was not particularly complicated (something like x=2*PI*N/M for M=some small prime).

                       

                      Intel has changed their trigonometric algorithms at least once in the past in a documented way.  I don't know if they maintain absolute compatibility within their own processors, and I have no idea of what AMD is doing.

                       

                      ------------------------------------------------------------

                      #include <math.h>
                      #include <stdio.h>

                      #define HASH(thing) y.d = thing; chk = (chk * 17) ^ (y.ll);

                      int main(int argc, char *argv[])
                      {
                           int i, n;
                           union {
                                double d;
                                unsigned long long ll;
                           } y;
                           unsigned long long chk = 0;

                           for (n = 1; n < 10000; ++n) {
                                for (i = 0; i < n; ++i) {
                                     double x = (double)i / (double)n;
                                     HASH(cos(x));
                                     HASH(sin(x));
                                }
                           }

                           printf("%llx\n", chk);
                           return 0;
                      }