
    Stream Computing - a sobering experience

    karx11erx

      This post is an evaluation of what I have achieved and experienced using ATI's Stream Computing; maybe I will get some feedback, or even some hints as to where I went wrong (if I did) and how to do better.

       

      Task: Use Stream Computing to accelerate transformation of geographical coordinates

      Test machine: Intel C2D E8500, Radeon HD 4870 1GB, 4 GB DDR2 RAM 1066 MHz, WinXP SP 3

       

      1. Tools

      Flaws and bugs in Brook+ pose a few problems, particularly for a newcomer to ATI's Stream Computing SDK and tools, but thankfully these were overcome with the help of a few knowledgeable forum members.

       

      2. Precision

      When evaluating the results of an initial coordinate transformation test implementation using single precision floating point numbers (float), it turned out that computation errors accumulated so badly that the results were unusable.

      Switching to double ran into the big problem that the built-in math functions all work with float, so the precision problem arose again. It turned out that while I could provide double versions of most of these functions, not all of them could be replaced, so precision problems remained. Given the nature of the subject (and me not being an expert in numerical mathematics and its application to computers), I have not been able to solve these.
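      For reference, the precision evaluation was essentially a comparison against the CPU code path, roughly of this form (xCpu/xGpu and nCoords are just placeholder names for the two result buffers):

      #include <math.h>

      // sketch of the precision check: largest relative difference between
      // the CPU reference results and the results read back from the GPU
      static double maxRelError (const double *xCpu, const double *xGpu, int nCoords)
      {
      double maxErr = 0.0;

      for (int i = 0; i < nCoords; i++) {
          double scale = (fabs (xCpu [i]) > 1.0) ? fabs (xCpu [i]) : 1.0;
          double err = fabs (xGpu [i] - xCpu [i]) / scale;
          if (err > maxErr)
              maxErr = err;
          }
      return maxErr;
      }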

       

      3. Speed

      Transforming a coordinate in my test case involves quite a few steps. I have made sure that the data remained on the Stream Computing hardware (the graphics card). Still, with all the optimizations I could apply, I only achieved a threefold speedup compared to the CPU code path (the original coordinate transformation code), and even that only for large numbers of coordinates (the break-even point was at around 1 million double precision coordinates, i.e. 3 million double values (x, y plus zone specifier)).

      I have to say that I don't know whether I could improve on this, because I haven't really understood code optimization as explained in the optimized matrix multiplication example, but I doubt that the speed could be more than doubled that way.

       

      So as a result, unless I have massively goofed up somewhere (which I wouldn't be able to detect), Stream Computing has proven to be useless for my application: it lacks the speed and, above all, the numerical precision required.

      That is pretty disappointing, I have to say. Maybe this will change should there ever be Stream Computing hardware with full native double precision support.

      Thanks for all the help I have received here.

      karx11erx

        • Stream Computing - a sobering experience
          gaurav.garg

          Is it possible for you to post your code here? I can suggest some optimizations and we can try to optimize the code in an iterative and incremental way.

            • Stream Computing - a sobering experience
              karx11erx

              Ok. This is going to be a lengthy post. I will try to explain things as well as possible to give an overview of what I have done and where potential problems are. The basis of my code is the public domain proj library.

              Here we go:

               

              Stream Computing in the proj library

              proj provides a large number of geographical transformations using a generalized interface and processing path.

              It seems to have been written by people who knew about geographical transformations and how to handle C, but not how to cleanly design software.
              The code base is therefore a bit opaque, with a lot of macros used to make adding new functionality easier for people who already know the code well.


              Transformation Flow
              ===================

              The general flow of a transformation looks somewhat like this:

              inverse {
                  preprocessing
                  system specific transformation (via function pointer)
                  postprocessing
                  }
              datum transform {
                  apply gridshift (CPU only, requires copying streams back to CPU memory and copying results to streams again)
                  geodetic -> geocentric
                  geocentric <- WGS84
                  geocentric -> geodetic
                  }
              forward {
                  preprocessing
                  system specific transformation (via function pointer)
                  postprocessing
                  }
                  

              Data Handling (Buffers)
              =======================

              proj uses three input buffers, double *x, *y, *z. In some functions these are read/write; in others they are read-only and the functions return the results.
              On top of that, values can be stored with a fixed stride in these buffers, so the i-th value is found at x [i * stride]. In order to reduce coordinate
              data being moved to and from temporary locations, and to make the code easier to handle, I added buffers where the coordinates are stored without stride:

              double2 *xyBuf;
              double* zBuf;

              I have rewritten the code to take advantage of that.
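              Filling these buffers from the strided proj buffers is a simple gather loop, roughly like this (nPoints and stride are illustrative names):

              // sketch: gather the strided proj buffers into the stride-less buffers
              for (int i = 0; i < nPoints; i++) {
                  xyBuf [i].x = x [i * stride];
                  xyBuf [i].y = y [i * stride];
                  zBuf [i] = z ? z [i * stride] : 0.0;
                  }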


              Adding stream computing
              =======================

              For each of the above steps, a kernel function containing the math had to be written. In order to avoid stream aliasing, pairs of streams equivalent to the
              above coordinate buffers are used:

              ::brook::Stream<double2>*    xyStream [2];
              ::brook::Stream<double>*     zStream [2];

              As there seems to be no way to initialize a stream other than when declaring or allocating it, I had to use pointers.

              The streams are used alternately as source and destination streams by indexing them with

              int nDestXY = 0;
              int nDestZ = 0;

              These are toggled before each call to a kernel function:

              nDestXY = !nDestXY;
              nDestZ = !nDestZ;

              proj_kernel (xyStream [!nDestXY], zStream [!nDestZ], xyStream [nDestXY], zStream [nDestZ], ...);

              This results in the result stream of processing step N being the input stream of processing step N+1. xyStream [0] and zStream [0] are initially filled
              with the input data passed to the transformation function. Data is copied to the streams, then the transformation code path is executed, then the
              result data is copied back to CPU memory. During the transformation, data has to be copied to CPU memory and back to GPU memory once, because some of
              the processing (currently) can only be done by the CPU.
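              To illustrate the pattern, the host side looks roughly like this (a sketch, not my actual code; I am assuming the Stream<T> (rank, dims) constructor and the read ()/write () copy methods here, nPoints is the number of coordinates, and proj_kernel stands for any of the kernels):

              // sketch of the host-side ping-pong scheme (dimension handling simplified)
              unsigned int dims [] = { (unsigned int) nPoints };

              xyStream [0] = new ::brook::Stream<double2> (1, dims);
              xyStream [1] = new ::brook::Stream<double2> (1, dims);
              zStream [0] = new ::brook::Stream<double> (1, dims);
              zStream [1] = new ::brook::Stream<double> (1, dims);

              xyStream [0]->read (xyBuf);   // copy the input coordinates to the GPU
              zStream [0]->read (zBuf);

              nDestXY = !nDestXY;           // toggle before each kernel call
              nDestZ = !nDestZ;
              proj_kernel (*xyStream [!nDestXY], *zStream [!nDestZ], *xyStream [nDestXY], *zStream [nDestZ] /*, ... */);

              // ... further kernel calls, toggling the indices before each one ...

              xyStream [nDestXY]->write (xyBuf);    // copy the final results back to CPU memory
              zStream [nDestZ]->write (zBuf);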



              Math
              ====

              It turned out that single precision math (float) yields differences that are too large compared to the CPU path's double precision math (double), so I implemented
              double versions of the following functions:


              #define HUGE 1.0e300    // would actually need INFINITY

              //------------------------------------------------------------------------------

              kernel double Fabs (double d)
              {
              if (d >= 0.0)
                  return d;
              return -d;
              }

              //------------------------------------------------------------------------------

              kernel double Sqrt (double x) // approx. of double precision square root using single prec. reciprocal square root
              {
              double y = (double) rsqrt ((float) x);
              return 2.0 / (y * (3.0 - (x * (y * y))));
              }

              //------------------------------------------------------------------------------

              kernel double Rsqrt (double x) // approx. of double precision reciprocal square root
              {
              double y = (double) rsqrt ((float) x);
              return 0.5 * (y * (3.0 - (x * (y * y))));
              }

              //------------------------------------------------------------------------------
              Approximating sin and cos with the Abramowitz and Stegun formulas was not precise enough, so Taylor series had to do.

              kernel double Sin (double phi<>)
              {
              double phi2 = phi * phi;
              return phi * (1.0 - phi2 / 6.0 * (1.0 - phi2 / 20.0 * (1.0 - phi2 / 42.0 * (1.0 - phi2 / 72.0 * (1.0 - phi2 / 110.0 * (1.0 - phi2 / 156.0))))));
              }

              //------------------------------------------------------------------------------

              kernel double Cos (double phi<>)
              {
              double phi2 = phi * phi;
              return 1.0 - phi2 / 2.0 * (1.0 - phi2 / 12.0 * (1.0 - phi2 / 30.0 * (1.0 - phi2 / 56.0 * (1.0 - phi2 / 90.0 * (1.0 - phi2 / 132.0)))));
              }

              //------------------------------------------------------------------------------

              kernel double Tan (double phi<>)
              {
                  double cosphi = Cos (phi);
                  
              if (cosphi == 0.0)
                  return HUGE;
              else
                  return Sin (phi) / cosphi;
              }

              //------------------------------------------------------------------------------

              #define sq2p1    2.414213562373095048802e0
              #define sq2m1   0.414213562373095048802e0
              #define p4        0.161536412982230228262e2
              #define p3        0.26842548195503973794141e3
              #define p2        0.11530293515404850115428136e4
              #define p1        0.178040631643319697105464587e4
              #define p0        0.89678597403663861959987488e3
              #define q4        0.5895697050844462222791e2
              #define q3        0.536265374031215315104235e3
              #define q2        0.16667838148816337184521798e4
              #define q1        0.207933497444540981287275926e4
              #define q0        0.89678597403663861962481162e3
              #define PIO2    1.5707963267948966135e0
              #define nan        (0.0/0.0)

              kernel double mxatan (double phi<>)
              {
                  double phi2;

                  double value;

              phi2 = phi * phi;
              value = ((((p4 * phi2 + p3) * phi2 + p2) * phi2 + p1) * phi2 + p0);
              value = value / (((((phi2 + q4) * phi2 + q3) * phi2 + q2) * phi2 + q1) * phi2 + q0);
              return value * phi;
              }

              kernel double msatan (double phi<>)
              {
              if(phi < sq2m1)
                  return mxatan (phi);
              else if (phi > sq2p1)
                  return PIO2 - mxatan (1.0 / phi);
              else
                  return PIO2 / 2.0 + mxatan ((phi - 1.0) / (phi + 1.0));
              }

              kernel double Atan (double phi<>)
              {
              if (phi > 0.0)
                  return msatan (phi);
              else
                  return -msatan (-phi);
              }

              //------------------------------------------------------------------------------

              kernel double Atan2 (double y<>, double x<>)
              {
                  double    a;

              if (Fabs (x) > Fabs (y))
                  a = Atan (y / x);
              else {
                  a = Atan (x / y);
                  if (a < 0.0)
                      a = -PI / 2.0 - a;
                  else
                      a = PI / 2.0 - a;
                  }
              if (x < 0.0) {
                 if (y < 0.0)
                      a -= PI;
                  else
                      a += PI;
                  }    
              return a;
              }

              //------------------------------------------------------------------------------

              Transformation kernels (example)
              ================================


              1. General pre- and post-processing
              -----------------------------------

              // general forward projection preprocessing

              kernel void pj_fwd_pre (double2 xy<>, out double2 lp<>, double lam0, double rone_es, int geoc, int over)
              {
              double t = xy.y;
              t = Fabs (t) - HALFPI;
              if (t > EPS) {
                  lp.x = HUGE;
                  lp.y = HUGE;
                  }
              else if (Fabs (xy.x) > 10.0) {
                  lp.x = HUGE;
                  lp.y = HUGE;
                  }
              else {
                  if (Fabs (t) <= EPS) {
                      if (xy.y < 0.0)
                          lp.y = -HALFPI;
                      else
                          lp.y = HALFPI;
                      }
                  else if (geoc)
                      lp.y = Atan (rone_es * Tan (xy.y));
                  else
                      lp.y = xy.y;
                  if (over)
                      lp.x = xy.x - lam0;
                  else
                      lp.x = adjlon (xy.x - lam0);
                  }
              }


              // general forward projection post processing

              kernel void pj_fwd_post (double2 lp<>, out double2 xy<>, double fr_meter, double x0, double y0, double ra)
              {
              xy.x = fr_meter * (ra * lp.x + x0);
              xy.y = fr_meter * (ra * lp.y + y0);
              }



              2. A transformation
              -------------------


              #define HALFPI    1.5707963267948966
              #define FORTPI    0.78539816339744833
              #define PI        3.14159265358979323846
              #define SPI    3.14159265359
              #define TWOPI    6.2831853071795864769
              #define EPS        1.0e-12
              #define HUGE    1.0e300

              #define MAX_ITER    10

              kernel double pj_inv_mlfn (double arg<>, double es, double en0, double en1, double en2, double en3, double en4)
              {
                  double s, t, phi, k = 1.0 / (1.0 - es);
                  int i;

              phi = arg;
              for (i = MAX_ITER; i; --i) { /* rarely goes over 2 iterations */
                  s = Sin (phi);
                  t = 1.0 - es * s * s;
                  t = (pj_mlfn (phi, s, Cos (phi), en0, en1, en2, en3, en4) - arg) * (t * Sqrt (t)) * k;
                  phi -= t;
                  if (Fabs (t) < EPS)
                      return phi;
                  }
              return HUGE;
              }


              #define FC1        1.0
              #define FC2        0.5
              #define FC3        0.16666666666666666666
              #define FC4        0.08333333333333333333
              #define FC5        0.05
              #define FC6        0.03333333333333333333
              #define FC7        0.02380952380952380952
              #define FC8        0.01785714285714285714

              kernel void pj_tmerc_e_inverse (double2 xy<>, out double2 lp<>, double es, double esp, double k0, double en0, double en1, double en2, double en3, double en4, double ml0)
              {
                  double n, con, cosphi, d, ds, sinphi, t;

              lp.y = pj_inv_mlfn (ml0 + xy.y / k0, es, en0, en1, en2, en3, en4);
              if (Fabs (lp.y) >= HALFPI) {
                  if (xy.y < 0.0) {
                      lp.y = -HALFPI;
                      }
                  else {
                      lp.y = HALFPI;
                      }
                  lp.x = 0.0;
                  return;
                  }

              sinphi = Sin (lp.y);
              cosphi = Cos (lp.y);
              n = esp * cosphi * cosphi;
              con = 1.0 - es * sinphi * sinphi;
              d = xy.x * Sqrt (con) / k0;
              ds = d * d;

              if (Fabs (cosphi) > 1.0e-10) {
                  t = sinphi / cosphi;
                  }
              else {
                  t = 0.0;
                  }

              if (t == 0.0) {
                  lp.x = d * (FC1
                              - ds * FC3 * (1.0 + 2.0 * t + n
                              - ds * FC5 * (5.0 + 6.0 * n
                              - ds * FC7 * 61.0)))
                            / cosphi;
                  return;
                  }
                  
              con *= t;
              t *= t;
              lp.y -= (con * ds / (1.0 - es)) * FC2 * (1.0
                       - ds * FC4 * (5.0 + t * (3.0 - 9.0 * n) + n * (1.0 - 4.0 * n)
                       - ds * FC6 * (61.0 + t * (90.0 - 252.0 * n + 45.0 * t) + 46.0 * n
                       - ds * FC8 * (1385.0 + t * (3633.0 + t * (4095.0 + 1574.0 * t))))));
              lp.x = d * (FC1
                          - ds * FC3 * (1.0 + 2.0 * t + n
                          - ds * FC5 * (5.0 + t * (28.0 + 24.0 * t + 8.0 * n) + 6.0 * n
                          - ds * FC7 * (61.0 + t * (662.0 + t * (1320.0 + 720.0 * t))))))
                       / cosphi;
              }

              //------------------------------------------------------------------------------

              The real killer is the precision loss, which is particularly caused by the approximated square root function.
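              Since each Newton-Raphson step roughly doubles the number of correct bits, a second refinement step in Sqrt might recover most of the lost precision; this is just a sketch, I have not tested whether it is accurate enough or what it costs:

              kernel double Sqrt2 (double x) // sketch: sqrt with two Newton-Raphson refinement steps (untested)
              {
              double y = (double) rsqrt ((float) x);    // single precision estimate of 1/sqrt(x)
              y = 0.5 * (y * (3.0 - (x * (y * y))));    // first refinement
              y = 0.5 * (y * (3.0 - (x * (y * y))));    // second refinement
              return x * y;                             // sqrt(x) = x * (1/sqrt(x))
              }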

              The speed is also not good enough to make the use of stream computing desirable for the intended application of the proj library. I do, however, have the impression that speed depends more on buffer management than on the computations, because compute time does not grow linearly with the number of coordinates (something like 1,000 coordinates -> 1000 ms, 1,000,000 coordinates -> 1100 ms). So the problem may be buffer management (or whatever), not actually executing the kernels.

              I am currently retesting this.

              Edit:

              This is interesting.

              12351 coords: 1281 ms
              1235100 coords: 1469 ms
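              Fitting a linear model t = overhead + n * cost to these two points gives a per-coordinate cost of (1469 - 1281) ms / (1235100 - 12351) coords, i.e. roughly 0.15 µs per coordinate, and a fixed overhead of roughly 1280 ms, so the fixed overhead completely dominates at these sizes.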

              I then executed the test case twice in a loop and took the time of the second iteration. Now stream computing is around 7 times faster than the CPU path. I had still expected more, but apparently something happens the first time the code is executed that doesn't happen the second time. Shader program loading? Can I somehow make that happen prior to the first execution of the transformation code?
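              For reference, the measurement was essentially this (TransformGPU stands for the complete transformation call; plain C clock () for the timing):

              #include <time.h>

              // sketch of the two-pass measurement: only the second pass ends up in t0/t1,
              // so the one-time startup cost is excluded
              clock_t t0 = 0, t1 = 0;

              for (int pass = 0; pass < 2; pass++) {
                  t0 = clock ();
                  TransformGPU (xyBuf, zBuf, nCoords);  // placeholder for the full GPU code path
                  t1 = clock ();
                  }
              double ms = 1000.0 * (double) (t1 - t0) / (double) CLOCKS_PER_SEC;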

              Also, can I optimize the code above? Is more parallelization possible (I've seen these .xxxx and .yyyy accesses in the matmult code - can I do something like that for my doubles too and gain more speed)?