Archives Discussions

karx11erx · ‎05-14-2009

This post is an evaluation of what I have achieved and experienced using ATI's Stream Computing, and maybe I get some feedback or even some hints to where I went wrong (if I did) and how to do better.

Task: Use Stream Computing to accelerate transformation of geographical coordinates

Test machine: Intel C2D E8500, Radeon HD 4870 1GB, 4 GB DDR2 RAM 1066 MHz, WinXP SP 3

1. Tools

Flaws and bugs in Brook+ pose a few problems particularly to the newbie to ATI's Stream Computing SDK and tools, but thankfully these were overcome with the help of a few knowledgable forum members.

2. Precision

When evaluating the results of an initial coordinate transformation test implementation using single precision floating point numbers (float) it turned out that compute errors summed up so badly that the results were unuseable.

Transition to double had the big problem that the built-in math functions all work with float, causing the precision problem to arise again. It turned out that while I could provide double versions of most of these functions, not all could be replaced, so precision problems remained. Due to the nature of the subject (and me not being an expert in numerical mathematics and its application on computer systems) I have not been able to solve these.

3. Speed

Transforming a coordinate in my test case involves quite a few steps. I have made sure that data remained on the Stream Computing hardware (gfx card). Still, with all optimizations I could apply, I only achieved a threefold speedup compared to the CPU code path (the original coordinate transformation code), and even that only for large numbers of coordinates (the break even point was at around 1 million double precision coordinates, i.e. 3 million double values (x,y plus zone specifier)).

I have to say that I don't know whether I could improve on this, because I haven't really understood code optimization like explained in the optimized matrix multiplication example, but I doubt that speed could be more than doubled that way.

So as a result, unless I have massively goofed up somewhere (which I wouldn't be able to detect), Stream Computing has proven to be useless for my application: It lacks the speed and most of all the numerical precision required.

That is pretty disappointing I have to say. Maybe this will change should there ever be Stream Computing hardware with full native double precision support.

Thanks for all the help I have received here.

karx11erx

gaurav_garg · ‎05-15-2009

Is it possible for you to post your code here? I can suggest some optimizations and we can try to optimize the code in an iterative and incremental way.

karx11erx · ‎05-15-2009

Ok. It will going to be a lengthy post. I will try to explain it as good as possible to give an overview of what I have done and where potential problems are. The base of my code is the public domain proj library.

Here we go:

Stream Computing in the proj library

proj provides a big bunch of geographical transformations using a generalized interface and processing path.

It seems to have been written by people who knew about geographical transformations and how to handle C, but not how to cleanly design software.
The code base therefore is a bit intransparent with a lot of macros being used to make adding new stuff easier for people who know the code well.

Transformation Flow
===================

The general flow of a transformation looks somewhat like that

inverse {
   preprocessing
   system specific transformation (via function pointer)
   postprocessing
   }
datum transform {
    apply gridshift (CPU only, requires copying streams back to CPU memory and copying results to streams again)
    geodetic -> geocentric
   geocentric <- WGS84
   geocentric -> geodetic
   }
forward {
   preprocessing
   system specific transformation (via function pointer)
   postprocessing
   }


Data Handling (Buffers)
=======================

proj uses three input buffers double *x, *y, *z. In some functions these are read/write, in others they are read only and the functions return the results.
Above that values can be stored with a fixed stride in these buffers, so the i-th value would be found in x [i * stride]. In order to reduce coordinate
data being moved to and from temporary locations and make the code easier to handle, I added buffers where the coordinates were stored without stride:

double2 *xyBuf;
double* zBuf;

The code has been rewritten by me to take advantage of that.

Adding stream computing
=======================

For each of the above steps, a kernel function containing the math had to be written. In order to avoid stream aliasing, pairs of streams equivalent to the
above coordinate buffers are used:

::brook::Stream*   xyStream [2];
::brook::Stream*   zStream [2];

As there seems to be no way to initialize a stream other then when declaring or allocating it, I had to use pointers.

The streams are alternatingly used as source and destination streams by indexing them with

int nDestXY = 0;
int nDestZ = 0;

These are toggled before each call to a kernel function:

nDestXY = !nDestXY;
nDestZ = !nDestZ;

proj_kernel (xyStream [!nDestXY], zStream [!nDestZ], xyStream [nDestXY], zStream [nDestZ], ...);

This will result in the results stream of processing step N being the input stream for processing step N+1. xyStream [0] and zStream [0] are initially filled
with the input data that has been passed to the transformation function. Data is copied to the streams, then the transformation code path is executed, then
result data is copied back to CPU memory. During transformation, data has to be copied to CPU memory and back to GPU memory once because some data processing
(currently) can only be done by the CPU.

Math
====

It has turned out that single precision math (float) yields too big differences compared to the CPU paths double precision math (double), so I implemented
double versions of the following functions:

#define HUGE 1.0e300   // would actually need INFINITY

//------------------------------------------------------------------------------

kernel double Fabs (double d)
{
if (d >= 0.0)
   return d;
return -d;
}

//------------------------------------------------------------------------------

kernel double Sqrt (double x) // approx. of double precision square root using single prec. reciprocal square root
{
double y = (double) rsqrt ((float) x);
return 2.0 / (y * (3.0 - (x * (y * y))));
}

//------------------------------------------------------------------------------

kernel double Rsqrt (double x) // approx. of double precision reciprocal square root
{
double y = (double) rsqrt ((float) x);
return 0.5 * (y * (3.0 - (x * (y * y))));
}

//------------------------------------------------------------------------------
Approximating sin and cos using Abramowitz, Stegun formulas was not precise enough, so Taylor series had to do.

kernel double Sin (double phi<>)
{
double phi2 = phi * phi;
return phi * (1.0 - phi2 / 6.0 * (1.0 - phi2 / 20.0 * (1.0 - phi2 / 42.0 * (1.0 - phi2 / 72.0 * (1.0 - phi2 / 110.0 * (1.0 - phi2 / 156.0))))));
}

//------------------------------------------------------------------------------

kernel double Cos (double phi<>)
{
double phi2 = phi * phi;
return 1.0 - phi2 / 2.0 * (1.0 - phi2 / 12.0 * (1.0 - phi2 / 30.0 * (1.0 - phi2 / 56.0 * (1.0 - phi2 / 90.0 * (1.0 - phi2 / 132.0)))));
}

//------------------------------------------------------------------------------

kernel double Tan (double phi<>)
{
   double cosphi = Cos (phi);

if (cosphi == 0.0)
   return HUGE;
else
   return Sin (phi) / Cosphi;
}

//------------------------------------------------------------------------------

#define sq2p1   2.414213562373095048802e0
#define sq2m1   0.414213562373095048802e0
#define p4       0.161536412982230228262e2
#define p3       0.26842548195503973794141e3
#define p2       0.11530293515404850115428136e4
#define p1       0.178040631643319697105464587e4
#define p0       0.89678597403663861959987488e3
#define q4       0.5895697050844462222791e2
#define q3       0.536265374031215315104235e3
#define q2       0.16667838148816337184521798e4
#define q1       0.207933497444540981287275926e4
#define q0       0.89678597403663861962481162e3
#define PIO2   1.5707963267948966135e0
#define nan       (0.0/0.0)

kernel double mxatan (double phi<>)
{
   double phi2;

    double value;

phi2 = phi * phi;
value = ((((p4 * phi2 + p3) * phi2 + p2) * phi2 + p1) * phi2 + p0);
value = value / (((((phi2 + q4) * phi2 + q3) * phi2 + q2) * phi2 + q1) * phi2 + q0);
return value * phi;
}

kernel double msatan (double phi<>)
{
if(phi < sq2m1)
    return mxatan (phi);
else if (phi > sq2p1)
    return PIO2 - mxatan (1.0 / phi);
else
   return PIO2 / 2.0 + mxatan ((phi - 1.0) / (phi + 1.0));
}

kernel double Atan (double phi<>)
{
if (phi > 0.0)
   return msatan (phi);
else
   return -msatan (-phi);
}

//------------------------------------------------------------------------------

kernel double Atan2 (double y<>, double x<>)
{
   double   a;

if (Fabs (x) > Fabs (y))
   a = Atan (y / x);
else {
   a = Atan (x / y);
   if (a < 0.0)
       a = -PI / 2.0 - a;
   else
       a = PI / 2.0 - a;
   }
if (x < 0.0) {
   if (y < 0.0)
       a -= PI;
   else
       a += PI;
   }
return a;
}

//------------------------------------------------------------------------------

Transformation kernels (example)
================================

1. General pre- and post-processing
-----------------------------------

// general forward projection preprocessing

kernel void pj_fwd_pre (double2 xy<>, out double2 lp<>, double lam0, double rone_es, int geoc, int over)
{
double t = xy.y;
t = Fabs (t) - HALFPI;
if (t > EPS) {
   lp.x = HUGE;
   lp.y = HUGE;
   }
else if (Fabs (lp.x) > 10.0) {
   lp.x = HUGE;
   lp.y = HUGE;
   }
else {
   if (Fabs (t) <= EPS) {
       if (xy.y < 0.0)
           lp.y = -HALFPI;
       else
           lp.y = HALFPI;
       }
   else if (geoc)
       lp.y = Atan (rone_es * Tan (xy.y));
   else
       lp.y = xy.y;
   if (over)
       lp.x = xy.x - lam0;
   else
       lp.x = adjlon (xy.x - lam0);
   }
}

// general forward projection post processing

kernel void pj_fwd_post (double2 lp<>, out double2 xy<>, double fr_meter, double x0, double y0, double ra)
{
xy.x = fr_meter * (ra * lp.x + x0);
xy.y = fr_meter * (ra * lp.y + y0);
}

2. A transformation
-------------------

#define HALFPI   1.5707963267948966
#define FORTPI   0.78539816339744833
#define PI       3.14159265358979323846
#define SPI    3.14159265359
#define TWOPI   6.2831853071795864769
#define EPS       1.0e-12
#define HUGE   1.0e300

#define MAX_ITER   10

kernel double pj_inv_mlfn (double arg<>, double es, double en0, double en1, double en2, double en3, double en4)
{
   double s, t, phi, k = 1.0 / (1.0 - es);
   int i;

phi = arg;
for (i = MAX_ITER; i; --i) { /* rarely goes over 2 iterations */
   s = Sin (phi);
   t = 1.0 - es * s * s;
   t = (pj_mlfn (phi, s, Cos (phi), en0, en1, en2, en3, en4) - arg) * (t * Sqrt (t)) * k;
   phi -= t;
   if (Fabs (t) < EPS)
       return phi;
   }
return HUGE;
}

#define FC1       1.0
#define FC2       0.5
#define FC3       0.16666666666666666666
#define FC4       0.08333333333333333333
#define FC5       0.05
#define FC6       0.03333333333333333333
#define FC7       0.02380952380952380952
#define FC8       0.01785714285714285714

kernel void pj_tmerc_e_inverse (double2 xy<>, out double2 lp<>, double es, double esp, double k0, double en0, double en1, double en2, double en3, double en4, double ml0)
{
   double n, con, cosphi, d, ds, sinphi, t;

lp.y = pj_inv_mlfn (ml0 + xy.y / k0, es, en0, en1, en2, en3, en4);
if (Fabs (lp.y) >= HALFPI) {
   if (xy.y < 0.0) {
       lp.y = -HALFPI;
       }
   else {
       lp.y = HALFPI;
       }
   lp.x = 0.0;
   return;
   }

sinphi = Sin (lp.y);
cosphi = Cos (lp.y);
n = esp * cosphi * cosphi;
con = 1.0 - es * sinphi * sinphi;
d = xy.x * Sqrt (con) / k0;
ds = d * d;

if (Fabs (cosphi) > 1.0e-10) {
   t = sinphi / cosphi;
   }
else {
   t = 0.0;
   }

if (t == 0.0) {
   lp.x = d * (FC1
               - ds * FC3 * (1.0 + 2.0 * t + n
               - ds * FC5 * (5.0 + 6.0 * n
               - ds * FC7 * 61.0)))
              / cosphi;
   return;
   }

con *= t;
t *= t;
lp.y -= (con * ds / (1.0 - es)) * FC2 * (1.0
       - ds * FC4 * (5.0 + t * (3.0 - 9.0 * n) + n * (1.0 - 4.0 * n)
       - ds * FC6 * (61.0 + t * (90.0 - 252.0 * n + 45.0 * t) + 46.0 * n
       - ds * FC8 * (1385.0 + t * (3633.0 + t * (4095.0 + 1574.0 * t))))));
lp.x = d * (FC1
           - ds * FC3 * (1.0 + 2.0 * t + n
           - ds * FC5 * (5.0 + t * (28.0 + 24.0 * t + 8.0 * n) + 6.0 * n
           - ds * FC7 * (61.0 + t * (662.0 + t * (1320.0 + 720.0 * t))))))
       / cosphi;
}

//------------------------------------------------------------------------------

The real killer is the precision loss which is particularly caused by the approximated square root function.

Speed also is not good enough to make use of stream computing desirable for the intended application of the proj library. I do however have the impression that speed more depends on buffer management than on computations, because the relation between the increase of coordinates and the increase of compute time is not linear (like 1,000 coordinates -> 1000 ms, 1,000,000 coords -> 1100 ms). So the problem may be some buffer management (or whatever), not actually executing the kernels.

I am currently retesting this.

Edit:

This is interesting.

12351 coords: 1281 ms
1235100 coords: 1469 ms

I have then executed the test case two times in a loop and taken the time for the second iteration. Now, stream computing is around 7 times faster than the CPU path. I still had expected more, but apparently something is going on the first time the code is executed that doesn't happen the second time. Shader program loading? Can I somehow make stuff like that happen prior to first execution of the transformation code?

Also, can I optimize my above code? Is more parallelization possible (I've seen these .xxxx and .yyyy accesses in the matmult code - can I do something like that for my doubles too and gain more speed)?

gaurav_garg · ‎05-15-2009

Yes, you are right. It is shader compilation and symbol querying that is cached for further calls of the same kernels, but there doesn't seems to be a way to avoid using online compilation until you want to modify Brook+ runtime for offline loading of an IL image.

Your code already looks like true SIMD in nature and given the amount of computation in your code, memory access is definitely not an overhead.

Could you also post your runtime code? I would be in better position to suggest some optimizations.

karx11erx · ‎05-15-2009

Well, if I know that some pre-caching causes the slow down, I will simply call a dummy transformation when the transformation app is loaded.

What do you mean with runtime code?

Listing the actual code executed during my test transformations is probably a bit beyond something I could post in a forum.

gaurav_garg · ‎05-15-2009

I mean the host side code. If you can use dummy calls of kernels at load time that would be greate.

karx11erx · ‎05-15-2009

I could upload a rar file containing the complete project for you (it's not that big). You should be able to compile and execute it after adjusting the brook lib and tool paths with MS Visual C 2008.

Be advised though that the basic code isn't the prettiest, nor is my test code. It's pretty straight forward and not hard to understand though.

http://www.descent2.de/downloads/proj-4.6.1.rar

gaurav_garg · ‎05-15-2009

Amount of code is much more than what I had expected. I would try to take a look at it if I get some free cycles.

karx11erx · ‎05-15-2009

Gaurav,

The execution path for my test case does not involve all the files contained in the project. There's about a file per projection (and proj provides a *lot* of projections). You actually only need 5 source files, or so (proj, transform, merc, tmerc, geocent plus the related brook generated cpp files).

I am not exactly a coding beginner. There's not much you can squeeze out of that code.I provided you with the project so that you could take a look in it if you really liked to. Of course you have no obligation to look into it.

karx

karx11erx · ‎06-03-2009

I could triple execution speed by using a 2-dimensional stream with a first dimension of 4 (more doesn't yield any extra speed).

Archives Discussions

Stream Computing - a sobering experience