5 Replies Latest reply on Aug 7, 2010 5:21 AM by coggy

    Manual implementation of trigonometric functions on doubles


      Hi, I'm currently using OpenCL on a Radeon HD 5870 for an academic project doing calculations on doubles. As the only operations currently supported on doubles are addition, subtraction and multiplication, I am thinking about implementing at least cos(), possibly other functions manually.

      Does anyone have any experience implementing trigonometric functions using OpenCL, e.g. the Cordic algorithm? Can I expect to achieve reasonable performance this way?

      Thanks for any answers.

        • Manual implementation of trigonometric functions on doubles

          You'll probably have to do it using the Newton-Raphson method (an iterative method), and the existing single-precision trig functions. In addition to some subtracts, and divides, you'll have to call the single precision version several times. So, to answer your question, I would expect a double-precision trig function to be an order of magnitude or so slower than single.


            • Manual implementation of trigonometric functions on doubles

              The current single precision trig functions are implemented without using the trig instructions. The trig instructions are not accurate enough to meet the OpenCL requirements (they're about 2e-3, but opencl requires about 4e-7 max error).

              Sin and cos are best implemented with a taylor series expansion. I believe 4 terms is all that is necessary to achieve OpenCL accuracy requirements for float, and about 10 terms for double. You'll need to do argument reduction and expand either sin or cos based on the octant...

              Best resource I have found for the math routines you might want is the Cephes mathematical library. Google it and you'll find very accurate implementations of all the functions you might need.




            • Manual implementation of trigonometric functions on doubles

              I met the same problem. Here is my implementation.

              A taylor series expansion for sin(x) in MATLAB as follows.

              fsin=@(x) x.^21/51090942171709440000 - x.^19/121645100408832000 + x.^17/355687428096000 - x.^15/1307674368000 + x.^13/6227020800 - x.^11/39916800 + x.^9/362880 - x.^7/5040 + x.^5/120 - x.^3/6 + x;

              Pa[] is a double2 arrary in constant memory.

              Pa[20].x ... Pa[24].y={1/51090942171709440000.0,-1/121645100408832000.0,1/355687428096000.0,-1/1307674368000.0,1/6227020800.0,-1/39916800.0,1/362880.0,-1/5040.0,1/120.0,-1/6.0}

              double sinf(double x, __constant double2 *Pa)
               double x2=x*x;
               double xk;
               xk = mad(x2,Pa[20].x,Pa[20].y);
               xk = mad(xk,x2,Pa[21].x);
               xk = mad(xk,x2,Pa[21].y);
               xk = mad(xk,x2,Pa[22].x);
               xk = mad(xk,x2,Pa[22].y);
               xk = mad(xk,x2,Pa[23].x);
               xk = mad(xk,x2,Pa[23].y);
               xk = mad(xk,x2,Pa[24].x);
               xk = mad(xk,x2,Pa[24].y);
               xk = mad(xk,x2,1.0f)*x;
               return xk;

              I use double2 instead of double to minimize fetching operations(5 fetches instead of 10 fetches).

              If you wana compute cos(x), you can using the following formula


              the max error for x

              [0,pi/2]  sin(x):1.1102e-016   cos(x):1.1102e-016

              [0,pi]     sin(x)  1.0348e-011  cos(x)  1.1102e-016

              [-pi/2,pi/2] sin(x)  1.1102e-016 cos(x)  1.0348e-011



                • Manual implementation of trigonometric functions on doubles

                  After  the horner transformation

                  fsin(x)=x*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2/51090942171709440000 - 1/121645100408832000) + 1/355687428096000) - 1/1307674368000) + 1/6227020800) - 1/39916800) + 1/362880) - 1/5040) + 1/120) - 1/6) + 1)

                  MUL_64  2 ops

                  FMA_64  10 ops

                  total       12 ALU + 5Fetches

                  If you do this computation many times, you need to put the constant in GPR. However, it seems that the compiler always try to minimize the usage of GPR. So I cann't make it happen.

                  The ALU operation listed as follows

                  02 ALU: ADDR(58) CNT(52)
                       14  x: MUL_64      T0.x,  R6.y,  R6.y     
                           y: MUL_64      T0.y,  R6.y,  R6.y     
                           z: MUL_64      ____,  R6.y,  R6.y     
                           w: MUL_64      ____,  R6.x,  R6.x     
                           t: ADD_INT     R8.z,  R6.z,  R1.z     
                       15  x: FMA_64      ____,  PV14.y,  R3.y,  R3.w      VEC_021
                           y: FMA_64      ____,  PV14.y,  R3.y,  R3.w      VEC_021
                           z: FMA_64      ____,  PV14.y,  R3.y,  R3.w      VEC_021
                           w: FMA_64      ____,  PV14.x,  R3.x,  R3.z     
                           t: ADD_INT     R8.w,  R6.w,  R1.z     
                       16  x: FMA_64      ____,  PV15.y,  T0.y,  R2.y     
                           y: FMA_64      ____,  PV15.y,  T0.y,  R2.y     
                           z: FMA_64      ____,  PV15.y,  T0.y,  R2.y     
                           w: FMA_64      ____,  PV15.x,  T0.x,  R2.x     
                       17  x: FMA_64      ____,  PV16.y,  T0.y,  R2.w     
                           y: FMA_64      ____,  PV16.y,  T0.y,  R2.w     
                           z: FMA_64      ____,  PV16.y,  T0.y,  R2.w     
                           w: FMA_64      ____,  PV16.x,  T0.x,  R2.z     
                       18  x: FMA_64      ____,  PV17.y,  T0.y,  R4.y     
                           y: FMA_64      ____,  PV17.y,  T0.y,  R4.y     
                           z: FMA_64      ____,  PV17.y,  T0.y,  R4.y     
                           w: FMA_64      ____,  PV17.x,  T0.x,  R4.x     
                       19  x: FMA_64      ____,  PV18.y,  T0.y,  R4.w     
                           y: FMA_64      ____,  PV18.y,  T0.y,  R4.w     
                           z: FMA_64      ____,  PV18.y,  T0.y,  R4.w     
                           w: FMA_64      ____,  PV18.x,  T0.x,  R4.z     
                       20  x: FMA_64      ____,  PV19.y,  T0.y,  R0.y     
                           y: FMA_64      ____,  PV19.y,  T0.y,  R0.y     
                           z: FMA_64      ____,  PV19.y,  T0.y,  R0.y     
                           w: FMA_64      ____,  PV19.x,  T0.x,  R0.x     
                       21  x: FMA_64      ____,  PV20.y,  T0.y,  R0.w     
                           y: FMA_64      ____,  PV20.y,  T0.y,  R0.w     
                           z: FMA_64      ____,  PV20.y,  T0.y,  R0.w     
                           w: FMA_64      ____,  PV20.x,  T0.x,  R0.z     
                       22  x: FMA_64      ____,  PV21.y,  T0.y,  R5.y     
                           y: FMA_64      ____,  PV21.y,  T0.y,  R5.y     
                           z: FMA_64      ____,  PV21.y,  T0.y,  R5.y     
                           w: FMA_64      ____,  PV21.x,  T0.x,  R5.x     
                       23  x: FMA_64      ____,  PV22.y,  T0.y,  R5.w     
                           y: FMA_64      ____,  PV22.y,  T0.y,  R5.w     
                           z: FMA_64      ____,  PV22.y,  T0.y,  R5.w     
                           w: FMA_64      ____,  PV22.x,  T0.x,  R5.z     
                       24  x: FMA_64      ____,  PV23.y,  T0.y,  R1.y     
                           y: FMA_64      ____,  PV23.y,  T0.y,  R1.y     
                           z: FMA_64      ____,  PV23.y,  T0.y,  R1.y     
                           w: FMA_64      ____,  PV23.x,  T0.x,  R1.x     
                       25  x: MUL_64      ____,  PV24.y,  R6.y     
                           y: MUL_64      ____,  PV24.y,  R6.y     
                           z: MUL_64      ____,  PV24.y,  R6.y     
                           w: MUL_64      ____,  PV24.x,  R6.x