cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

coggy
Journeyman III

Manual implementation of trigonometric functions on doubles

Hi, I'm currently using OpenCL on a Radeon HD 5870 for an academic project doing calculations on doubles. As the only operations currently supported on doubles are addition, subtraction and multiplication, I am thinking about implementing at least cos(), possibly other functions manually.

Does anyone have any experience implementing trigonometric functions using OpenCL, e.g. the Cordic algorithm? Can I expect to achieve reasonable performance this way?

Thanks for any answers.

0 Likes
5 Replies
rick_weber
Adept II

You'll probably have to do it using the Newton-Raphson method (an iterative method), and the existing single-precision trig functions. In addition to some subtracts, and divides, you'll have to call the single precision version several times. So, to answer your question, I would expect a double-precision trig function to be an order of magnitude or so slower than single.

http://en.wikipedia.org/wiki/Newton%27s_method

0 Likes

The current single precision trig functions are implemented without using the trig instructions. The trig instructions are not accurate enough to meet the OpenCL requirements (they're about 2e-3, but opencl requires about 4e-7 max error).

Sin and cos are best implemented with a taylor series expansion. I believe 4 terms is all that is necessary to achieve OpenCL accuracy requirements for float, and about 10 terms for double. You'll need to do argument reduction and expand either sin or cos based on the octant...

Best resource I have found for the math routines you might want is the Cephes mathematical library. Google it and you'll find very accurate implementations of all the functions you might need.

 

Malcolm

 

0 Likes
fxxyjslqd
Journeyman III

I met the same problem. Here is my implementation.

A taylor series expansion for sin(x) in MATLAB as follows.

fsin=@(x) x.^21/51090942171709440000 - x.^19/121645100408832000 + x.^17/355687428096000 - x.^15/1307674368000 + x.^13/6227020800 - x.^11/39916800 + x.^9/362880 - x.^7/5040 + x.^5/120 - x.^3/6 + x;

Pa[] is a double2 arrary in constant memory.

Pa[20].x ... Pa[24].y={1/51090942171709440000.0,-1/121645100408832000.0,1/355687428096000.0,-1/1307674368000.0,1/6227020800.0,-1/39916800.0,1/362880.0,-1/5040.0,1/120.0,-1/6.0}

double sinf(double x, __constant double2 *Pa)
{
 double x2=x*x;
 double xk;
 xk = mad(x2,Pa[20].x,Pa[20].y);
 xk = mad(xk,x2,Pa[21].x);
 xk = mad(xk,x2,Pa[21].y);
 xk = mad(xk,x2,Pa[22].x);
 xk = mad(xk,x2,Pa[22].y);
 xk = mad(xk,x2,Pa[23].x);
 xk = mad(xk,x2,Pa[23].y);
 xk = mad(xk,x2,Pa[24].x);
 xk = mad(xk,x2,Pa[24].y);
 xk = mad(xk,x2,1.0f)*x;
 return xk;
}

I use double2 instead of double to minimize fetching operations(5 fetches instead of 10 fetches).

If you wana compute cos(x), you can using the following formula

cos(x)=sinf(pi/2-x).

the max error for x

[0,pi/2]  sin(x):1.1102e-016   cos(x):1.1102e-016

[0,pi]     sin(x)  1.0348e-011  cos(x)  1.1102e-016

[-pi/2,pi/2] sin(x)  1.1102e-016 cos(x)  1.0348e-011

 

 

0 Likes

After  the horner transformation

fsin(x)=x*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2/51090942171709440000 - 1/121645100408832000) + 1/355687428096000) - 1/1307674368000) + 1/6227020800) - 1/39916800) + 1/362880) - 1/5040) + 1/120) - 1/6) + 1)

MUL_64  2 ops

FMA_64  10 ops

total       12 ALU + 5Fetches

If you do this computation many times, you need to put the constant in GPR. However, it seems that the compiler always try to minimize the usage of GPR. So I cann't make it happen.

The ALU operation listed as follows

02 ALU: ADDR(58) CNT(52)
     14  x: MUL_64      T0.x,  R6.y,  R6.y     
         y: MUL_64      T0.y,  R6.y,  R6.y     
         z: MUL_64      ____,  R6.y,  R6.y     
         w: MUL_64      ____,  R6.x,  R6.x     
         t: ADD_INT     R8.z,  R6.z,  R1.z     
     15  x: FMA_64      ____,  PV14.y,  R3.y,  R3.w      VEC_021
         y: FMA_64      ____,  PV14.y,  R3.y,  R3.w      VEC_021
         z: FMA_64      ____,  PV14.y,  R3.y,  R3.w      VEC_021
         w: FMA_64      ____,  PV14.x,  R3.x,  R3.z     
         t: ADD_INT     R8.w,  R6.w,  R1.z     
     16  x: FMA_64      ____,  PV15.y,  T0.y,  R2.y     
         y: FMA_64      ____,  PV15.y,  T0.y,  R2.y     
         z: FMA_64      ____,  PV15.y,  T0.y,  R2.y     
         w: FMA_64      ____,  PV15.x,  T0.x,  R2.x     
     17  x: FMA_64      ____,  PV16.y,  T0.y,  R2.w     
         y: FMA_64      ____,  PV16.y,  T0.y,  R2.w     
         z: FMA_64      ____,  PV16.y,  T0.y,  R2.w     
         w: FMA_64      ____,  PV16.x,  T0.x,  R2.z     
     18  x: FMA_64      ____,  PV17.y,  T0.y,  R4.y     
         y: FMA_64      ____,  PV17.y,  T0.y,  R4.y     
         z: FMA_64      ____,  PV17.y,  T0.y,  R4.y     
         w: FMA_64      ____,  PV17.x,  T0.x,  R4.x     
     19  x: FMA_64      ____,  PV18.y,  T0.y,  R4.w     
         y: FMA_64      ____,  PV18.y,  T0.y,  R4.w     
         z: FMA_64      ____,  PV18.y,  T0.y,  R4.w     
         w: FMA_64      ____,  PV18.x,  T0.x,  R4.z     
     20  x: FMA_64      ____,  PV19.y,  T0.y,  R0.y     
         y: FMA_64      ____,  PV19.y,  T0.y,  R0.y     
         z: FMA_64      ____,  PV19.y,  T0.y,  R0.y     
         w: FMA_64      ____,  PV19.x,  T0.x,  R0.x     
     21  x: FMA_64      ____,  PV20.y,  T0.y,  R0.w     
         y: FMA_64      ____,  PV20.y,  T0.y,  R0.w     
         z: FMA_64      ____,  PV20.y,  T0.y,  R0.w     
         w: FMA_64      ____,  PV20.x,  T0.x,  R0.z     
     22  x: FMA_64      ____,  PV21.y,  T0.y,  R5.y     
         y: FMA_64      ____,  PV21.y,  T0.y,  R5.y     
         z: FMA_64      ____,  PV21.y,  T0.y,  R5.y     
         w: FMA_64      ____,  PV21.x,  T0.x,  R5.x     
     23  x: FMA_64      ____,  PV22.y,  T0.y,  R5.w     
         y: FMA_64      ____,  PV22.y,  T0.y,  R5.w     
         z: FMA_64      ____,  PV22.y,  T0.y,  R5.w     
         w: FMA_64      ____,  PV22.x,  T0.x,  R5.z     
     24  x: FMA_64      ____,  PV23.y,  T0.y,  R1.y     
         y: FMA_64      ____,  PV23.y,  T0.y,  R1.y     
         z: FMA_64      ____,  PV23.y,  T0.y,  R1.y     
         w: FMA_64      ____,  PV23.x,  T0.x,  R1.x     
     25  x: MUL_64      ____,  PV24.y,  R6.y     
         y: MUL_64      ____,  PV24.y,  R6.y     
         z: MUL_64      ____,  PV24.y,  R6.y     
         w: MUL_64      ____,  PV24.x,  R6.x  

0 Likes

Thank you very much for all the helpful answers. I already implemented a relatively naive version based on the taylor series expansion. Your version, fxxyjslqdm should be orders of magnitude faster though, so I'll try your approach. Thanks again!

0 Likes