Archives Discussions

coggy · ‎08-05-2010

Hi, I'm currently using OpenCL on a Radeon HD 5870 for an academic project doing calculations on doubles. As the only operations currently supported on doubles are addition, subtraction and multiplication, I am thinking about implementing at least cos(), possibly other functions manually.

Does anyone have any experience implementing trigonometric functions using OpenCL, e.g. the Cordic algorithm? Can I expect to achieve reasonable performance this way?

Thanks for any answers.

rick_weber · ‎08-05-2010

You'll probably have to do it using the Newton-Raphson method (an iterative method), and the existing single-precision trig functions. In addition to some subtracts, and divides, you'll have to call the single precision version several times. So, to answer your question, I would expect a double-precision trig function to be an order of magnitude or so slower than single.

http://en.wikipedia.org/wiki/Newton%27s_method

malcolm3141 · ‎08-05-2010

The current single precision trig functions are implemented without using the trig instructions. The trig instructions are not accurate enough to meet the OpenCL requirements (they're about 2e-3, but opencl requires about 4e-7 max error).

Sin and cos are best implemented with a taylor series expansion. I believe 4 terms is all that is necessary to achieve OpenCL accuracy requirements for float, and about 10 terms for double. You'll need to do argument reduction and expand either sin or cos based on the octant...

Best resource I have found for the math routines you might want is the Cephes mathematical library. Google it and you'll find very accurate implementations of all the functions you might need.

Malcolm

fxxyjslqd · ‎08-06-2010

I met the same problem. Here is my implementation.

A taylor series expansion for sin(x) in MATLAB as follows.

fsin=@(x) x.^21/51090942171709440000 - x.^19/121645100408832000 + x.^17/355687428096000 - x.^15/1307674368000 + x.^13/6227020800 - x.^11/39916800 + x.^9/362880 - x.^7/5040 + x.^5/120 - x.^3/6 + x;

Pa[] is a double2 arrary in constant memory.

Pa[20].x ... Pa[24].y={1/51090942171709440000.0,-1/121645100408832000.0,1/355687428096000.0,-1/1307674368000.0,1/6227020800.0,-1/39916800.0,1/362880.0,-1/5040.0,1/120.0,-1/6.0}

double sinf(double x, __constant double2 *Pa)
{
double x2=x*x;
double xk;
xk = mad(x2,Pa[20].x,Pa[20].y);
xk = mad(xk,x2,Pa[21].x);
xk = mad(xk,x2,Pa[21].y);
xk = mad(xk,x2,Pa[22].x);
xk = mad(xk,x2,Pa[22].y);
xk = mad(xk,x2,Pa[23].x);
xk = mad(xk,x2,Pa[23].y);
xk = mad(xk,x2,Pa[24].x);
xk = mad(xk,x2,Pa[24].y);
xk = mad(xk,x2,1.0f)*x;
return xk;
}

I use double2 instead of double to minimize fetching operations(5 fetches instead of 10 fetches).

If you wana compute cos(x), you can using the following formula

cos(x)=sinf(pi/2-x).

the max error for x

[0,pi/2] sin(x):1.1102e-016 cos(x):1.1102e-016

[0,pi] sin(x) 1.0348e-011 cos(x) 1.1102e-016

[-pi/2,pi/2] sin(x) 1.1102e-016 cos(x) 1.0348e-011

fxxyjslqd · ‎08-06-2010

After the horner transformation

fsin(x)=x*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2*(x^2/51090942171709440000 - 1/121645100408832000) + 1/355687428096000) - 1/1307674368000) + 1/6227020800) - 1/39916800) + 1/362880) - 1/5040) + 1/120) - 1/6) + 1)

MUL_64 2 ops

FMA_64 10 ops

total 12 ALU + 5Fetches

If you do this computation many times, you need to put the constant in GPR. However, it seems that the compiler always try to minimize the usage of GPR. So I cann't make it happen.

The ALU operation listed as follows

02 ALU: ADDR(58) CNT(52)
     14 x: MUL_64      T0.x, R6.y, R6.y
         y: MUL_64      T0.y, R6.y, R6.y
         z: MUL_64      ____, R6.y, R6.y
         w: MUL_64      ____, R6.x, R6.x
         t: ADD_INT     R8.z, R6.z, R1.z
     15 x: FMA_64      ____, PV14.y, R3.y, R3.w      VEC_021
         y: FMA_64      ____, PV14.y, R3.y, R3.w      VEC_021
         z: FMA_64      ____, PV14.y, R3.y, R3.w      VEC_021
         w: FMA_64      ____, PV14.x, R3.x, R3.z
         t: ADD_INT     R8.w, R6.w, R1.z
     16 x: FMA_64      ____, PV15.y, T0.y, R2.y
         y: FMA_64      ____, PV15.y, T0.y, R2.y
         z: FMA_64      ____, PV15.y, T0.y, R2.y
         w: FMA_64      ____, PV15.x, T0.x, R2.x
     17 x: FMA_64      ____, PV16.y, T0.y, R2.w
         y: FMA_64      ____, PV16.y, T0.y, R2.w
         z: FMA_64      ____, PV16.y, T0.y, R2.w
         w: FMA_64      ____, PV16.x, T0.x, R2.z
     18 x: FMA_64      ____, PV17.y, T0.y, R4.y
         y: FMA_64      ____, PV17.y, T0.y, R4.y
         z: FMA_64      ____, PV17.y, T0.y, R4.y
         w: FMA_64      ____, PV17.x, T0.x, R4.x
     19 x: FMA_64      ____, PV18.y, T0.y, R4.w
         y: FMA_64      ____, PV18.y, T0.y, R4.w
         z: FMA_64      ____, PV18.y, T0.y, R4.w
         w: FMA_64      ____, PV18.x, T0.x, R4.z
     20 x: FMA_64      ____, PV19.y, T0.y, R0.y
         y: FMA_64      ____, PV19.y, T0.y, R0.y
         z: FMA_64      ____, PV19.y, T0.y, R0.y
         w: FMA_64      ____, PV19.x, T0.x, R0.x
     21 x: FMA_64      ____, PV20.y, T0.y, R0.w
         y: FMA_64      ____, PV20.y, T0.y, R0.w
         z: FMA_64      ____, PV20.y, T0.y, R0.w
         w: FMA_64      ____, PV20.x, T0.x, R0.z
     22 x: FMA_64      ____, PV21.y, T0.y, R5.y
         y: FMA_64      ____, PV21.y, T0.y, R5.y
         z: FMA_64      ____, PV21.y, T0.y, R5.y
         w: FMA_64      ____, PV21.x, T0.x, R5.x
     23 x: FMA_64      ____, PV22.y, T0.y, R5.w
         y: FMA_64      ____, PV22.y, T0.y, R5.w
         z: FMA_64      ____, PV22.y, T0.y, R5.w
         w: FMA_64      ____, PV22.x, T0.x, R5.z
     24 x: FMA_64      ____, PV23.y, T0.y, R1.y
         y: FMA_64      ____, PV23.y, T0.y, R1.y
         z: FMA_64      ____, PV23.y, T0.y, R1.y
         w: FMA_64      ____, PV23.x, T0.x, R1.x
     25 x: MUL_64      ____, PV24.y, R6.y
         y: MUL_64      ____, PV24.y, R6.y
         z: MUL_64      ____, PV24.y, R6.y
         w: MUL_64      ____, PV24.x, R6.x

coggy · ‎08-07-2010

Thank you very much for all the helpful answers. I already implemented a relatively naive version based on the taylor series expansion. Your version, fxxyjslqdm should be orders of magnitude faster though, so I'll try your approach. Thanks again!

Archives Discussions

Manual implementation of trigonometric functions on doubles