I'm not sure this is the best forum for this question, so please move it if I guessed wrong. Since my previous related question got moved here, it is easier to continue here.

I just wrote some routines that you might call "vector functions". There will be a whole bunch of them eventually, but to start I wrote a function that computes the sine of the four f64 (double-precision) values in an f64vec4 variable, and another function that computes the cosine of the four f64 values in an f64vec4 variable.

I wrote these functions in assembly language, taking full advantage of the latest SIMD/AVX/FMA instructions the AMD Bulldozer CPUs offer. You can see the assembly language for one of these two working functions in my previous message. I'm pretty sure that's the latest, greatest working code (only edited to keep the comments from wrapping, which makes the listing ugly and unreadable).

Here is my problem. I wrote 16 versions of the 4 * sine routine and 16 versions of the 4 * cosine routine. They are all based upon Chebyshev approximations, but some have 7, 8, 9, 10, or 11 coefficients. I also wrote versions with fused multiply-add and with separate multiply and add. I also wrote versions with vbroadcastsd and versions that load 4 copies of the coefficients in parallel. I tried just about every variation I could think of --- but still have a "precision problem".

Here is what I face. I call all these functions and get return values. I also call the C standard math library **sin()** and **cos()** functions. I also call my own **math_sin()** and **math_cos()** functions that I wrote in assembly language, which just shove the values into the ancient 80x87 FPU and execute the FPU sin and cos instructions. I also call equivalent functions I wrote in C code. I also call **amd_sin()** and **amd_cos()** in the AMD math library. Here is what the results show over a range of inputs:

First, the **sin()**, **math_sin()** and **amd_sin()** functions always return the same values to the final least significant bit, and ditto for the **cos()**, **math_cos()** and **amd_cos()** functions. I don't recall ever seeing even one bit of difference, and I print out the XOR of the 64-bit values (as 64-bit integers) as one measure of the difference between the results (because it makes it easy to judge the difference in bits).

Within the range 0.0000 to PI/2, where these functions are supposed to work best, the results of my 16+ Chebyshev computations *sometimes* agree to the last bit. But for a fair percentage of input values the error varies between 1 bit and **7~8 bits**, and rarely a difference of up to **12 bits** appears. The following is a printout of one of these egregious lines:

** input angle (radians) : result of sin() result of test function : result of cos() result of test function : s^ c^ : (stst*stst) + (ctst*ctst)**

** 0.010000000000000002 : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 : 0 0 : 1.000000000000000000**

** 0.010000000000000002 : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 : 0 0 : 1.000000000000000000**

** 0.010000000000000002 : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 : 0 0 : 1.000000000000000000**

** 0.010000000000000002 : 0.009999833334166666 0.009999833334166659 : 0.999950000416665263 0.999950000416663376 : c 33 : 0.999999999999996225**

** 0.010000000000000002 : 0.009999833334166666 0.009999833334166645 : 0.999950000416665263 0.999950000416665263 : ff4 0 : 1.000000000000000000**

** 0.010000000000000002 : 0.009999833334166666 0.009999833334166675 : 0.999950000416665263 0.999950000416665263 : 5 0 : 1.000000000000000000**

** 0.010000000000000002 : 0.009999833334166666 0.009999833334166675 : 0.999950000416665263 0.999950000416665263 : 5 0 : 1.000000000000000000**

The two columns with the most whitespace around them are an XOR of the bit patterns of the result of the sine and cosine routines versus the results of **sin()** and **cos()**... which are the results of the FPU instructions. The 1st column is the input angle in radians. The 2nd column is the result of **sin()**. The 3rd column is the result of the routine being tested. The 4th and 5th columns are the same as the 2nd and 3rd, except for cosine instead of sine. The 6th and 7th columns are the ones to watch: they contain the bitwise XOR of the result of the tested sine and cosine function against the result of sin() and cos(), respectively. The 8th column is simply the sum of the squares of the sine and cosine results, which should always equal 1.0000000000 if the sine and cosine are correct. The 9th column is the function name for each row, which didn't fit in this forum format, so I list the order below:

**sin() and cos() routines in C standard math library**

**my old assembly language routines that execute FPU sin and cos instructions**

**amd_sin() and amd_cos() in the AMD math library (which presumably execute bulldozer-specific instructions on my FX8150)**

**my medium precision routines written in C (about 9 coefficients)**

**my high precision routines written in C (about 11 coefficients)**

**my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 1st result**

**my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 2nd result**

**my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 3rd result**

**my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 4th result**

**my hand coded SIMD/AVX/FMA assembly-language computing 7 coefficient chebyshev**

**my hand coded SIMD/AVX/FMA assembly-language computing 8 coefficient chebyshev**

**my hand coded SIMD/AVX/FMA assembly-language computing 9 coefficient chebyshev**

**my hand coded SIMD/AVX/FMA assembly-language computing 10 coefficient chebyshev**

**my hand coded SIMD/AVX/FMA assembly-language computing 11 coefficient chebyshev**

**my hand coded SIMD/AVX/FMA assembly-language computing 11 coefficient chebyshev with broadcastsd and without fused multiply-add**

Without including a whole slew of these printouts as testimony, I can say without any doubt that the **sin_c09()** and **cos_c09()** routines print out the smallest difference from **sin()** and **cos()** of any of my routines, and as a result are what I adopted in my 4 * sin and 4 * cos routines. Just to start, this situation makes no sense --- on the surface, anyway --- because the 10 and 11 coefficient routines are supposed to be more precise than the 7, 8 and 9 coefficient routines. Why would my 9-coefficient routine better match the "gold standard" FPU and AMD routines... unless by some accident that's also what they chose --- maybe even [almost] the exact same coefficients???

However, "what they chose" is not the same statement as "the correct answer". At the moment, I have no way to verify the AMD and Intel claims that their routines are correct to within 1 bit (or 2 bits). As a matter of *fact*, the C library functions, the 80x87 FPU instructions, and the **amd_sin()** and **amd_cos()** functions always generate the same result. I've never noticed them differ by even one bit. This even though the FPU routine computes internally with 80-bit values, while **amd_sin()** and **amd_cos()** definitely compute with SIMD/AVX instructions, which are limited to 64 bits at every computation. Of course the 80-bit computations are somewhat bogus, because the input angles and the output results are truncated to 64 bits like all the routines. Still, I find it amazing, just amazing, that the FPU routine and the AMD functions are always identical, and my 9-coefficient routine is almost always identical... once in a great while different by 1 bit... and very rarely off by 2 bits. Here is another printout at a different angle for flavor.

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408250 : 0.998078615601782904 0.998078615601781349 : 3fb 12 : 0.999999999999997002**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408285 : 0.998078615601782904 0.998078615601782904 : 0 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408160 : 0.998078615601782904 0.998078615601782904 : 3ee 0 : 1.000000000000000000**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408340 : 0.998078615601782904 0.998078615601782904 : 8 0 : 1.000000000000000222**

** 0.062000000000000048 : 0.061960286300408285 0.061960286300408340 : 0.998078615601782904 0.998078615601782904 : 8 0 : 1.000000000000000222**

Here we see the 8-coefficient and 10-coefficient sine routines generate grossly different results, but modest or zero error in the cosine results. We also see the 11-coefficient sine routine generating poor results. But the wimpy 7-coefficient sine and cosine are "dead on" to the final bit, as is my trusty 9-coefficient routine.

All I can say, the more I scroll through a console window full of 100s of these outputs at different angles, is... ***.

Before you float your favorite theory in response, consider the following table of results:


** 0.102000000000000077 : 0.101823223983945582 0.101823223983945582 : 0.994802508570176047 0.994802508570176047 : 0 0 : 0.999999999999999889**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945582 : 0.994802508570176047 0.994802508570176047 : 0 0 : 0.999999999999999889**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945582 : 0.994802508570176047 0.994802508570176047 : 0 0 : 0.999999999999999889**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 : 1 0 : 0.999999999999999889**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 : 1 0 : 0.999999999999999889**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 : 1 0 : 0.999999999999999889**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176158 : 1 1f : 1.000000000000000222**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945541 : 0.994802508570176047 0.994802508570175159 : 5 8 : 0.999999999999998113**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945388 : 0.994802508570176047 0.994802508570176047 : e 0 : 0.999999999999999889**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945665 : 0.994802508570176047 0.994802508570176047 : 1fa 0 : 1.000000000000000000**

** 0.102000000000000077 : 0.101823223983945582 0.101823223983945665 : 0.994802508570176047 0.994802508570176047 : 1fa 0 : 1.000000000000000000**

I call your attention to the large errors listed for the sine in the 11-coefficient routines. My question is this. If the FPU and AMD sine and cosine routines are so perfect, then why does the sum of the squares AKA ((sine*sine) + (cosine*cosine)) come out perfect (exactly 1.0000) for the 10 and 11 coefficient values, but not for the "perfect" FPU and AMD values?

Like I said before, ***.

Maybe I'm being too "harsh". Maybe I should not expect better. But I don't believe this. A difference of 9 freaking bits in a very non-extreme case (angle = 0.102 radians, which is roughly 6 degrees). And we have the simplest possible verification of these supposedly "wrong" values saying "correct", while that same verification test says "no cigar" to the "perfect" routines.

I am confused in several ways. Here is a sampling.

#1: Why do the AMD results computed in the SIMD/AVX registers ** always** agree to the very last bit with the ancient FPU values?

#2: Why do the more precise (more coefficients) versions of the Chebyshev computations generate less precise results than the 9-coefficient version (taking "agreement with FPU and AMD" as our definition of precise)?

#3: Why is my 9-coefficient routine --- whether executed purely in C code, or executed with one input value in xmm registers, or executed with 4 input values simultaneously in ymm registers --- always closer to the FPU and AMD results?

#4: Why does my straightforward verification test claim the *supposedly-wrong-by-9-bits* results of the 11-coefficient routine are "exactly correct", then turn around and claim the supposedly **exact** answers out of the FPU and AMD standard routines "slightly fail" the verification test?

I realize this is not a forum for "numerical analyst wizards". Frankly, if someone can point me to the appropriate forum, I'd appreciate that. But being a hardcore programmer and long time AMD fan, I thought I'd start here. Who knows, maybe the dudes who wrote the AMD math library watch this forum. Not likely, but who knows? Any math wizards out there?

But then there is the other possible issue --- the chance these problems aren't a math-wizard problem (!!!!! and I am certainly no math wizard!!!!!), but a problem with these "gold standard" trig routines. Unfortunately, I am definitely not enough of a math wizard to write routines that get higher precision in order to test my double-precision routines, the FPU routines, and the AMD math library routines.

I've beaten myself up quite a bit looking for answers. I've double, triple and quadruple checked all my constants against other sources.

Anyone else have any brilliant ideas?

Hi maxreason,

Agree, it doesn't make sense that your test produces the same number for different inputs! When physics kept saying the speed of light can't possibly be the same regardless of your motion, Einstein said, "Um, maybe it's your watch". So maybe it's your test. S*S + C*C produces two kinds of rounding error, both of which can contribute to the problem. Addition is particularly prone to error when adding numbers of different magnitudes, as in your first case, where you take the sin and cos of a number near 0.

A better test is probably what you already have: the sum of absolute XOR differences, which is pretty precise. As for a standard, you can use most any multi-precision calculator as a reference, and there are several C examples out there you can link into your work. The following is from the Windows calculator and shows how rounding has affected the test for the 10-coefficient routine.

                                         AMD/Intel            xxx_c10()
sin                                      0.009999833334166666 0.009999833334166645
cos                                      0.999950000416665263 0.999950000416665263
(stest*stest) + (ctest*ctest)            1.000000000000000000 1.000000000000000000
'true' sin^2+cos^2, truncated
from 40-digit calculator                 0.999999999999999970 0.999999999999999970

Full 40-digit calculator results for the sin*sin + cos*cos test:

0.99999999999999997046731085805641 AMD/Intel 40-digit sin*sin+cos*cos
0.99999999999999997004731785802142 xxx_c10() 40-digit sin*sin+cos*cos

As for the lower accuracy of the longer series: if they use the same series and coefficients, then maybe rounding error comes into play here too, i.e., if you already have an accurate answer, adding more terms makes it worse. You might need to control rounding in the SSE/AVX unit. In gcc you can use these functions to program its rounding and denormal behavior; I don't remember the flags, but you can look them up.

#include <immintrin.h>

oldflags = __builtin_ia32_stmxcsr();
__builtin_ia32_ldmxcsr(newflags);

Nice to see people still programming computers. You should try the GPUs if you haven't already. I just discovered that the 7970's hardware float sin instruction (4 clocks, not microcode) has near full float accuracy. Earlier GPUs had very low accuracy, just for graphics. And it's 4-quadrant.