Question asked by maxreason on Jul 15, 2012
Latest reply on Jul 16, 2012 by maxreason

I'm not sure this is the best forum for this question, so please move it if I guessed wrong.  Since my previous related question got moved here, it is easier to continue here.

I just wrote some routines that you might call "vector functions".  There will eventually be a whole bunch of them, but to start I wrote a function that computes the sine of the four f64 (double-precision) values in an f64vec4 variable, and another that computes the cosine of the four f64 values in an f64vec4 variable.

I wrote these functions in assembly language, taking full advantage of the latest SIMD/AVX/FMA instructions the AMD Bulldozer CPUs offer.  You can see the assembly language for one of these two working functions in my previous message.  I'm pretty sure that's the latest, greatest working code (edited only to keep the comments from wrapping, which made the listing ugly and hard to read).

Here is my problem.  I wrote 16 versions of the 4 * sine routine and 16 versions of the 4 * cosine routine.  They are all based on Chebyshev approximations, but with varying numbers of coefficients: 7, 8, 9, 10, or 11.  I also wrote versions with fused multiply-add and with separate multiply and add, and versions that vbroadcastsd each coefficient versus loading 4 copies of the coefficients in parallel.  I tried just about every variation I could think of --- but still have a "precision problem".

Here is what I face.  I call all these functions and get return values.  I also call the C standard math library sin() and cos() functions.  I also call my own math_sin() and math_cos() functions that I wrote in assembly language that just shove the values into the ancient 80x87 FPU and execute the FPU sin and cos instructions.  I also call equivalent functions I wrote in C code.  I also call amd_sin() and amd_cos() in the AMD math library.  Here is what the results show over a range of inputs:

First, the sin(), math_sin() and amd_sin() functions always return the same values down to the least significant bit, and ditto for the cos(), math_cos() and amd_cos() functions.  I don't recall ever seeing even a one-bit difference.  I print out the XOR of the 64-bit values (as 64-bit integers) as one measure of the difference between results, because it makes it easy to judge the difference in bits.
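For concreteness, that XOR comparison is just this --- reinterpret each 64-bit double as an integer and XOR the pair (a sketch; `memcpy` is the portable way to get at the bits in C):

```c
#include <stdint.h>
#include <string.h>

/* Bitwise comparison of two doubles: reinterpret each 64-bit result as
 * an integer and XOR them.  For two finite values of the same sign and
 * similar magnitude, this is also (roughly) their distance in units in
 * the last place (ulps). */
static uint64_t bits_of(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);   /* memcpy avoids strict-aliasing trouble */
    return u;
}

static uint64_t xor_diff(double a, double b)
{
    return bits_of(a) ^ bits_of(b);
}
```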

Within the range 0.0000 to PI/2, where these functions are supposed to work best, the results of my 16+ Chebyshev computations sometimes agree to the last bit.  But for a fair percentage of input values the error varies between 1 bit and 7 or 8 bits, and rarely a difference of up to 12 bits appears.  The following is a printout of one of these egregious cases:

input angle (radians)   :             result of sin()              result of test function :           result of cos()               result of test function :  s^   c^       (stst*stst) + (ctst*ctst)

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0    0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166659 :  0.999950000416665263   0.999950000416663376 :    c    33     : 0.999999999999996225

0.010000000000000002 :  0.009999833334166666   0.009999833334166666 :  0.999950000416665263   0.999950000416665263 :    0      0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166645 :  0.999950000416665263   0.999950000416665263 :  ff4     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166675 :  0.999950000416665263   0.999950000416665263 :     5     0     : 1.000000000000000000

0.010000000000000002 :  0.009999833334166666   0.009999833334166675 :  0.999950000416665263   0.999950000416665263 :     5     0     : 1.000000000000000000

The columns are as follows.  The 1st is the input angle in radians.  The 2nd is the result of sin().  The 3rd is the result of the sine routine being tested.  The 4th and 5th are the same as the 2nd and 3rd, except for cosine instead of sine.  The 6th and 7th are the columns to watch: they contain the bitwise XOR (printed in hex) of the bit patterns of the tested sine and cosine results against the results of sin() and cos() respectively... which match the FPU instructions.  The 8th is the sum of the squares of the sine and cosine results, which should always equal 1.0000000000 if the sine and cosine are correct.  The 9th column, the function name for each row, didn't fit in this forum format, so I list the row order below:

1. sin() and cos() routines in the C standard math library
2. my old assembly-language routines that execute the FPU sin and cos instructions
3. amd_sin() and amd_cos() in the AMD math library (which presumably execute Bulldozer-specific instructions on my FX-8150)
4. my medium-precision routines written in C (about 9 coefficients)
5. my high-precision routines written in C (about 11 coefficients)
6. my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 1st result
7. my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 2nd result
8. my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 3rd result
9. my hand-coded SIMD/AVX/FMA assembly language computing 4 results simultaneously - 4th result
10. my hand-coded SIMD/AVX/FMA assembly language computing a 7-coefficient Chebyshev
11. my hand-coded SIMD/AVX/FMA assembly language computing an 8-coefficient Chebyshev
12. my hand-coded SIMD/AVX/FMA assembly language computing a 9-coefficient Chebyshev
13. my hand-coded SIMD/AVX/FMA assembly language computing a 10-coefficient Chebyshev
14. my hand-coded SIMD/AVX/FMA assembly language computing an 11-coefficient Chebyshev
15. my hand-coded SIMD/AVX/FMA assembly language computing an 11-coefficient Chebyshev with vbroadcastsd and without fused multiply-add

Without including a whole slew of these printouts as testimony, I can say without any doubt that the sin_c09() and cos_c09() routines show the smallest difference from sin() and cos() of any of my routines, and as a result they are what I adopted in my 4 * sin and 4 * cos routines.  Just to start, this situation makes no sense --- on the surface, anyway --- because the 10- and 11-coefficient routines are supposed to be more precise than the 7-, 8- and 9-coefficient routines.  Why would my 9-coefficient routine better match the "gold standard" FPU and AMD routines... unless by some accident that's also what they chose --- maybe even [almost] the exact same coefficients???

However, "what they chose" is not the same statement as "the correct answer".  At the moment, I have no way to verify the AMD and Intel claims that their routines are correct to within 1 bit (or 2 bits).  As a matter of fact, the C library functions, the 80x87 FPU instructions, and the amd_sin() and amd_cos() functions always generate the same result --- I've never noticed them differ by even one bit.  This even though the FPU computes internally with 80-bit values, while amd_sin() and amd_cos() definitely compute with SIMD/AVX instructions, which are limited to 64 bits at every step.  Of course the 80-bit computations are somewhat bogus, because the input angles and the output results are truncated to 64 bits like in all the routines.  Still, I find it amazing, just amazing, that the FPU routine and the AMD functions are always identical, and my 9-coefficient routine is almost always identical... once in a great while different by 1 bit... and very rarely off by 2 bits.  Here is another printout at a different angle, for flavor.

0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408250 :  0.998078615601782904   0.998078615601781349 :  3fb   12  : 0.999999999999997002
0.062000000000000048 :  0.061960286300408285   0.061960286300408285 :  0.998078615601782904   0.998078615601782904 :      0      0  : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408160 :  0.998078615601782904   0.998078615601782904 :  3ee     0  : 1.000000000000000000
0.062000000000000048 :  0.061960286300408285   0.061960286300408340 :  0.998078615601782904   0.998078615601782904 :      8      0  : 1.000000000000000222
0.062000000000000048 :  0.061960286300408285   0.061960286300408340 :  0.998078615601782904   0.998078615601782904 :      8      0  : 1.000000000000000222

Here we see the 8-coefficient and 10-coefficient sine routines generate grossly different results, but modest or zero error cosine results.  We also see the 11-coefficient sine routine generating horrific results.  But the wimpy 7-coefficient sine and cosine are "dead on" to the final bit, as is my trusty 9-coefficient routine.

All I can say, the more I scroll through a console window full of 100s of these outputs at different angles, is... ***.

Before you float your favorite theory in response, consider the following table of results:

0.102000000000000077 :  0.101823223983945582   0.101823223983945582 :  0.994802508570176047   0.994802508570176047 :     0     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945582 :  0.994802508570176047   0.994802508570176047 :     0     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945582 :  0.994802508570176047   0.994802508570176047 :     0     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945582 :  0.994802508570176047   0.994802508570176047 :     0     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945596 :  0.994802508570176047   0.994802508570176047 :     1     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945596 :  0.994802508570176047   0.994802508570176047 :     1     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945596 :  0.994802508570176047   0.994802508570176047 :     1     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945596 :  0.994802508570176047   0.994802508570176047 :     1     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945596 :  0.994802508570176047   0.994802508570176047 :     1     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945596 :  0.994802508570176047   0.994802508570176158 :     1   1f    : 1.000000000000000222

0.102000000000000077 :  0.101823223983945582   0.101823223983945541 :  0.994802508570176047   0.994802508570175159 :     5     8    : 0.999999999999998113

0.102000000000000077 :  0.101823223983945582   0.101823223983945596 :  0.994802508570176047   0.994802508570176047 :     1     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945388 :  0.994802508570176047   0.994802508570176047 :     e     0    : 0.999999999999999889

0.102000000000000077 :  0.101823223983945582   0.101823223983945665 :  0.994802508570176047   0.994802508570176047 :  1fa    0    : 1.000000000000000000

0.102000000000000077 :  0.101823223983945582   0.101823223983945665 :  0.994802508570176047   0.994802508570176047 :  1fa    0    : 1.000000000000000000

I call your attention to the large errors listed for the sine in the 11-coefficient routines.  My question is this: if the FPU and AMD sine and cosine routines are so perfect, then why does the sum of the squares, AKA ((sine*sine) + (cosine*cosine)), come out perfect (exactly 1.0000) for the 10- and 11-coefficient values, but not for the "perfect" FPU and AMD values?

Like I said before, ***.

Maybe I'm being too "harsh".  Maybe I should not expect better.  But I don't believe this.  A difference of 9 freaking bits on a very non-extreme case (angle = 0.102 radians, which is roughly 6 degrees).  And the simplest possible verification of these supposedly "wrong" values says "correct", while that same verification test says "no cigar" to the "perfect" routines.

I am confused in several ways.  Here is a sampling.

#1:  Why do the AMD results computed in the SIMD/AVX registers always agree to the very last bit with the ancient FPU values?

#2:  Why do the more precise (more-coefficient) versions of the Chebyshev computations generate less precise results than the 9-coefficient version (taking "agreement with FPU and AMD" as our definition of precise)?

#3:  Why is my 9-coefficient routine --- whether executed purely in C code, or executed with one input value in xmm registers, or executed with 4 input values simultaneously in ymm registers --- always closer to the FPU and AMD results?

#4:  Why does my straightforward verification test claim the supposedly-wrong-by-9-bits results of the 11-coefficient routine are "exactly correct", and then turn around and claim the supposedly exact answers from the FPU and AMD standard routines "slightly fail" the verification test?

I realize this is not a forum for "numerical analyst wizards".  Frankly, if someone can point me to a more appropriate forum, I'd appreciate it.  But being a hardcore programmer and long-time AMD fan, I thought I'd start here.  Who knows, maybe the dudes who wrote the AMD math library watch this forum.  Not likely, but who knows?  Any math wizards out there?

But then there is the other possible issue --- the chance these problems aren't a math-wizard problem (!!!!! and I am certainly no math wizard !!!!!), but a problem with these "gold standard" trig routines.  Unfortunately, I am definitely not enough of a math wizard to write higher-precision routines in order to test my double-precision routines, the FPU instructions, and the AMD math library routines.

I've beaten myself up quite a bit looking for answers.  I've double, triple and quadruple checked all my constants against other sources.

Anyone else have any brilliant ideas?