Archives Discussions

maxreason · ‎07-15-2012

I'm not sure this is the best forum for this question, so please move it if I guessed wrong. Since my previous related question got moved here, it is easier to continue here.

I just wrote some routines that you might call "vector functions". There will be a whole bunch of them eventually, but to start I wrote a function that computes the sine of the four f64 (double-precision) values in an f64vec4 variable, and another function that computes the cosine of the four f64 values in an f64vec4 variable.

I wrote these functions in assembly-language taking full advantage of the latest SIMD/AVX/FMA instructions the AMD bulldozer CPUs offer. You can see the assembly-language for one of these two working functions in my previous message. I'm pretty sure that's the latest, greatest working code (only edited to make the comments not-wrap and make the listing ugly and unreadable).

Here is my problem. I wrote 16 versions of the 4 * sine routine and 16 versions of the 4 * cosine routine. They are all based upon chebyshev approximations, but some have 7, 8, 9, 10, 11 coefficients. I also wrote versions with fused multiply-add and separated multiply add. I also wrote versions with vbroadcastsd and load 4 values of the coefficients in parallel. I tried just about every variation I could think of --- but still have a "precision problem".

Here is what I face. I call all these functions and get return values. I also call the C standard math library sin() and cos() functions. I also call my own math_sin() and math_cos() functions that I wrote in assembly language that just shove the values into the ancient 80x87 FPU and execute the FPU sin and cos instructions. I also call equivalent functions I wrote in C code. I also call amd_sin() and amd_cos() in the AMD math library. Here is what the results show over a range of inputs:

First, the sin(), math_sin() and amd_sin() functions always return the same values to the final least significant bit, and ditto for the cos(), math_cos() and amd_cos() functions. I don't recall ever seeing even one bit difference, and I print out the xor of the 64-bit values (as 64-bit integers) as one measure of the difference between the results (because it makes it easy to judge the difference in bits).

Within the range 0.0000 to PI/2, where these functions are supposed to work best, the results of my 16+ chebyshev computations sometimes agree to the last bit. But in a fair percentage of input values the error varies between 1-bit and 7~8 bits of error. And rarely a difference of up to 12-bits appears. The following is a printout of one of these egregious lines:

input angle (radians) : result of sin() result of test function : result of cos() result of test function : s^ c^ (stst*stst) + (ctst*ctst)

0.010000000000000002 : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 : 0 0 : 1.000000000000000000

0.010000000000000002 : 0.009999833334166666 0.009999833334166659 : 0.999950000416665263 0.999950000416663376 : c 33 : 0.999999999999996225

0.010000000000000002 : 0.009999833334166666 0.009999833334166666 : 0.999950000416665263 0.999950000416665263 : 0 0 : 1.000000000000000000

0.010000000000000002 : 0.009999833334166666 0.009999833334166645 : 0.999950000416665263 0.999950000416665263 : ff4 0 : 1.000000000000000000

0.010000000000000002 : 0.009999833334166666 0.009999833334166675 : 0.999950000416665263 0.999950000416665263 : 5 0 : 1.000000000000000000

The two lines with the most whitespace around them are an XOR of the bit patterns of the result of sine and cosine routines versus the results of sin() and cos()... which is the result of the FPU instructions. The 1st column is the input angle in radians. The 2nd column is the result of sin(). The 3rd column is the result of the routine being tested, which is named in the last column. The 4th and 5th columns are the same as the 2nd and 3rd except for cosine instead of sine. The 6th and 7th columns are the columns to watch. They contain the bitwise XOR of the result of the tested sine and cosine function against the result of sin() and cos() respectively. The 8th column is simply the sum of the squares of the sine and cosine results, which should always equal 1.0000000000 if the sine and cosine are correct. The 9th column is the function name for each row, which didn't fit in this forum format, so I list the order below:

sin() and cos() routines in C standard math library

my old assembly language routines that execute FPU sin and cos instructions

amd_sin() and amd_cos() in the AMD math library (which presumably execute bulldozer-specific instructions on my FX8150)

my medium precision routines written in C (about 9 coefficients)

my high precision routines written in C (about 11 coefficients)

my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 1st result

my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 2nd result

my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 3rd result

my hand coded SIMD/AVX/FMA assembly-language computing 4 results simultaneously - 4th result

my hand coded SIMD/AVX/FMA assembly-language computing 7 coefficient chebyshev

my hand coded SIMD/AVX/FMA assembly-language computing 8 coefficient chebyshev

my hand coded SIMD/AVX/FMA assembly-language computing 9 coefficient chebyshev

my hand coded SIMD/AVX/FMA assembly-language computing 10 coefficient chebyshev

my hand coded SIMD/AVX/FMA assembly-language computing 11 coefficient chebyshev

my hand coded SIMD/AVX/FMA assembly-language computing 11 coefficient chebyshev with broadcastsd and without fused multiply-add

Without including a whole slew of these printouts as testimony, I can say without any doubt that the sin_c09() and cos_c09() routines print out the smallest difference between sin() and cos() as any of my routines, and as a result are what I adopted in my 4 * sin and 4 * cos routines. Just to start, this situation makes no sense --- on the surface, anyway --- because the 10 and 11 coefficient routines are supposed to be more precise than the 7 and 8 and 9 coefficient routines. Why would my 9 coefficient routine better match the "gold standard" FPU and AMD routines... unless by some accident that's also what they chose --- maybe even [almost] the same exact coefficients???

However, "what they chose" is not the same statement as "the correct answer". At the moment, I have no way to verify the AMD and intel claims that their routines are correct to within 1 bit (or 2 bits). As a matter of fact, the C library functions, the 80x87 FPU instructions, and the amd_sin() and amd_cos() functions always generate the same result. I've never even noticed them different by one bit. This even though the FPU routine computes internally with 80-bit values, and the amd_sin() and amd_cos() definitely compute with SIMD/AVX instructions, which are limited to 64-bits at every computation. Of course the 80-bit computations are somewhat bogus, because the input angles and the output results are truncated to 64-bits like all the routines. Still, I find it amazing, just amazing, that the FPU routine and the AMD functions are always identical, and my 9-coefficient routine is almost always identical... once in a great while different by 1-bit... and very rarely off by 2-bits. Here is another printout at a different angle for flavor.

0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0     0   : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408250 : 0.998078615601782904   0.998078615601781349 : 3fb   12 : 0.999999999999997002
0.062000000000000048 : 0.061960286300408285   0.061960286300408285 : 0.998078615601782904   0.998078615601782904 :      0      0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408160 : 0.998078615601782904   0.998078615601782904 : 3ee     0 : 1.000000000000000000
0.062000000000000048 : 0.061960286300408285   0.061960286300408340 : 0.998078615601782904   0.998078615601782904 :      8      0 : 1.000000000000000222
0.062000000000000048 : 0.061960286300408285   0.061960286300408340 : 0.998078615601782904   0.998078615601782904 :      8      0 : 1.000000000000000222

Here we see the 8-coefficient and 10-coefficient sine routines generate grossly different results, but modest or zero error cosine results. We also see the 11-coefficient sine routine generating horrific results. But the wimpy 7-coefficient sine and cosine are "dead on" to the final bit, as is my trusty 9-coefficient routine.

All I can say is, the more I scroll through a console window full of 100s of these outputs at different angles is.... WTF.

Before you float your favorite theory in response, consider the following table of results:

0.102000000000000077 : 0.101823223983945582 0.101823223983945582 : 0.994802508570176047 0.994802508570176047 : 0 0 : 0.999999999999999889

0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 : 1 0 : 0.999999999999999889

0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176158 : 1 1f : 1.000000000000000222

0.102000000000000077 : 0.101823223983945582 0.101823223983945541 : 0.994802508570176047 0.994802508570175159 : 5 8 : 0.999999999999998113

0.102000000000000077 : 0.101823223983945582 0.101823223983945596 : 0.994802508570176047 0.994802508570176047 : 1 0 : 0.999999999999999889

0.102000000000000077 : 0.101823223983945582 0.101823223983945388 : 0.994802508570176047 0.994802508570176047 : e 0 : 0.999999999999999889

0.102000000000000077 : 0.101823223983945582 0.101823223983945665 : 0.994802508570176047 0.994802508570176047 : 1fa 0 : 1.000000000000000000

I call your attention to the large errors listed for the sine in the 11-coefficient routines. My question is this. If the FPU and AMD sine and cosine routines are so perfect, then why does the sum of the squares AKA ((sine*sine) + (cosine*cosine)) come out perfect (exactly 1.0000) for the 10 and 11 coefficient values, but not for the "perfect" FPU and AMD values?

Like I said before, WTF.

Maybe I'm being too "harsh". Maybe I should not expect better. But I don't believe this. A difference of 9 freaking bits on a very non-extreme case (angle = 0.102 radians which is roughly 6-degrees). And we have the simplest possible verification of these supposedly "wrong" values saying "correct", while that same verification test says "no cigar" to the "perfect" routines.

I am confused in several ways. Here is a sampling.

#1: Why do the AMD results computed in the SIMD/AVX registers always agree to the very last bit with the ancient FPU values?

#2: Why do the more precise (more co-efficients) versions of the chebyshev computations generates less precise results than the 9-coefficient version, (which takes "agreement with FPU and AMD" as our definition of precise).

#3: Why is my 9-coefficient routine, whether executed purely in C code, or executed with one input value and xmm registers, or executed with 4 input values simultaneously in ymm registers... always closer to the FPU and AMD results.

#4: Why does my straightforward verification test claim the supposedly-wrong-by-9-bits results of the 11-coefficient routine is "exactly correct" and then turn around and claim the supposedly exact answers out of the FPU and AMD standard routines "slightly fails" the verification test?

I realize this is not a forum for "numerical analyst wizards". Frankly, if someone can point me to the appropriate forum, I'd appreciate that. But being a hardcore programmer and long time AMD fan, I thought I'd start here. Who knows, maybe the dudes who wrote the AMD math library watch this forum. Not likely, but who knows? Any math wizards out there?

But then there is the other possible issue --- the chance these problems aren't a math-wizard problem (!!!!! and I am certainly no math wizard!!!!!), but a problem with these "gold standard" trig routines. Unfortunately, I am definitely not enough of a math wizard to write routines that get higher precision in order to test my double-precision routines, the FPU routines, and the AMD math library routines.

I've beaten myself up quite a bit looking for answers. I've double, triple and quadruple checked all my constants against other sources.

Anyone else have any brilliant ideas?

drallan · ‎07-15-2012

Hi maxreason,

Agree, it doesn't make sense that your test produces the same number for different input! When physics kept saying the speed of light can't possibly be the same regardless of your motion, Einstein said "Um, maybe it's your watch". So, maybe it's your test. S*S + C*C produces 2 kinds of rounding error both of which can contribute to the problem. Addition is particularly prone to error when adding numbers of different magnitude, as in your first case where you take the sin and cos of a number near 0.

A better test is probably what you already have, sum of absolute xor differences, that's pretty precise.

As for a standard, you can use most any multi precision calculator as a reference, and there several C examples out there so you can link to your work. The following is from windows calculator and shows how the rounding has affected the test for the 10 coefficient routine.

AMD/Intel xxx_c10()

0.009999833334166666 0.009999833334166645 sin

0.999950000416665263 0.999950000416665263 cos

1.000000000000000000 1.000000000000000000 (stest*stest) + (ctest*ctest)

0.999999999999999970 0.999999999999999970 'true' sin^2+cos^2 truncated from 40 digit calculator

full 40 digit calculator results for sin*sin+cos*cos test

0.999999999999999970.46731085805641 AMD/Intel 40 digit sin*sin+cos*cos

0.999999999999999970.04731785802142 xxx_c10() 40 digit sin*sin+cos*cos

As for lower accuracy of the longer series, if they use the same series and coefficients then maybe rounding error comes into play here too, i.e., if you already have an accurate answer, then add more terms and it gets worse. You might need to control rounding in the mmx/avx unit. In gcc you can use these functions to program the mmx unit rounding and denormals, I don't remember the flags but you can look them up.

#include <immintrin.h>

oldflags=__builtin_ia32_stmxcsr();

__builtin_ia32_ldmxcsr(newflags);

Nice to see people still programming computers. You should try the GPUs if you haven't already. I just discovered that the 7970's hardware float sin instruction (4 clocks, not microcode) has a near full float accuracy. Earlier GPUs had very low accuracy just for graphics. And, it's 4 quadrant.

maxreason · ‎07-16-2012

Yeah, that verification test probably isn't very good. But it was the only thing I could think of that didn't depend on knowing things I can't or don't know. I'm not sure whether an arbitrary digit calculator can help me. I'd have to know whether their sin() and cos() functions are based upon co-efficients like mine routines are. If so, then their routines are only as good as their coefficients, and how many digits are those? And how would we know? On the other hand, perhaps those arbitrary precision calculators adopt the taylor series technique, which is slower than chebyshev, but doesn't need any co-efficients (the algorithm more-or-less builds the co-efficients as it computes).

The problem with the XOR test is simply this. It simply presumes the result produced by the FPU sin() and cos(), and the amd_sin() and amd_cos() produce the correct answer. How do I know that, especially when supposedly very precise routines disagree (the ones with 10 or 11 coefficients). The XOR test only tells me how much different from the FPU and amd_xxx() functions. If I knew for certain they are correct, then the XOR test is valid. But again, how do I know that, especially given the answers all over the place that I'm seeing?

Yeah, I write 3D graphics engines "on the side". My latest one is based upon OpenGL and GLSL, so I've done a fair bit of shader programming. I haven't done much OpenCL, just some quick fooling around and a promise to get back to it. The problem with GPUs for me is that double precision is barely, just barely sufficient for any work I do. Since the GPUs drop dead on double precision (because they have far too few double-precision compute units in them), I haven't gotten that interested in OpenCL programming. But believe me, if someone releases a GPU that has double-precision compute hardware for every core --- well, I'll be writing OpenCL until my brain explodes.

GPUs are really great. But I hope they inspire AMD to start aggressively pushing the #-of-cores envelope on their CPUs. I'd like to see a 64-core unit next year, a 256-core unit in 2014, and 1024-core unit in 2015, a 4096-core unit in 2016, and a 65536-core unit in 2017. You listening, AMD? It is a bit of a travesty to need to do general purpose computing on freaking GPU cards!!! That should happen in the CPU, and leave the GPU for graphics. If they have to strip out everything that isn't 64-bits from the bulldozer to save transistors, then so be it. No point in 8-bit, 16-bit and 32-bit computing any more --- none whatsoever.

Archives Discussions

confusion about f64 precision