Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

Journeyman III

fastest way to implement SIMD/AVX conditional execution

Greetings and Aloha,

I've pretty much beat myself to death writing some efficient 256-bit parallel vector math functions like the following in 64-bit SIMD/AVX/FMA4 assembly language:

- math_sin_sin_sin_sin (f64vec4* result, f64vec4* input);

- math_sin_sin_cos_cos (f64vec4* result, f64vec4* input);

... and so forth

Now that these work pretty well over a small range (-pi/2 to +pi/2), I worry the "truly hard part" has arrived.  I need to add code at the top of these functions that essentially implements the following routine for all 4 input angles/arguments.  And hopefully without adding more CPU cycles than the trig routines themselves take... just to make sure the input arguments are within the appropriate range, and converting them when they are not:


// truncate angle into range -TWOPI < angle < +TWOPI


    if ((angle <= -MATH_TWOPI) || (angle >= MATH_TWOPI)) {

        dangle = angle * MATH_1DIVTWOPI;       // dangle == angle in units of 1 revolution AKA (2 * pi)

        intpart = math_integer_zero (dangle);

        angle = (dangle - intpart) * MATH_TWOPI;


    if (angle < 0.0000) { angle = angle + MATH_TWOPI; } // convert -neg angle into equivalent +pos angle


//    fold angle into range 0 <= angle <= 90 and generate appropriate sign to multiply by final result


    if (angle <= MATH_PIDIV2) {

        sign = +1.0000;

    } else if (angle <= MATH_PI) {

        sign = -1.0000;

        angle = MATH_PI - angle;

    } else if (angle <= (MATH_PI + MATH_PIDIV2)) {

        sign = -1.0000;

        angle = angle - MATH_PI;

    } else {

        sign = +1.0000;

        angle = MATH_TWOPI - angle;


I see and have these very cool SIMD/AVX/FMA4-level instructions:

  vcmppd     $imm8, %ymmX, %ymmS, %ymmD

Which let me perform the comparisons against -TWOPI or +TWOPI or PIDIV2 or PI or whatever value I need to compare against, and these instructions leave 0x0000000000000000 or 0xFFFFFFFFFFFFFFFF in the destination register when the specified comparison fails or passes.  Excellent!  So far so good.

But this is where I'm having trouble seeing my way to next steps to implement these routines on all four arguments.

I sorta see various potential strategies, but I don't see how to implement them efficiently.

For example, it would be nice if there was a way to simply skip past various work if all 4 arguments (in the 4 components of the tested ymm register) passed or failed the test.  I can see how doing two horizontal adds, plus one of the funky move instructions, plus another horizontal add could leave me with a zero or non-zero value in the bottom component of some ymm register to perform a conditional branch based upon --- after moving to the main CPU register set???  But seems to me, there must be a faster way than that!

Another fairly normal SIMD strategy is to execute all the instructions, but somewhere have a conditional move instruction that either moves the original value or the newly computed value to the destination based upon the 4-element ymm register mask generated by one of these vcmppd instructions.  Maybe I'm just blind, but I don't see any conditional move instructions at the moment.  That's strange, because I could swear I saw them once back a few months ago when I was reading about the new AVX / ymm instructions

I went browsing around looking for any articles on this topic, but I didn't find anything terribly helpful.

I will appreciate any help, even if it is simply a link or two to articles my searches missed.



Later:  After reviewing volume 4, I compiled a list of instructions that might help.  I'll list them below for anyone who eventually finds this thread, but I'm still interested in hearing discussions of "strategies" for ways to attack this situation.




















vpermil2pd --- too wacko/strange/complex for my puny brain !!!

10 Replies


If I understand well, you have an IF block, that you want to skip if all the SIMD lanes have failed a test before. And I assume that the IF block is costly and occuring rarely.

You can make the cmp first which is producing 000000 of FFFFFFF at each lanes. And then use that vmovmskpv instructiuon, which will gather the sign bits of all the 4 lanes, and put it into a general register, with that you can make a jump (jrcxz rcx or something).

>if (angle <= MATH_PIDIV2)     sign = +1.0000 else sign = 0.0;

This can be done with one CMP and then and AND with 1.0 float value. But since we have CMOV, this trick is kinda deprecated.

There's another trick when you can organize the rest of your algo around the CMP instruction, for example on the GPU I did an accumulator thing where the carry was accumulated with a negative sign (FFFFFFFs from the CMPs). (there was no subtract instruction for integers).

VPERMIL2PD -> first match on google says "VPERMIL2PS, VPERMIL2PD instructions also removed" :S Lol I've just realized that my sse3-4ish knowledge is a bit outdated. They even remove things from the good old "x86" o.O

Anyways, it's good to see that someone is fiddling with SIMD asm

Edit: VPERMIL2PD -> one more level of complication to the pshuffd, shuffps instructions But I'm 100% sure it have been invented for a specific use.

Journeyman III

Thanks for the comments.  I wish I had a better memory!  Sadly, my memory is terrible, and always has been.  A great memory is definitely an asset when programming with SIMD/AVX/FMA ymm-register style assembly language.

Actually, I quite enjoy programming in assembly language... and the crowd gasps in horror!  🙂

Maybe that's because assembly language is a "medium-level language" for me.  After writing boatloads of microcode and somewhat less FPGA, assembly-language is almost "too easy"... and the crowd pulls out pistols and shoots themselves in the head!  Hahaha.  I always get that reaction... dead bodies everywhere.  Hahahaha.  Good thing AMD doesn't earn their pennies on 2901s any more.  Hahaha.

It appears the vcmppd and vpcmov instructions will be my salvation.  One reason I failed to notice the value of the vpcmov instruction previously is the fact it is a bitwise instruction!  Good thing the vcmppd instruction sets or clears all bits in each 64-bit result it generates as a result of "pass" or "fail" of its comparison tests.  So much for "true = 1" and "false = 0".  Almost makes me remember BASIC, which does generate -1 (all 1s) for true (and accepts any non-zero as true).

I'm about half way to understanding how to totally gross code like that 4-level if-then-else-if-the-else-if-then-else mess in my original code (that only computed a result for one angle, not the 4 angles I must compute for.  I don't have the assembly written yet, but reformulating it the following way appears to have promise - first-cut (untested) assembly follows:


// ymm0.0123 == a0 : a1 : a2 : a3 == original input angles/arguments


vmovapd  k_2pi, %ymm15               // ymm15 (twopi)  = 2pi     : 2pi     : 2pi     : 2pi

vmovapd  k_3pidiv2, %ymm14           // ymm14 (3pi/2)  = 3pi/2   : 3pi/2   : 3pi/2   : 3pi/2

vmovapd  k_pi, %ymm13                // ymm13 (pi)     = pi      : pi      : pi      : pi

vmovapd  k_pidiv2, %ymm12            // ymm12 (pi/2)   = pi/2    : pi/2    : pi/2    : pi/2

vmovapd  k_p1, %ymm11                // ymm11 (4 * +1) = +1.00   : +1.00   : +1.00   : +1.00

vmovapd  k_m1, %ymm10                // ymm10 (4 * -1) = -1.00   : -1.00   : -1.00   : -1.00

vmovapd  %ymm00, %ymm09              // ymm09 (oangle) = a0      : a1      : a2      : a3

vmovapd  %ymm11, %ymm06              // ymm06 (sign)   = +1.00   : +1.00   : +1.00   : +1.00

vsubpd   %ymm09, %ymm13, %ymm08      // ymm08 (dangle) = pi-a0   : pi-a1   : pi-a2   : pi-a3

vcmppd   $GT, %ymm12, %ymm09, %ymm07 // ymm07 (mask)   = a0>pi/2 : a1>pi/2 : a2>pi/2 : a3>pi/2

vpcmov   ymm07, ymm00, ymm08, ymm00  // ymm00.n = ymm07 ? ymm00.n : ymm08.n

vpcmov   ymm07, ymm06, ymm10, ymm06  // ymm06.n = ymm07 ? ymm06.n : ymm10.n


// all 4 elements of %ymm00.n and %ymm06.n are now correct for quadrant 0 or 1


vsubpd   %ymm13, %ymm09, %ymm08      // ymm08 (dangle) = a0-pi   : a1-pi   : a2-pi   : a3-pi

vcmppd   %GT, %ymm13, %ymm09, %ymm07 // ymm07 (mask)   = a0>pi   : a1>pi   : a2>pi   : a3>pi

vpcmov   ymm07, ymm00, ymm08, ymm00  // ymm00.n = ymm07 ? ymm00.n : ymm08.n

vpcmov   ymm07, ymm06, ymm10, ymm06  // ymm06.n = ymm07 ? ymm07.n : ymm10.n
// the above line is not needed because sign in quadrants 1 and 2 are the same


// all 4 elements of %ymm00.n and %ymm06.n are now correct for quadrants 0 or 1 or 2


vsubpd   %ymm15, %ymm09, %ymm08      // ymm08 (dangle) = 2pi-a0  : 2pi-a1  : 2pi-a2  : 2pi-a3

vcmppd   %GT, %ymm14, %ymm09, %ymm07 // ymm07 (mask)   = a0>3pi/2: a1>3pi/2: a2>3pi/2: a3>3pi/2

vpcmov   ymm07, ymm00, ymm08, ymm00  // ymm00.n = ymm07 ? ymm00.n : ymm08.n

vpcmov   ymm07, ymm06, ymm11, ymm06  // ymm06.n = ymm07 ? ymm06.n : ymm11.n


As I reckon, the above computes that awful 4-section if-then-elseif-then-elseif-then-else section of code simultaneously for all 4 input angles/arguments.  Amazingly, I didn't even need the vcmppd instruction because subtracting pi/2 from test angle (tangle) at each step generates a negative value in tangle (ymm08.n.msb == 1) which controls the element-by-element conditional move.

Woops!  That doesn't work... because vpcmov is bitwise!  Okay, looks like a few more instructions might be required (but first I'll head to the instruction set again to see whether something controlled by msb exists).

Whew!  As it turns out, the vmaskmovpd is an alternative to the vpcmov instruction that works off the msb like I was thinking when I wrote the code.  Looks like someone else at AMD ran into the same coding situation.  🙂  So all the code above will be changed by the time anyone sees it, and realizes I was about to do something stupido!  Reputation saved in the nick of time!  🙂

Oh no!  The vmaskmovpd instruction doesn't have an all-registers variant like almost all other instructions.  I'm screwed!  😮  Now what?  Back to the vcpmpd and vpcmov technique, I guess.  But first, another grind through the instructions again.  But how could they have such a brilliant instruction, and not have an all-register version?  Boo, hoo.  😞

Oh well, just another day in the life of an assembly language programmer [dealing with necessarily wacko instructions to process simultaneous conditional execution on multiple elements].  It definitely looks like original vcpmpd and vpcmov technique is the best answer.  So time to fix the code yet again... sigh.  Later:  Okay, all the code above is re-written.  But beware, I have not tested it yet.  Surely I have the order of some registers reversed.

My final comment is --- I am pretty happy at the small number of instructions required to implement that computation for all 4 input angles/arguments.  Frankly, the 20 instructions I have above can be reduced by a few by not loading all those constants at the beginning, but instead making them memory arguments to later instructions.  However, this is almost always slower because the way I wrote it (and habituated myself to write) hides the cache/memory access latency by pre-fetching those values a while before they are needed.  It is soooooo nice to have 16 of those 256-bit wide SIMD registers!!!  🙂

IMPORTANT NOTE:  The above is written in linux/gas/att syntax, which has the instruction operands in REVERSE ORDER from intel syntax.  So if you have habituated intel syntax, my apologies --- and be sure to remember the destination is on the right.

intel order:  opcode  ymmD,       ymmS,  ymmX,  imm8/ymmM
-gas- order:  opcode  imm8/ymmM,  ymmX,  ymmS,  ymmD



Congrats! You're doing it good especially when you say you're new to this.

> doesn't have an all-registers variant like almost all other instructions

btw those three-operand non-destructive instructions are the new ones (came with the 256bit AVX instruction set), before that everything was two-operand where one of the operands always gets destroyed. So there's pretty much freedom nowadays (too bad I don't have an AVX capable processor hehe)

> constant loading:

When all the vector elements are the same, you van use VBROADCASTSD y,m64. It has better throughput that VMOVPD y,m256. (but of course, you have to combine it with another instr which is not using the memory read exec unit).

> instr level paralelism

You did a great job making your program in a way that there are no consequent instruction pairs which are data dependent. However the VSUBPD has a quiet big 5-6 latency. Maybe you can group those VSUBPDs to the beginning so it can hide some latency, you'll need 2 more regs for this. All of the instructions you're using (VPCMOV, VCMP, VSUBPD) has a peak throughput of 1, so proper grouping is essential.

> memory reads

Unless you're making this a loop, you can spare some registers and cycles via giving some job to the memory reader exec.unit and remove some VMOVAPD/VBROADCASTSDs. I guess there's a 256bit wide bus from the L1 cache, so you can read something from the cache in every single cycle. For example ymm11..ymm15 are used only once, you could reference then through the memory, and you'll have 5 spare regs. Unless you're not ruining the cache, you always have 1 constant reading per clock for free. Tho' there are some instructions which have 2x or 3x throughput per cycle, there you have to be careful with memory operands.

Maybe you can find some more regs and process 8 angles at a time where you can completely get rid of the VSUBPD's 5-6 cycle latency.


If you can avoid it, because it has very big latency (I guess because of the YMM <-> RAX transition)

>VCMPPD It's AMD exclusive I don't know it this useful bitselect thing has an equivalent on intel, but if not then it have to put together with and/or/andn instructions which is 3x slower (on the 128bit implementation it's, only 1.5x slower because 2 of them can be executed in one cycle. They spared some gates from AVX haha)

You can check all the instruction latencies and throughputs here ->

Also on that site there are excellent books about how to optimize on various 'x86' chips.

> It is soooooo nice to have 16 of those 256-bit wide SIMD registers!!!

Hahh I'm stuck at 8x128bit with two-operand destructive instructions, that's not a bad logic game either

But GPU is more fun: there you work with a 2048bit SIMD machine and usually you can have 84 res or can have 256 if you sacrifice some speed. You also have a scalar helper cpu thing with which you control program flow, coordinate memory transfers, or even you can do some simple integer calculations to help the SIMD unit. (it has lots of regs because memory access is relatively slow compared to x86)

Journeyman III

Okay, just for fun, here is the code that includes those problematic issues I asked about, plus the rest of a routine that computes the cosine of all four f64 angles in an 4-element f64vec4 vector.  The code works, but I still have a few "issues" to figure out.  For example, I have different routines to compute with 7, 8, 9, 10, 11 chebyshev coefficients.  In theory, the more coefficients, the more accurate the answer.  But for some reason, the 9 coefficient sine and cosine routines are the most accurate.  Strange.  I'm comparing my precision against the 80x87 sin and cos instructions and the amd_sin() and amd_cos() instructions (which always agree with the 80x87 instructions to the final bit.

Sorry the following wraps in this forum view --- you should have seen it before I cleaned it up.  When I inserted the code, it converted all tabs into four spaces and therefore broke the nice alignment that I sorta restored.  Best to cut and paste into a text editor and set your tab-stops at every 4 spaces.  Better yet, ask me for the code with the tabs still in and it will look better (and tell your editor not to wrap).

NOTE:  I have another version with vbroadcastsd instructions, which should be faster, but it didn't make much difference.  The reason I load the coefficients this way is because I'm planning to implement the following functions:

error = f64vec4_equal_sin_sin_sin_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_sin_sin_sin_cos_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_sin_sin_cos_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_sin_sin_cos_cos_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_sin_cos_sin_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_sin_cos_sin_cos_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_sin_cos_cos_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_sin_cos_cos_cos_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_sin_sin_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_sin_sin_cos_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_sin_cos_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_sin_cos_cos_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_cos_sin_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_cos_sin_cos_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_cos_cos_sin_f64vec4 (f64vec4* result, f64vec4* angles);

error = f64vec4_equal_cos_cos_cos_cos_f64vec4 (f64vec4* result, f64vec4* angles);

If I end up doing this, then I'd have separate coefficients for each function.  Right now the coefficients look like this:



# 9 term cosine function --- accurate to roughly 2.39e-16 over -pi/2 to +pi/2


.align    64


.double  +0.46123421496411967103700000000000000e-13, +0.46123421496411967103700000000000000e-13, +0.46123421496411967103700000000000000e-13,    +0.46123421496411967103700000000000000e-13

.double  -0.11463213408894234209400000000000000e-10, -0.11463213408894234209400000000000000e-10, -0.11463213408894234209400000000000000e-10,    -0.11463213408894234209400000000000000e-10

.double  +0.20876573491565493567600000000000000e-8,  +0.20876573491565493567600000000000000e-8,  +0.20876573491565493567600000000000000e-8,    +0.20876573491565493567600000000000000e-8

.double  -0.27557316616641921424000000000000000e-6,  -0.27557316616641921424000000000000000e-6,  -0.27557316616641921424000000000000000e-6,    -0.27557316616641921424000000000000000e-6

.double  +0.24801587279821183680800000000000000e-4,  +0.24801587279821183680800000000000000e-4,  +0.24801587279821183680800000000000000e-4,    +0.24801587279821183680800000000000000e-4

.double  -0.13888888888786869247600000000000000e-2,  -0.13888888888786869247600000000000000e-2,  -0.13888888888786869247600000000000000e-2,    -0.13888888888786869247600000000000000e-2

.double  +0.41666666666664276508000000000000000e-1,  +0.41666666666664276508000000000000000e-1,  +0.41666666666664276508000000000000000e-1,    +0.41666666666664276508000000000000000e-1

.double  -0.49999999999999978617200000000000000,     -0.49999999999999978617200000000000000,     -0.49999999999999978617200000000000000,        -0.49999999999999978617200000000000000

.double  +0.99999999999999999696100000000000000,     +0.99999999999999999696100000000000000,     +0.99999999999999999696100000000000000,        +0.99999999999999999696100000000000000

The above contains 4 copies each of the 9 coefficients for the following cosine function.  Ugly forum wrapping issues again.  Sorry.

The idea is, the above handles cos_cos_cos_cos().  There would be another table to handle cos_cos_cos_sin() with the first (or last?) of each group of 4 coefficients having a different value (the value to compute a sine instead of a cosine).

I'm not entirely certain I'll be doing things this way yet.  But at the very least, I'll have the following six variations:




Now all I need is someone to figure out why more than 9 coefficients doesn't produce greater precision.  I guess I would also like to know for certain that the FPU sin() and cos() and amd_sin() and amd_cos() are generating the exact correct answers.  BTW, my routines benchmark somewhere between 0% and 100% slower than the amd_sin() and amd_cos().  Of course, I'm computing 4 sines or cosines, so that helps get back the losses when you need 2, 3 or 4 sines and/or cosines computed.  Which is the idea, afterall.  Oh yeah.  I have equivalent routines with xmm registers instead of ymm registers, and they are somewhat faster.  I guess sometimes those 256-bit paths get clogged up, huh?

I left in quite a few instructions that are commented out.  Those instructions would be gradually enabled (in groups of 2) for each additional coefficient the routine is supposed to process.



# #############################################################

# #####  f64vec4 cosine via chebychev  :  9 coefficients  #####

# #############################################################


#  error = f64vec4_equal_sin_f64vec4 (f64vec4* result, const f64vec4* angles);


.align    64

    pushq     %rbp                    // save entry frame pointer ::: not necessary, leaf function

    movq      %rsp, %rbp              // create new frame pointer ::: not necessary, leaf function

    orq       %rdi, %rdi              // test rdi for zero

    jz        cos_f64vec4_error       // rdi == 0 == error (invalid output argument)

    orq       %rsi, %rsi              // test rsi for zero

    jz        cos_f64vec4_error       // rsi == 0 == error (invalid input argument)

    vmovapd   (%rsi), %ymm0           // ymm0 = angle0 : angle1 : angle2 : angle3


// get constants we will need during computation


    vmovapd  math_k_zero, %ymm3       // ymm03 (zero)   = 0.0000  : 0.0000  : 0.0000  : 0.0000

    vmovapd  math_k_itwopi, %ymm4     // ymm04 (1/2pi)  = 1/(2pi) : 1/(2pi) : 1/(2pi) : 1/(2pi)

    vmovapd  math_k_mtwopi, %ymm5     // ymm05 (-2pi)   = -2pi    : -2pi    : -2pi    : -2pi

    vmovapd  math_k_ptwopi, %ymm15    // ymm15 (twopi)  = +2pi    : +2pi    : +2pi    : +2pi

    vmovapd  math_k_3pidiv2, %ymm14   // ymm14 (3pi/2)  = 3pi/2   : 3pi/2   : 3pi/2   : 3pi/2

    vmovapd  math_k_pi, %ymm13        // ymm13 (pi)     = pi      : pi      : pi      : pi

    vmovapd  math_k_pidiv2, %ymm12    // ymm12 (pi/2)   = pi/2    : pi/2    : pi/2    : pi/2

    vmovapd  math_k_p1, %ymm11        // ymm11 (4 * +1) = +1.00   : +1.00   : +1.00   : +1.00

    vmovapd  math_k_m1, %ymm10        // ymm10 (4 * -1) = -1.00   : -1.00   : -1.00   : -1.00


// make sure input angles/arguments are in range 0.0000 to +2pi (or perhaps -twopi to +twopi is okay)


    vmulpd    %ymm0, %ymm4, %ymm0    // ymm00 (a/(2pi)  = a0/(2pi): a1/(2pi): a2/(2pi): a3/(2pi)

    vroundpd  $03, %ymm0, %ymm1      // ymm01 (angles)  = rtz(a0) : rtz(a1) : rtz(a2) : rtz(a3)

    vsubpd    %ymm1, %ymm0, %ymm0    // ymm00 (angles)  = frac(a0): frac(a1): frac(a2): frac(a3)

    vmulpd    %ymm0, %ymm15, %ymm0   // ymm00 (angles)  = a0      : a1      : a2      : a3

    vcmpltpd  %ymm3, %ymm0, %ymm7    // ymm07 (mask)    = a0<0.00 : a1<0.00 : a2<0.00 : a3<0.00

    vpcmov    %ymm7, %ymm3, %ymm15, %ymm3  // ymm03 (adjust)  = 0.0000 or 2pi depending on angle + or -

    vaddpd    %ymm3, %ymm0, %ymm0          // ymm00 (angles)  = a0      : a1      : a2      : a3


// now we are ready to fold the angles into quadrant 0 if necessary (not already in quadrant 0 == 0 to pi/2)


    vmovapd   %ymm0, %ymm9           // ymm09 (oangle) = a0      : a1      : a2      : a3

    vmovapd   %ymm11, %ymm3          // ymm03 (sign)   = +1.00   : +1.00   : +1.00   : +1.00

    vsubpd    %ymm9, %ymm13, %ymm8   // ymm08 (dangle) = pi-a0   : pi-a1   : pi-a2   : pi-a3

    vcmpltpd  %ymm9, %ymm12, %ymm7   // ymm07 (mask)   = a0>pi/2 : a1>pi/2 : a2>pi/2 : a3>pi/2

    vpcmov    %ymm7, %ymm0, %ymm8, %ymm0   // ymm00.n = ymm07 ? ymm09.n : ymm08.n

    vpcmov    %ymm7, %ymm3, %ymm10, %ymm3  // ymm03.n = ymm07 ? ymm03.n : ymm10.n


// all 4 elements of %ymm00.n and %ymm06.n are now correct for quadrant 0 or 1


    vsubpd    %ymm13, %ymm9, %ymm8   // ymm08 (dangle) = a0-pi   : a1-pi   : a2-pi   : a3-pi

    vcmpltpd  %ymm9, %ymm13, %ymm7   // ymm07 (mask)   = a0>pi   : a1>pi   : a2>pi   : a3>pi

    vpcmov    %ymm7, %ymm0, %ymm8, %ymm0   // ymm00.n = ymm07 ? ymm09.n : ymm08.n

    vpcmov    %ymm7, %ymm3, %ymm10, %ymm3  // ymm03.n = ymm07 ? ymm03.n : ymm10.n

// the above line is not needed because "sign" in quadrants 1 and 2 are the same (-1.000000)


// all 4 elements of %ymm00.n and %ymm06.n are now correct for quadrants 0 or 1 or 2


    vsubpd    %ymm9, %ymm15, %ymm8   // ymm08 (dangle) = 2pi-a0  : 2pi-a1  : 2pi-a2  : 2pi-a3

    vcmpltpd  %ymm9, %ymm14, %ymm7   // ymm07 (mask)   = a0>3pi/2: a1>3pi/2: a2>3pi/2: a3>3pi/2

    vpcmov    %ymm7, %ymm0, %ymm8, %ymm0   // ymm00.n = ymm07 ? ymm09.n : ymm08.n

    vpcmov    %ymm7, %ymm3, %ymm11, %ymm3  // ymm03.n = ymm07 ? ymm03.n : ymm11.n


// compute the function


    movq      $math_kcos_c09hc, %rax       // eax = address of 9 coefficients        ::: -pi/2 to +pi/2

    vmovapd   0x0000(%rax), %ymm4    // ymm4 =      k0

    vmovapd   0x0020(%rax), %ymm5    // ymm5 =      k1

    vmovapd   0x0040(%rax), %ymm6    // ymm6 =      k2

    vmovapd   0x0060(%rax), %ymm7    // ymm7 =      k3

//                                   // ymm0 =       a

    vmovapd   %ymm0, %ymm1           // ymm1 =       a    ::: (input arguments == angles in radians)

    vmovapd   %ymm0, %ymm2           // ymm2 =       a    ::: (input arguments == angles in radians)

    vmulpd    %ymm0, %ymm1, %ymm1    // ymm1 =      aa    ::: (input angles squared)


// begin loop (unrolled)


    vmovapd   %ymm4, %ymm0                    // ymm0 =         k0

    vmovapd   0x0080(%rax), %ymm4             // ymm4 =         k4

    vfmaddpd  %ymm5, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^2 * k0) + k1

    vmovapd   0x00A0(%rax), %ymm5             // ymm5 =         k5

    vfmaddpd  %ymm6, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^4 * k0) + (a^2 * k1) + k2

    vmovapd   0x00C0(%rax), %ymm6             // ymm6 =         k6

    vfmaddpd  %ymm7, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^6 * k0) + (a^4 * k1) + (a^2 * k2) + k3

    vmovapd   0x00E0(%rax), %ymm7             // ymm7 =         k7

    vfmaddpd  %ymm4, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^8 * k0) + (a^6 * k1) + (a^4 * k2) + (a^2 * k3) + k4

    vmovapd   0x0100(%rax), %ymm4             // ymm4 =         k8

    vfmaddpd  %ymm5, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^10 * k0) ...

//  vmovapd   0x0120(%rax), %ymm5             // ymm5 =         k9

    vfmaddpd  %ymm6, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^12 * k0) ...

//  vmovapd   0x0140(%rax), %ymm6             // ymm6 =         k10

    vfmaddpd  %ymm7, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^14 * k0) ...

//  vmovapd   0x0160(%rax), %ymm7             // ymm7 =         k11

    vfmaddpd  %ymm4, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^16 * k0) ...

//  vmovapd   0x0180(%rax), %ymm4             // ymm4 =         k12

//  vfmaddpd  %ymm5, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^18 * k0) ...

//  vmovapd   0x01A0(%rax), %ymm5             // ymm5 =         k13

//  vfmaddpd  %ymm6, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^20 * k0) ...

//  vmovapd   0x01C0(%rax), %ymm6             // ymm6 =         k14

//  vfmaddpd  %ymm7, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^22 * k0) ...

//  vmovapd   0x01E0(%rax), %ymm7             // ymm7 =         k15

//  vfmaddpd  %ymm4, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^24 * k0) ...

//  vfmaddpd  %ymm5, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^26 * k0) ...

//  vfmaddpd  %ymm6, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^28 * k0) ...

//  vfmaddpd  %ymm7, %ymm0, %ymm1, %ymm0      // ymm0 =  (a^30 * k0) ...

    vmulpd    %ymm3, %ymm0, %ymm0             // ymm0 =  final result (multiply by sign values computed above)

    vmovapd   %ymm0, (%rdi)                   // cosine.0123 = ymm0.0123

    xorq      %rax, %rax                      // %rax = 0 == "no error"

    movq      %rbp, %rsp                      // restore stack pointer ::: not necessary, leaf function

    popq      %rbp                            // restore frame pointer ::: not necessary, leaf function

    ret                                       // return to caller


    xorq      %rax, %rax                      // %rax = 0

    notq      %rax                            // %rax = -1 == general purpose error number

    movq      %rbp, %rsp                      // restore stack pointer ::: not necessary, leaf function

    popq      %rbp                            // restore frame pointer ::: not necessary, leaf function

    ret                                       // return to caller



fpu mul+add is not the same as fused muladd (vfmaddpd).

double a,b,c;



On the fpu with default fpu flags it will calculate the mul and the add with 80bit precision internally, but there is 64bit  rounding after the mul and after the add.

However vfmaddpd will do the mul, and produces an internal result whose precision I don't know, and right after it will do the addition and then 64bit rounding will happen only at the very end.

I guess the 9th step's extremely small refinement can add up more precisely with FMA because there is no rounding before the add.

(Just saw that the Bulldozer can do two xmm fmadds in a single cycle, using all the transistors of the 256bit vfmadd of course)


Well, there is one huge and important exception to that 80-bits of precision!

The original variables a, b, c are only 64-bit bits when they are shoved into the FPU.

The 16-bits of precision of the results are truncated (or at best rounded) when the 80-bit results are stored back into variables a, b, c.

The fused multiply-add instruction can at best save a 1-bit error in the least significant bit.  Unfortunately, except in extraordinary situations, that's all the 80-bit precision inside the ancient FPU unit can save too, most certainly including the case of one multiply and add.


Indeed 1 bit difference. On wiki it's shining more when they write that fma has MORE precision, but maybe they refer to more complicated things like divide.

Anyways "why more than 9 coefficients doesn't produce greater precision."

My guess is that you run out of the double.

Expecting the accumulated result can be big as 1e0 and possibly the 10h constant can be something like 1e-16. 16*log2(10)>53 bits, and there goes away your double precision.


Yes, I can understand running out of bits in the double precision value.

What I can't understand is why 10 & 11 coefficient routines generate much less precise results!


Have you debugged it? Hope it's not the CPU's SIMD.FP unit

(I'm trying to bring round my fixed point bignum multiplier thing on the hd7970, it's a real pain with no debugger and loads of 'meaningless' numbers everywhere haha)


I have not fully debugged the precision issue.

I certainly do have the basic routines working, but the precision issue is another matter.  However, I am slowly working through some of the preliminaries.  What I definitely won't be able to handle myself is reasons for precision performance of specific sets of chebyshev co-efficients.  I am vastly too crappy at math for something as esoteric as that.

Right now I'm working to make sure I have everything completely uniform, and I'm not dealing with range issues.  Every set of chebyshev co-efficients is intentionally designed to cover a certain range, typically pi/4 (45-degrees) or pi/2 (90-degrees) or pi (180-degrees).  The more you restrict your range, the fewer coefficients you need to get a specific level of precision.  At least that's the theory!  And indeed, if you push input angles beyond pi/4 into a routine with pi/4 coefficients, the precision goes down the tubes very, very fast.

It appears all the 7-coefficient sets are for pi/4.  The usual scam is to check for angles greater than pi/4, and when you find them, you compute the cosine of (pi/2 - angle) instead.  That's no problem when you're computing the sine or cosine of one value, but when you're computing 4 in parallel, well.... the question becomes "how do I compute sine of some of my 4 input angles, and cosines of others (an arbitrary mix).

That's where I'll be headed soon enough, even if I don't adopt any pi/4 co-efficient sets, because I want to at least provide functions to compute:


However, one way to handle input angles beyond the design limit for the co-efficients is to have the ability to flip the computation of any of the input angles from sine to cosine or cosine to sine.  Given those screwball techniques I adopted in the earlier range reduction portions, this should not be too difficult.  I'll just have to load both sine and cosine versions of each coefficient, then select one or the other based upon a mask that is all 0s for sine and all 1s for cosine.

I must say this.  Those vcmppd and vpcmov instructions are absolutely brilliant.  Without them, I'd be totally screwed.  With them, I can do just about anything I need to customize what happens to each of the 4 elements in each ymm register.