# fastest way to implement SIMD/AVX conditional execution

Question asked by maxreason on Jul 12, 2012
Latest reply on Jul 17, 2012 by maxreason

Greetings and Aloha,

I've pretty much beat myself to death writing some efficient 256-bit parallel vector math functions like the following in 64-bit SIMD/AVX/FMA4 assembly language:

- math_sin_sin_sin_sin (f64vec4* result, f64vec4* input);

- math_sin_sin_cos_cos (f64vec4* result, f64vec4* input);

... and so forth

Now that these work pretty well over a small range (-pi/2 to +pi/2), I worry the "truly hard part" has arrived. I need to add code at the top of these functions that essentially implements the following routine for all 4 input angles/arguments, making sure the input arguments are within the appropriate range and converting them when they are not. Hopefully this can be done without adding more CPU cycles than the trig routines themselves take:

    //
    // truncate angle into range -TWOPI < angle < +TWOPI
    //
    if ((angle <= -MATH_TWOPI) || (angle >= MATH_TWOPI)) {
        dangle = angle * MATH_1DIVTWOPI;       // dangle == angle in units of 1 revolution AKA (2 * pi)
        intpart = math_integer_zero (dangle);
        angle = (dangle - intpart) * MATH_TWOPI;
    }
    if (angle < 0.0000) { angle = angle + MATH_TWOPI; } // convert -neg angle into equivalent +pos angle
    //
    //    fold angle into range 0 <= angle <= 90 and generate appropriate sign to multiply by final result
    //
    if (angle <= MATH_PIDIV2) {
        sign = +1.0000;
    } else if (angle <= MATH_PI) {
        sign = -1.0000;
        angle = MATH_PI - angle;
    } else if (angle <= (MATH_PI + MATH_PIDIV2)) {
        sign = -1.0000;
        angle = angle - MATH_PI;
    } else {
        sign = +1.0000;
        angle = MATH_TWOPI - angle;
    }
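
For the first half of the routine, one thing that may help is that the range test does not strictly need to be a branch: for an angle already inside the range the integer part is zero, so running the same three steps unconditionally gives back the same angle up to rounding. Below is a minimal scalar C sketch of that idea. The function name `wrap_angle` and the constant values are filled in for illustration, and `math_integer_zero` is assumed to mean "truncate toward zero" (which is what `vroundpd` in truncate mode does per lane).

```c
/* Constant values are assumptions written out for this sketch. */
#define MATH_TWOPI      6.283185307179586477
#define MATH_1DIVTWOPI  0.159154943091895336

/* Branch-free form of the first half of the routine: always scale,
   drop the integer part, scale back, then add TWOPI only where the
   result came out negative. */
static double wrap_angle(double angle)
{
    double dangle  = angle * MATH_1DIVTWOPI;     /* angle in revolutions */

    /* Assumed meaning of math_integer_zero: truncate toward zero.
       The cast is only valid while |dangle| fits in a long long. */
    double intpart = (double)(long long)dangle;

    double wrapped = (dangle - intpart) * MATH_TWOPI;

    /* A select rather than a jump: in SIMD form this would be a compare
       mask ANDed with TWOPI, which yields TWOPI or 0.0 per lane. */
    double addend  = (wrapped < 0.0) ? MATH_TWOPI : 0.0;

    return wrapped + addend;
}
```

With this shape every lane executes the same instructions, so no per-lane decision is needed for the truncation step. The cost is that in-range inputs pick up a small rounding error from the multiply by 1/(2*pi) and back, which may or may not be acceptable for the accuracy target.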

I see and have these very cool SIMD/AVX/FMA4-level instructions:

    vcmppd     $imm8, %ymmX, %ymmS, %ymmD

These let me perform the comparisons against -TWOPI or +TWOPI or PIDIV2 or PI or whatever value I need to compare against, and they leave 0x0000000000000000 or 0xFFFFFFFFFFFFFFFF in each component of the destination register when the specified comparison fails or passes.  Excellent!  So far so good.
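
To make the mask behavior concrete, here is a minimal scalar C sketch of what a `vcmppd` with a "less than or equal" predicate leaves in each 64-bit lane. The types `f64x4` and `mask64x4` and the function `cmp_le` are hypothetical names used only for illustration; they stand in for a ymm register and the instruction, and are not part of any library.

```c
#include <stdint.h>

/* Four 64-bit lanes standing in for one ymm register (hypothetical types). */
typedef struct { double   lane[4]; } f64x4;
typedef struct { uint64_t lane[4]; } mask64x4;

/* Scalar model of vcmppd with a "less than or equal" predicate:
   each lane becomes all ones where a <= b, and all zeros otherwise. */
static mask64x4 cmp_le(f64x4 a, f64x4 b)
{
    mask64x4 m;
    for (int i = 0; i < 4; i++)
        m.lane[i] = (a.lane[i] <= b.lane[i]) ? UINT64_MAX : 0;
    return m;
}
```

Each lane is decided independently, so one compare can come back with a mix of passing and failing lanes. That mixed case is what the strategies below have to handle.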

But this is where I'm having trouble seeing my way to the next steps to implement these routines on all four arguments.

I sorta see various potential strategies, but I don't see how to implement them efficiently.

For example, it would be nice if there were a way to simply skip past various work if all 4 arguments (in the 4 components of the tested ymm register) passed or failed the test.  I can see how doing two horizontal adds, plus one of the funky move instructions, plus another horizontal add could leave me with a zero or non-zero value in the bottom component of some ymm register to base a conditional branch upon, presumably after moving it to the main CPU register set?  But it seems to me there must be a faster way than that!
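
One possible way to get the "all passed / all failed" test without the horizontal adds is `vmovmskpd` (it appears in the instruction list at the end of this post), which copies the top bit of each 64-bit lane into the low bits of a general-purpose register. After a compare, that gives a 4-bit value that an ordinary compare-and-branch can test. The sketch below models this in scalar C; `movemask4`, `none_set` and `all_set` are hypothetical helper names, not real intrinsics.

```c
#include <stdint.h>

/* Scalar model of vmovmskpd on a 256-bit register: pack the top (sign)
   bit of each of the four 64-bit lanes into bits 0..3 of an integer. */
static unsigned movemask4(const uint64_t mask[4])
{
    unsigned bits = 0;
    for (int i = 0; i < 4; i++)
        bits |= (unsigned)(mask[i] >> 63) << i;
    return bits;
}

/* Hypothetical helpers for the early-out decision. */
static int none_set(const uint64_t mask[4]) { return movemask4(mask) == 0x0u; }
static int all_set(const uint64_t mask[4])  { return movemask4(mask) == 0xFu; }
```

A result of 0x0 means no lane passed and 0xF means all four did, so either case can jump over the work. Any other value means the lanes disagree, and the code has to fall through to a path that handles each lane on its own merits.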

Another fairly normal SIMD strategy is to execute all the instructions, but somewhere have a conditional move instruction that moves either the original value or the newly computed value to the destination, based upon the 4-element ymm register mask generated by one of these vcmppd instructions.  Maybe I'm just blind, but I don't see any conditional move instructions at the moment.  That's strange, because I could swear I saw them a few months ago when I was reading about the new AVX / ymm instructions.
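
The role of a "conditional move" on vector lanes is usually played by a variable blend, and `vblendvpd` (in the instruction list at the end of this post) is one such instruction: for each lane it picks one of two sources according to the top bit of the corresponding mask lane. A minimal scalar C model of that behavior follows; `blendv4` is a hypothetical name for illustration.

```c
#include <stdint.h>

/* Scalar model of vblendvpd: for each lane, take if_set when the top bit
   of the mask lane is 1, otherwise take if_clear. */
static void blendv4(double out[4],
                    const double if_clear[4],
                    const double if_set[4],
                    const uint64_t mask[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (mask[i] >> 63) ? if_set[i] : if_clear[i];
}
```

Because a `vcmppd` result is all ones or all zeros per lane, its top bit is exactly the pass/fail flag, so the compare output can be fed to the blend directly as the selector.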

I went browsing around looking for any articles on this topic, but I didn't find anything terribly helpful.

I will appreciate any help, even if it is simply a link or two to articles my searches missed.

Thanks!

================

Later:  After reviewing volume 4, I compiled a list of instructions that might help.  I'll list them below for anyone who eventually finds this thread, but I'm still interested in hearing discussions of "strategies" for ways to attack this situation.

- vcmppd
- vpcmov
- vblendpd
- vblendvpd
- vmovmskpd
- vpunpcklqdq
- vpunpckhqdq
- unpcklpd
- vunpcklpd
- unpckhpd
- vunpckhpd
- vextractf128
- vinsertf128
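
As one possible strategy built from `vcmppd` and `vblendvpd` in the list above, the quadrant fold from the question could be written with no per-lane branches at all: compute every candidate result, then let three compare masks choose. The sketch below is a scalar C model of that approach, not tuned assembly. `cmp_gt`, `blendv` and `fold4` are hypothetical names, the constant values are filled in for illustration, and the signs follow the routine in the question as written.

```c
#include <stdint.h>

/* Constant values are assumptions written out for this sketch. */
#define MATH_PI      3.14159265358979323846
#define MATH_PIDIV2  (MATH_PI / 2.0)
#define MATH_TWOPI   (MATH_PI * 2.0)

/* Compare model (vcmppd, "greater than"): all ones where a > b, else zero. */
static uint64_t cmp_gt(double a, double b) { return (a > b) ? UINT64_MAX : 0; }

/* Blend model (vblendvpd): pick if_set where the mask's top bit is 1. */
static double blendv(double if_clear, double if_set, uint64_t mask)
{
    return (mask >> 63) ? if_set : if_clear;
}

/* Fold four angles, each assumed to be in 0..TWOPI already, into
   0..PIDIV2 and produce the matching signs. Every candidate value is
   computed for every lane; the masks decide which one survives. */
static void fold4(double angle[4], double sign[4])
{
    for (int i = 0; i < 4; i++) {            /* one iteration per ymm lane */
        double a = angle[i];
        uint64_t gt_q1 = cmp_gt(a, MATH_PIDIV2);
        uint64_t gt_q2 = cmp_gt(a, MATH_PI);
        uint64_t gt_q3 = cmp_gt(a, MATH_PI + MATH_PIDIV2);

        double folded = a;                                /* quadrant 1 */
        folded = blendv(folded, MATH_PI - a,    gt_q1);   /* quadrant 2 */
        folded = blendv(folded, a - MATH_PI,    gt_q2);   /* quadrant 3 */
        folded = blendv(folded, MATH_TWOPI - a, gt_q3);   /* quadrant 4 */

        /* Sign is -1 only in quadrants 2 and 3: past q1 but not past q3
           (an AND-NOT of two masks). */
        sign[i]  = blendv(+1.0, -1.0, gt_q1 & ~gt_q3);
        angle[i] = folded;
    }
}
```

Later blends overwrite earlier ones, so the cascade reproduces the if / else-if ordering of the original routine. Whether doing all this work on every call beats an early-out branch probably depends on how often the inputs all fall in the same quadrant, so it seems worth measuring both.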