Archives Discussions

diegor1982 · 10-11-2010

How to efficiently "emulate" SSE4.1 instructions?

Hi everybody,

I have been developing some SSE code on my MacBook Pro5,3, and my target architecture is an AMD Opteron 8380.

From what I have seen, the target does not support SSE4.1, so I am looking for fast alternatives to "_mm_round_ps" and "_mm_extract_epi32", since my code performs these operations frequently.

For the moment I am replacing those instructions as illustrated in the code attachment.

I do something similar for "_mm_extract_epi32": I store the __m128i on the stack and read its entries from there.
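For reference, a register-only alternative to the stack round trip can be sketched with SSE2 alone: shift the wanted 32-bit lane down to position 0 with PSRLDQ and read it with MOVD. This is a minimal sketch, not a drop-in from any library; the helper name extract_epi32_sse2 is hypothetical, and the lane index has to be a compile-time constant because the shift takes an immediate.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only, no SSE4.1 required

// Hypothetical helper: SSE2 stand-in for _mm_extract_epi32(v, Lane).
// Lane must be a compile-time constant in 0..3.
template <int Lane>
static inline int extract_epi32_sse2(__m128i v) {
    static_assert(Lane >= 0 && Lane < 4, "lane must be 0..3");
    // Shift the whole register right by Lane * 4 bytes so the wanted
    // element lands in the low 32 bits, then move it to a general register.
    return _mm_cvtsi128_si32(_mm_srli_si128(v, Lane * 4));
}
```

Usage would look like `int a = extract_epi32_sse2<2>(v);`. Whether this beats the store-and-reload approach on a given core is something to measure rather than assume.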

With these replacements instead of the SSE4.1 instructions, the performance of this code drops from 3.3 GFLOP/s to 2.9 GFLOP/s per core. I was not expecting such a degradation, and I am sure I can do better if you teach me some smart tricks!

Thank you for helping me!

Best,

Diego

const __m128 one = _mm_set_ps1(1);

// I WANT TO REPLACE THIS:
// const __m128 apx = _mm_round_ps(x, _MM_FROUND_FLOOR) - one;

// SO I DO THIS
union ma_che_schifo { __m128 v; float f32[4]; };
{
    ma_che_schifo data;
    _mm_store_ps(data.f32, apx);  // assumes apx already holds the input values
    for (int i = 0; i < 4; ++i)
        data.f32[i] = floorf(data.f32[i]) - 1;  // scalar floor, lane by lane
    apx = _mm_load_ps(data.f32);
}

avk · 10-14-2010

What about the CVTPS2DQ/CVTDQ2PS instructions? You could use them to round as a side effect (they round according to the current MXCSR mode, round-to-nearest by default). Or, if you care about performance, use an equivalent of CVTPS2DQ: ADDPS xmm0, xmm1 followed by SUBPS xmm0, xmm1, where xmm1 contains the float value pow(2, 23) + pow(2, 22) (this trick is called a Right Shift conversion). But please remember that these substitutes have one caveat: the numbers you are rounding must be in the range -2^31 ... +2^31 (i.e. similar to the C "int" type), and for the ADDPS/SUBPS variant they must stay well inside it, roughly |x| < 2^22.
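The suggestions above can be sketched with SSE2 intrinsics as follows. This is an illustration under stated assumptions, not a tested replacement for _mm_round_ps: the function names are hypothetical, the code must be compiled without -ffast-math (otherwise the compiler may cancel the add/subtract pair), and inputs are assumed to be finite and inside the ranges noted in the comments. Because both conversions round rather than floor, the floor version adds a compare-and-correct step.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only

// Round via CVTPS2DQ + CVTDQ2PS. Uses the current MXCSR rounding mode
// (round-to-nearest-even by default). Assumes |x| < 2^31.
static inline __m128 round_cvt_sse2(__m128 x) {
    return _mm_cvtepi32_ps(_mm_cvtps_epi32(x));
}

// Round to nearest via the add/subtract magic number 2^23 + 2^22.
// Assumes default rounding mode and roughly |x| < 2^22.
static inline __m128 round_magic_sse2(__m128 x) {
    const __m128 magic = _mm_set1_ps(12582912.0f);  // 2^23 + 2^22
    return _mm_sub_ps(_mm_add_ps(x, magic), magic);
}

// Floor built on top of the rounding conversion: wherever the rounded
// value ended up above x, subtract 1 to step back down.
static inline __m128 floor_sse2(__m128 x) {
    const __m128 r = round_cvt_sse2(x);
    const __m128 too_big = _mm_cmpgt_ps(r, x);  // all-ones mask where r > x
    return _mm_sub_ps(r, _mm_and_ps(too_big, _mm_set1_ps(1.0f)));
}
```

With these, the original expression would become something like `_mm_sub_ps(floor_sse2(x), one)`. The magic-number variant avoids the two conversions but has the narrower input range, so which one is faster and safe enough depends on the data.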
