Hi everybody,
I have been developing some SSE code on my MacBook Pro5,3,
and my target architecture is an AMD Opteron 8380.
From what I have seen, the target does not support SSE4.1.
So I am looking for fast alternatives to "_mm_round_ps" and "_mm_extract_epi32",
because my code performs these operations frequently.
For the moment I am replacing those instructions as illustrated in the code attachment.
I do something similar for "_mm_extract_epi32" (I store the __m128i on the stack and access its entries from there).
With these replacements in place of the SSE4.1 instructions, the performance of this code drops from 3.3 GFLOP/s to 2.9 GFLOP/s per core. I was not expecting such a degradation, and I am sure I can do better if you teach me some smart tricks!
Thank you for helping me!
Best,
Diego
const __m128 one = _mm_set_ps1(1);

// I WANT TO REPLACE THIS:
// const __m128 apx = _mm_round_ps(x, _MM_FROUND_FLOOR) - one;

// SO I DO THIS
union ma_che_schifo
{
    __m128 v;
    float f32[4];
};

{
    ma_che_schifo data;
    _mm_store_ps(data.f32, apx);
    // floor each lane on the scalar side, then subtract one
    for (int i = 0; i < 4; ++i)
        data.f32[i] = floorf(data.f32[i]) - 1;
    apx = _mm_load_ps(data.f32);
}
What about the CVTPS2DQ/CVTDQ2PS instructions? You could use them to round as a side effect. Or, if you care about performance, use the analog of CVTPS2DQ: ADDPS xmm0, xmm1 followed by SUBPS xmm0, xmm1, where xmm1 contains the float value pow(2, 23) + pow(2, 22) = 12582912.0 (this trick is called a Right Shift conversion). But please remember that these analogs have one catch: the numbers you are rounding must lie in the range -pow(2, 31)...+pow(2, 31) (i.e. the range of the C "int" type), and the ADDPS/SUBPS variant only holds in a narrower range, roughly |x| < pow(2, 22).
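To make the conversion idea concrete, here is a minimal SSE2-only sketch of both replacements. Note that CVTPS2DQ rounds according to the current MXCSR mode (round-to-nearest by default), which is not the same as floor, so this sketch uses the truncating conversion instead and corrects the lanes where truncation rounded up. The helper names floor_ps_sse2 and extract_epi32_sse2 are made up for illustration; they are not part of any library. The floor helper assumes the inputs fit in a 32-bit int, the same range restriction mentioned above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only, no SSE4.1 required

// Hypothetical helper: per-lane floor(x), valid only while x fits in an int32.
static inline __m128 floor_ps_sse2(__m128 x)
{
    // Truncate toward zero (CVTTPS2DQ), then convert back to float (CVTDQ2PS).
    const __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
    // Truncation rounds negative non-integers up, so t > x in exactly those lanes.
    const __m128 too_big = _mm_cmpgt_ps(t, x);  // all-ones mask where t > x
    // Subtract 1.0f only in the masked lanes.
    return _mm_sub_ps(t, _mm_and_ps(too_big, _mm_set1_ps(1.0f)));
}

// Hypothetical helper: SSE2 stand-in for _mm_extract_epi32(v, I).
// I must be a compile-time constant in 0..3, hence the template parameter.
template <int I>
static inline int extract_epi32_sse2(__m128i v)
{
    // Byte-shift lane I down to lane 0 (PSRLDQ), then move it to a GPR (MOVD).
    return _mm_cvtsi128_si32(_mm_srli_si128(v, 4 * I));
}
```

With these helpers, the attachment's computation would become _mm_sub_ps(floor_ps_sse2(x), one), and extract_epi32_sse2<2>(v) would return the third 32-bit lane of v. Both stay in registers instead of going through the stack; whether that recovers the lost GFLOP/s on the Opteron would have to be measured.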