How to efficiently "emulate" SSE4.1 instructions?
Hi everybody,
I have been developing some SSE code on my MacBook Pro5,3,
and my target architecture is an AMD Opteron 8380.
From what I saw, the target does not support SSE4.1.
So I am looking for fast alternatives to "_mm_round_ps" and "_mm_extract_epi32",
as my code performs these operations frequently.
For the moment I am replacing those instructions as illustrated in the code attachment.
I do something similar for "_mm_extract_epi32" (I store the __m128i on the stack and read its entries from there).
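To make the "_mm_extract_epi32" replacement concrete, here is a minimal sketch of the store-to-stack approach described above. The helper name extract_epi32_stack is made up for illustration and is not part of any library; only SSE2 intrinsics are used.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only, no SSE4.1
#include <cstdint>

// Hypothetical helper (name is an assumption): emulates
// _mm_extract_epi32(v, lane) by spilling the vector to an aligned
// stack buffer and reading one 32-bit entry back.
static inline int32_t extract_epi32_stack(__m128i v, int lane) {
    alignas(16) int32_t tmp[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), v);  // aligned store
    return tmp[lane];  // lane must be in 0..3
}
```

Unlike the real intrinsic, the lane index here does not have to be a compile-time constant, at the cost of a round trip through memory.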
With these replacements instead of the SSE4.1 instructions, the performance of this code drops from 3.3 GFLOP/s to 2.9 GFLOP/s per core. I was not expecting such a degradation, and I am sure I can do better if you teach me some smart tricks!
Thank you for helping me!
Best,
Diego
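#include <emmintrin.h> // SSE2 intrinsics
#include <math.h>      // floorf

const __m128 one = _mm_set_ps1(1);
// I WANT TO REPLACE THIS:
// const __m128 apx = _mm_round_ps(x, _MM_FROUND_FLOOR) - one;
// SO I DO THIS:
union ma_che_schifo { __m128 v; float f32[4]; };
__m128 apx;
{
    ma_che_schifo data;
    _mm_store_ps(data.f32, x); // spill the input vector to the stack
    for (int i = 0; i < 4; ++i)
        data.f32[i] = floorf(data.f32[i]) - 1; // scalar floor, lane by lane
    apx = _mm_load_ps(data.f32); // reload the result into an SSE register
}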
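One SSE2-only candidate that stays in registers is sketched below, for comparison with the scalar loop above. It truncates toward zero with a float-to-int-to-float round trip and then subtracts 1 in the lanes where truncation rounded up. This is only a sketch under stated assumptions: the name floor_ps_sse2 is hypothetical, and the result is only meaningful when every lane is finite and fits in a signed 32-bit integer (no NaN, no infinities, |x| < 2^31).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only, no SSE4.1

// Hypothetical helper (name is an assumption): SSE2 emulation of
// _mm_round_ps(x, _MM_FROUND_FLOOR).
// Assumes every lane is finite and within signed 32-bit integer range.
static inline __m128 floor_ps_sse2(__m128 x) {
    const __m128 one = _mm_set_ps1(1.0f);
    // Truncate toward zero: float -> int32 -> float.
    const __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
    // Truncation rounds negative non-integers up, so t > x in those lanes.
    const __m128 too_big = _mm_cmpgt_ps(t, x);
    // Subtract 1 only where the mask is set (all-ones AND 1.0f == 1.0f).
    return _mm_sub_ps(t, _mm_and_ps(too_big, one));
}
```

The expression from the snippet above would then read apx = _mm_sub_ps(floor_ps_sse2(x), one). Whether this is actually faster than the store, floorf, and reload sequence on the Opteron would have to be measured.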