How to efficiently "emulate" SSE4.1 instructions?

I have been developing some SSE code on my MacBook Pro5,3,

and my target architecture is an AMD Opteron 8380.

From what I can see, the target does not support SSE4.1.

So I am looking for fast alternatives to "_mm_round_ps" and "_mm_extract_epi32", since my code performs these operations frequently.

For the moment I am replacing those instructions as illustrated in the code attachment.

I do something similar for "_mm_extract_epi32" (I store the __m128i on the stack and read its entries back from there).
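To make the "_mm_extract_epi32" replacement concrete, here is a minimal sketch of the store-to-stack approach described above, next to one SSE2-only variant that stays in registers. Both helper names are hypothetical (they are not part of any intrinsic header), and whether the second one is actually faster on the Opteron would still need to be measured.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: spill the vector to an aligned stack buffer
// and read one 32-bit lane back (the approach described above).
static inline int32_t extract_epi32_via_stack(__m128i v, int lane)
{
    alignas(16) int32_t tmp[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(tmp), v);
    return tmp[lane & 3];
}

// Hypothetical helper: move the wanted lane into position 0 with a
// shuffle, then read the low 32 bits. Lane must be a compile-time
// constant in 0..3 because the shuffle takes an immediate operand.
template <int Lane>
static inline int32_t extract_epi32_shuffle(__m128i v)
{
    static_assert(Lane >= 0 && Lane < 4, "lane index out of range");
    return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, Lane));
}
```

The shuffle version only works when the lane index is known at compile time, which is also the case for "_mm_extract_epi32" itself.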

With these replacements in place of the SSE4.1 instructions, the performance of this code drops from 3.3 GFLOP/s to 2.9 GFLOP/s per core. I was not expecting such a degradation, and I am sure I can do better if you teach me some smart tricks!

const __m128 one = _mm_set_ps1(1);

// I WANT TO REPLACE THIS:
// const __m128 apx = _mm_round_ps(x, _MM_FROUND_FLOOR) - one;

// SO I DO THIS:
union ma_che_schifo
{
    __m128 v;
    float f32[4];
};

{
    ma_che_schifo data;
    _mm_store_ps(data.f32, apx);

    for (int i = 0; i < 4; ++i)
        data.f32[i] = floorf(data.f32[i]) - 1;

    apx = _mm_load_ps(data.f32);
}