1 Reply Latest reply on Oct 14, 2010 10:38 AM by avk

    "_mm_round_ps" and "_mm_extract_epi32" for Shanghai (Opteron 8380)

      how to efficiently "emulate" SSE4.1 instructions?

      Hi everybody,

      I have been developing some SSE code on my MacBook Pro5,3,

      and my target architecture is an AMD Opteron 8380.

      From what I saw, the target does not support SSE4.1.

      So I am looking for fast alternatives to "_mm_round_ps" and "_mm_extract_epi32",

      because my code performs these operations frequently.

      For the moment I am replacing those instructions as illustrated in the code attachment.

      I do something similar for "_mm_extract_epi32" (I store the __m128i on the stack and read its entries from there).

      With these replacements instead of the SSE4.1 instructions, the performance of this code drops from 3.3 GFLOP/s to 2.9 GFLOP/s per core. I was not expecting such a degradation, and I am sure I can do better if you teach me some smart tricks!

      Thank you for helping me!






      const __m128 one = _mm_set_ps1(1);

      // I WANT TO REPLACE THIS:
      // const __m128 apx = _mm_round_ps(x, _MM_FROUND_FLOOR) - one;

      // SO I DO THIS
      union ma_che_schifo
      {
          __m128 v;
          float f32[4];
      };

      {
          ma_che_schifo data;
          _mm_store_ps(data.f32, apx);
          for (int i = 0; i < 4; ++i)
              data.f32[i] = floorf(data.f32[i]) - 1;
          apx = _mm_load_ps(data.f32);
      }
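For the "_mm_extract_epi32" part, the round trip through the stack can probably be avoided with SSE2 alone. The following is only a sketch: the helper name `extract_epi32_sse2` is made up for illustration, and the lane index has to be a compile-time constant, just as with the SSE4.1 intrinsic.

```cpp
#include <emmintrin.h>  // SSE2 only, no SSE4.1 required

// Hypothetical helper (not part of any library): returns 32-bit lane I (0..3) of v.
template <int I>
static inline int extract_epi32_sse2(__m128i v)
{
    // PSHUFD broadcasts lane I into every lane, so it ends up in lane 0;
    // MOVD then copies lane 0 into a general-purpose register.
    return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, _MM_SHUFFLE(I, I, I, I)));
}
```

A call would look like `int second = extract_epi32_sse2<1>(v);`. Whether this beats the store/reload on the Opteron 8380 is something to measure rather than assume.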

        • "_mm_round_ps" and "_mm_extract_epi32" for Shanghai (Opteron 8380)

          What about the CVTPS2DQ/CVTDQ2PS instructions? You could use them to round as a side effect of the conversion. Or, if you care about performance, use an analog of CVTPS2DQ: ADDPS xmm0, xmm1 followed by SUBPS xmm0, xmm1, where xmm1 holds the float value pow(2, 23) + pow(2, 22) (this trick is called a Right Shift conversion). But please remember that these analogs have one caveat: the numbers you are rounding must lie in a limited range. For CVTPS2DQ that is -pow(2, 31) to +pow(2, 31) (i.e. the range of the C "int" type); for the ADDPS/SUBPS trick it is narrower, roughly -pow(2, 22) to +pow(2, 22).