4 Replies Latest reply on May 22, 2009 12:06 PM by extremeseos0007

    More than one SSE floating point operation per cycle on Barcelona?

      What does it take to execute 1 fadd + 1 fmul + 1 mov per cycle?

      I have a loop whose body I have arranged in groups of three instructions, each consisting of an addps, a mulps, and a movaps.

      With sufficient unrolling and all 16 SSE registers, I can schedule the instructions to completely hide the 4-cycle latency of the adds and muls and the 2-cycle latency of the movs.

      So basically my code looks like repeated iterations of

       addps %xmm0, %xmm8
       mulps %xmm12, %xmm4
       movaps 32(%rsi), %xmm0

       addps %xmm1, %xmm9
       mulps %xmm13, %xmm5

       addps %xmm2, %xmm10
       mulps %xmm13, %xmm6
       movaps %xmm0, %xmm1

       addps %xmm3, %xmm11
       mulps %xmm14, %xmm7
       movaps %xmm0, %xmm2

       movaps %xmm8, 32(%rdx)
       movaps %xmm0, %xmm3

      where in each iteration the registers are permuted (while still honoring the 4-cycle latencies). All loads and stores are expected to hit the L1 cache (except in the first loop iteration, which can be neglected).
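      As a rough illustration of the instruction mix described above, here is a minimal C sketch using SSE intrinsics. It is not the original kernel: the function name `mixed_kernel`, its arguments, and the data layout are assumptions made up for this example. It keeps four independent add chains and four independent multiply chains, which is the property that lets a 4-cycle latency be hidden when one instruction of each kind is issued per cycle. A compiler is free to reschedule the intrinsics, so this does not pin down the exact instruction order of the hand-written loop.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_mul_ps, ... */

/*
 * Hypothetical kernel (an assumption, not the code from the post).
 * in   : nvec vectors of 4 floats each; nvec must be a multiple of 4
 * out  : 8 floats; out[0..3] = combined sums, out[4..7] = combined products
 * scale: factor applied once per loop iteration to every multiply chain
 */
void mixed_kernel(const float *in, float *out, size_t nvec, float scale)
{
    /* Four add accumulators and four multiply accumulators, all independent. */
    __m128 s0 = _mm_setzero_ps(), s1 = s0, s2 = s0, s3 = s0;
    __m128 p0 = _mm_set1_ps(1.0f), p1 = p0, p2 = p0, p3 = p0;
    const __m128 c = _mm_set1_ps(scale);

    for (size_t i = 0; i < nvec; i += 4) {
        /* Unaligned loads keep the sketch safe for any buffer;
           the post uses aligned movaps instead. */
        __m128 v0 = _mm_loadu_ps(in + 4 * (i + 0));
        __m128 v1 = _mm_loadu_ps(in + 4 * (i + 1));
        __m128 v2 = _mm_loadu_ps(in + 4 * (i + 2));
        __m128 v3 = _mm_loadu_ps(in + 4 * (i + 3));

        /* One add and one multiply per group, each on its own chain. */
        s0 = _mm_add_ps(s0, v0);  p0 = _mm_mul_ps(p0, c);
        s1 = _mm_add_ps(s1, v1);  p1 = _mm_mul_ps(p1, c);
        s2 = _mm_add_ps(s2, v2);  p2 = _mm_mul_ps(p2, c);
        s3 = _mm_add_ps(s3, v3);  p3 = _mm_mul_ps(p3, c);
    }

    /* Combine the independent chains only once, after the loop. */
    __m128 s = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
    __m128 p = _mm_mul_ps(_mm_mul_ps(p0, p1), _mm_mul_ps(p2, p3));
    _mm_storeu_ps(out, s);
    _mm_storeu_ps(out + 4, p);
}
```

      Inspecting the compiler's assembly output for such a function (for example with `gcc -O2 -S`) is one way to compare its schedule against a hand-arranged one.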

      I would expect each group of three instructions to be executed in a single cycle on Barcelona. However, when I actually time the code, it is almost 50% slower, which corresponds to exactly one floating-point (fmul or fadd) instruction per cycle.
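      The expectation above can be turned into a small back-of-the-envelope calculation. The C sketch below is only a simplified model resting on stated assumptions: at most `width` instructions start per cycle, and each class (add, mul, mov) starts at most one instruction per cycle. The helper names `ideal_cycles` and `serialized_fp_cycles` are hypothetical. For the listing above (4 addps, 4 mulps, 5 movaps) with `width` = 3, the model gives a lower bound of 5 cycles, whereas one floating-point instruction per cycle gives 8 cycles.

```c
/*
 * Lower bound on cycles under the assumed model: no more than `width`
 * instructions start per cycle, and no more than one add, one mul, and
 * one mov start in any single cycle. Not a description of real hardware.
 */
unsigned ideal_cycles(unsigned adds, unsigned muls, unsigned movs,
                      unsigned width)
{
    unsigned total = adds + muls + movs;
    unsigned cycles = (total + width - 1) / width;  /* ceil(total / width) */

    /* The busiest instruction class also bounds the cycle count. */
    if (adds > cycles) cycles = adds;
    if (muls > cycles) cycles = muls;
    if (movs > cycles) cycles = movs;
    return cycles;
}

/*
 * Cycles if adds and muls never overlap, i.e. exactly one floating-point
 * instruction starts per cycle (the behavior reported in the post).
 */
unsigned serialized_fp_cycles(unsigned adds, unsigned muls)
{
    return adds + muls;
}
```

      Comparing the two numbers for a given loop body shows how far a measured timing is from the dual-pipeline expectation.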

      Compared to my hand-optimized code described above, the suboptimal compiler-generated code performs only marginally worse, indicating that Barcelona's out-of-order execution does a very good job.

      Whatever I do, I seem to hit a wall at one floating-point instruction per cycle, instead of the two per cycle that the independent fadd and fmul pipelines should allow.

      What does it take to make full use of both the fadd and fmul pipelines in parallel?

      Thanks a lot for any help!