More than one SSE floating point operation per cycle on Barcelona?

Discussion created by tschmielau on May 17, 2009
Latest reply on May 22, 2009 by extremeseos0007
What does it take to execute 1 fadd + 1 fmul + 1 mov per cycle?

I have a loop whose body I have arranged in groups of three instructions, each consisting of an addps, a mulps, and a movaps.

With sufficient unrolling and all 16 SSE registers in use, I can schedule the instructions to completely hide the 4-cycle latency of the adds and muls and the 2-cycle latency of the movaps.

So basically my code looks like repeated iterations of

 addps %xmm0, %xmm8
 mulps %xmm12, %xmm4
 movaps 32(%rsi), %xmm0

 addps %xmm1, %xmm9
 mulps %xmm13, %xmm5

 addps %xmm2, %xmm10
 mulps %xmm13, %xmm6
 movaps %xmm0, %xmm1

 addps %xmm3, %xmm11
 mulps %xmm14, %xmm7
 movaps %xmm0, %xmm2

 movaps %xmm8, 32(%rdx)
 movaps %xmm0, %xmm3

where in each iteration the registers are permuted (while still honoring the 4-cycle latencies). All loads and stores are expected to hit the L1 cache (except in the initial loop iteration, which can be neglected).
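For anyone who wants to reproduce the dependency structure without hand-written assembly, here is a minimal C sketch of the latency-hiding idea: four independent add chains and four independent multiply chains, one per cycle of the 4-cycle latency. It assumes the GCC/Clang vector extension; the names `v4sf`, `CHAINS`, and `run_chains` are hypothetical and not taken from the original loop. The compiler is free to reschedule or reshape this code, so the generated assembly should be checked before drawing conclusions from a timing.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang vector extension (an assumption of this sketch):
   four packed floats, the width of one XMM register. */
typedef float v4sf __attribute__((vector_size(16)));

/* One independent chain per cycle of the 4-cycle add/mul latency. */
enum { CHAINS = 4 };

/* Hypothetical helper: advances CHAINS independent add chains and CHAINS
   independent multiply chains `iters` times, then returns the sum of lane 0
   of every accumulator so the work has an observable result. */
float run_chains(size_t iters, float add_step, float mul_step)
{
    assert(iters > 0);
    v4sf as = {add_step, add_step, add_step, add_step};
    v4sf ms = {mul_step, mul_step, mul_step, mul_step};
    v4sf a[CHAINS], m[CHAINS];

    /* Distinct start values keep the chains from being merged into one. */
    for (int k = 0; k < CHAINS; k++) {
        float s = (float)k;
        a[k] = (v4sf){s, s, s, s};
        m[k] = (v4sf){s + 1.0f, s + 1.0f, s + 1.0f, s + 1.0f};
    }

    for (size_t i = 0; i < iters; i++) {
        for (int k = 0; k < CHAINS; k++) {
            a[k] += as;   /* addps candidate: depends only on a[k] */
            m[k] *= ms;   /* mulps candidate: depends only on m[k] */
        }
    }

    float sum = 0.0f;
    for (int k = 0; k < CHAINS; k++)
        sum += a[k][0] + m[k][0];
    return sum;
}
```

To time it, one could wrap a call with a large `iters` in a cycle counter and divide the elapsed cycles by `iters * CHAINS`. A result near 1 would suggest one add/multiply pair per cycle, a result near 2 the one-instruction-per-cycle limit described here.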

I would expect each group of three instructions to be executed in a single cycle on Barcelona. However, when I actually time the code, it is almost 50% slower, which corresponds to exactly one floating-point instruction (fmul or fadd) per cycle.
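The gap between the expected and measured rates can be put into numbers with a small lower-bound model. This is only a sketch under stated assumptions: latency is fully hidden, each of the add and multiply pipes accepts one instruction per cycle, and `cap` is a hypothetical parameter for the number of floating-point arithmetic instructions issued per cycle (2 if both pipes run in parallel, 1 for the measured limit). The function name `cycles_lower_bound` is likewise made up for illustration.

```c
#include <assert.h>

/* Lower bound on cycles for a block of packed adds and multiplies.
   `cap` is the assumed number of FP arithmetic instructions issued per
   cycle: 2 if the add and multiply pipes run in parallel, 1 for the
   limit actually measured. Latency is assumed to be fully hidden. */
unsigned cycles_lower_bound(unsigned adds, unsigned muls, unsigned cap)
{
    assert(cap == 1 || cap == 2);
    unsigned total = adds + muls;
    unsigned by_cap = (total + cap - 1) / cap;     /* ceil(total / cap) */
    unsigned by_pipe = adds > muls ? adds : muls;  /* one per pipe per cycle */
    return by_cap > by_pipe ? by_cap : by_pipe;
}
```

For the four addps and four mulps in the listing, the model gives 4 cycles with `cap = 2` and 8 cycles with `cap = 1`.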

Compared to the hand-optimized code described above, the suboptimal compiler-generated code performs only marginally worse, which indicates that Barcelona's out-of-order execution does a very good job.

Whatever I do, I seem to hit a wall at one floating-point instruction per cycle (instead of the 2 per cycle that the independent fadd and fmul pipelines should allow).

What does it take to make full use of both the fadd and fmul pipelines in parallel?

Thanks a lot for any help!