I have a loop whose body I have arranged in groups of three instructions, each group consisting of an addps, a mulps, and a movaps.
With sufficient unrolling and use of all 16 SSE registers I can schedule the instructions so as to completely hide the 4-cycle latency of the adds and muls and the 2-cycle latency of the movs.
So basically my code looks like repeated iterations of
addps %xmm0, %xmm8
mulps %xmm12, %xmm4
movaps 32(%rsi), %xmm0
addps %xmm1, %xmm9
mulps %xmm13, %xmm5
addps %xmm2, %xmm10
mulps %xmm14, %xmm6
movaps %xmm0, %xmm1
addps %xmm3, %xmm11
mulps %xmm15, %xmm7
movaps %xmm0, %xmm2
movaps %xmm8, 32(%rdx)
movaps %xmm0, %xmm3
where in each iteration the registers are permuted (while still honoring the 4-cycle latencies). All loads and stores are expected to hit in L1 cache (except on the initial loop iteration, which can be neglected).
I would expect each group of three instructions to execute in a single cycle on Barcelona. However, when I actually time the code it is almost 50% slower, corresponding to exactly one floating point instruction (fmul or fadd) per cycle.
Compared to my hand-optimized code described above, the suboptimal compiler-generated code performs only marginally worse, indicating that Barcelona's out-of-order execution does a very good job.
Whatever I do, I seem to hit a wall at one floating point instruction per cycle, instead of the two per cycle that the independent fadd and fmul pipelines should deliver.
What does it take to make full use of both the fadd and fmul pipelines in parallel?
Thanks a lot for any help!