I have a loop whose body I have arranged in groups of three instructions, each consisting of an addps, a mulps, and a movaps.
By sufficient unrolling and use of all 16 SSE registers, I can schedule the instructions to completely hide the 4-cycle latency of the adds and muls and the 2-cycle latency of the movs.
So basically my code looks like repeated iterations of
addps %xmm0, %xmm8
mulps %xmm12, %xmm4
movaps 32(%rsi), %xmm0
addps %xmm1, %xmm9
mulps %xmm13, %xmm5
addps %xmm2, %xmm10
mulps %xmm14, %xmm6
movaps %xmm0, %xmm1
addps %xmm3, %xmm11
mulps %xmm15, %xmm7
movaps %xmm0, %xmm2
movaps %xmm8, 32(%rdx)
movaps %xmm0, %xmm3
where in each iteration the registers are permuted (while honoring the 4-cycle latencies). All loads and stores are expected to hit L1 cache (except in the initial loop iteration, which can be neglected).
I would expect each group of three instructions to be executed in a single cycle on Barcelona. However, when I actually time the code it is almost 50% slower, corresponding to exactly one floating-point instruction (fmul or fadd) per cycle.
Compared to my hand-optimized code described above, the suboptimal compiler-generated code performs only marginally worse, indicating that Barcelona's out-of-order execution does a very good job.
Whatever I do, I seem to hit a wall at one floating-point instruction per cycle (instead of the two per cycle the independent fadd and fmul pipelines should deliver).
What does it take to make full use of both the fadd and fmul pipelines in parallel?
Thanks a lot for any help!
OK, through some stroke of luck I just got gcc to achieve 1.4 floating-point instructions per cycle.
I'll analyze tomorrow how it's managed to beat my hand-optimized code.
How much do you get using only mulps and addps?
I'm not sure, but I think Barcelona has an issue where the scheduler sometimes puts register-to-register movs in the same pipe as a mul or add, mixing instructions of different latencies in the same pipe. Also, the store takes two fstore slots, not one.
Thanks, Eduardo. I guess you are right about the scheduling anomaly.
Removing all movaps instructions, I get around 1.26 flop/cycle (and, of course, the wrong result; the new four-operand fmadd from SSE5 would be really handy here). It's true that the store takes two fstore slots. However, there are five slots available per iteration, so I don't expect this to be a bottleneck.
Interestingly, the compiler-generated variant (which I managed to improve to 1.47 flop/cycle) has one load operation more per iteration than the hand-optimized code. I'll experiment a little more with it; maybe I can get it even faster with more cache reads (there is plenty of unused cache-read bandwidth). But since the compiler-generated code has now reached almost 92% of what I expected from the hand-optimized code, I'll probably just be happy with that, and with the fact that I do not need to validate any assembler code.