
tschmielau
Journeyman III

More than one SSE floating point operation per cycle on Barcelona?

What does it take to execute 1 fadd + 1 fmul + 1 mov per cycle?

I have a loop whose body I have arranged in groups of three instructions, each consisting of an addps, a mulps, and a movaps.

By unrolling sufficiently and using all 16 SSE registers I can schedule the instructions to completely hide the 4-cycle latency of the adds and muls and the 2-cycle latency of the movs.

So basically my code looks like repeated iterations of

 addps %xmm0, %xmm8
 mulps %xmm12, %xmm4
 movaps 32(%rsi), %xmm0

 addps %xmm1, %xmm9
 mulps %xmm13, %xmm5

 addps %xmm2, %xmm10
 mulps %xmm14, %xmm6
 movaps %xmm0, %xmm1

 addps %xmm3, %xmm11
 mulps %xmm15, %xmm7
 movaps  %xmm0, %xmm2

 movaps %xmm8, 32(%rdx)
 movaps %xmm0, %xmm3

where in each iteration the registers are permuted (while honoring the 4-cycle latencies). All loads and stores are expected to hit L1 cache (except in the initial loop iteration, which can be neglected).
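For readers without the AT&T assembly in front of them, the same interleaving can be sketched with SSE intrinsics in C. This is only an illustration of the scheduling idea — four independent chains per pipe to cover the 4-cycle addps/mulps latency, with one fadd-pipe and one fmul-pipe operation paired per step — not the actual kernel from the post; all function and array names here are made up.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical kernel sketch: four independent add chains (s0..s3) and
   four independent mul chains (p0..p3), so that every step can issue one
   addps and one mulps with no chain waiting on its own 4-cycle latency. */
void add_mul_unrolled(const float *src, const float *scale,
                      float *sum, float *prod, int n /* multiple of 16 */)
{
    __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
    __m128 k  = _mm_loadu_ps(scale);
    __m128 p0 = _mm_loadu_ps(prod),     p1 = _mm_loadu_ps(prod + 4);
    __m128 p2 = _mm_loadu_ps(prod + 8), p3 = _mm_loadu_ps(prod + 12);

    for (int i = 0; i < n; i += 16) {
        /* Each line pairs one fadd-pipe op with one fmul-pipe op. */
        s0 = _mm_add_ps(s0, _mm_loadu_ps(src + i));      p0 = _mm_mul_ps(p0, k);
        s1 = _mm_add_ps(s1, _mm_loadu_ps(src + i + 4));  p1 = _mm_mul_ps(p1, k);
        s2 = _mm_add_ps(s2, _mm_loadu_ps(src + i + 8));  p2 = _mm_mul_ps(p2, k);
        s3 = _mm_add_ps(s3, _mm_loadu_ps(src + i + 12)); p3 = _mm_mul_ps(p3, k);
    }
    _mm_storeu_ps(sum, _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3)));
    _mm_storeu_ps(prod, p0);      _mm_storeu_ps(prod + 4, p1);
    _mm_storeu_ps(prod + 8, p2);  _mm_storeu_ps(prod + 12, p3);
}

/* Small demo with made-up data so the arithmetic is easy to check:
   16 source elements of 1.0 summed into one vector, and products
   starting at 2.0 scaled once by 3.0. */
float src_in[16], scale_in[4], sum_out[4], prod_out[16];

void run_demo(void)
{
    for (int i = 0; i < 16; i++) { src_in[i] = 1.0f; prod_out[i] = 2.0f; }
    for (int i = 0; i < 4; i++)  scale_in[i] = 3.0f;
    add_mul_unrolled(src_in, scale_in, sum_out, prod_out, 16);
    /* sum_out[j] == 4.0f, prod_out[j] == 6.0f */
}
```

Whether a compiler keeps this register allocation (and whether the hardware actually dual-issues it) is exactly the question of the thread, of course; the sketch only shows the dependency structure.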

I would expect each group of three instructions to be executed in a single cycle on Barcelona. However, when I actually time the code it is almost 50% slower, corresponding to exactly 1 floating point (fmul or fadd) instruction per cycle.

Compared to my hand-optimized code described above, the suboptimal compiler generated code performs only marginally worse, indicating that Barcelona's out-of-order execution does a very good job.

Whatever I do, I seem to hit a wall at one floating point instruction per cycle (instead of the 2 per cycle that the independent fadd and fmul pipelines should allow).

What does it take to make full use of both the fadd and fmul pipelines in parallel?

Thanks a lot for any help!


4 Replies
tschmielau
Journeyman III

OK, through some stroke of luck I just got gcc to achieve 1.4 floating point operations per cycle.

I'll analyze tomorrow how it's managed to beat my hand-optimized code.

eduardoschardong
Journeyman III

How much do you get using only mulps and addps?

I'm not sure, but I think Barcelona has an issue with the scheduler that sometimes puts register-to-register movs into the same pipe as a mul or add, mixing instructions of different latencies in one pipe. Also, the store takes two fstore slots, not one.


Thanks, Eduardo. I guess you are right about the scheduling anomaly.

Removing all movaps, I get around 1.26 flop/cycle (and of course the wrong result; the new 4-address fmad from SSE5 would be really handy here). It's true that the store takes two fstore slots. However, there are five slots available per iteration, so I don't expect this to be a bottleneck.
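To illustrate why a 4-address fmad would remove the movaps entirely: two-operand SSE arithmetic is destructive, so preserving a multiplicand across the multiply forces a register copy. A scalar sketch of the two dataflows (variable and function names are made up for illustration):

```c
#include <assert.h>

/* Two-operand style: mulps overwrites its destination, so b must be
   copied (the movaps) if it is still needed afterwards. */
float mul_add_two_operand(float a, float b, float c)
{
    float t = b;   /* movaps: copy b so the multiply can clobber it */
    t = t * c;     /* mulps t, c  (destructive) */
    a = a + t;     /* addps a, t  (destructive) */
    return a;
}

/* 4-address fmad style (d = a + b*c): the destination is distinct from
   all three sources, so no copy is needed and the mov disappears. */
float mul_add_fmad(float a, float b, float c)
{
    return a + b * c;
}
```

(A real fused multiply-add also rounds only once, but for the throughput argument in this thread the interesting part is the eliminated register copy, not the rounding.)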

Interestingly, the compiler-generated variant (which I managed to improve to 1.47 flop/cycle) has one more load operation per iteration than the hand-optimized code. I'll experiment a little more with it; maybe I can make it even faster with more cache reads (there is plenty of unused cache-read bandwidth). But since the compiler-generated code has now reached almost 92% of what I expected from the hand-optimized code, I'll probably just be happy with that, and with the fact that I don't need to validate any assembler code.

extremeseos0007
Journeyman III

It's a nice piece of code. Right now I don't have a solution, but I will try to provide one.
