First, note that on first- and second-generation Opterons the FPU is only 64 bits wide, so every packed SSE instruction is broken into at least two macro-ops; don't expect the same performance gain as on processors with a 128-bit-wide FPU like the Core 2 and the third-generation Opteron. Also, if you are going from the classic x87 FPU to SSE, keep in mind that Intel processors take a big penalty from the FPU's stack registers, so switching to SSE shows a bigger improvement on them. Also, which precision are you using (single vs. double)?
It is hard to explain how to use profiling to help... Usually I look at the source code first to see what could be happening. Anyway, let's try:
I would start by looking at four events in the event-based profile: "Retired Instructions" (a), "Retired uops" (b), "Retired fastpath double op instructions" (c) and "CPU clocks not halted" (d).
Use d as the reference: if c is greater than 1 per cycle the FPU may be saturated; if a is close to 3 per cycle the decoders may be saturated; if a, b and c are all small numbers, the problem may be dependencies, cache misses, branch mispredictions or something else.
If the units or decoders are saturated, the only way to solve it is to reduce the number of instructions.
If a is close to c, then your code could get a big improvement from using SSE.
In the next step I would look at other events like "Data Cache Misses" (a), "L2 Cache Misses" (b) and "Retired mispredicted branch instructions" (c).
If a or b is too high then there are too many cache misses; check your array accesses and consider using prefetch instructions. If c is too high, try reducing the number of "ifs" in your program.
For a 2 GHz Opteron core, about 50, 5 and 30 million events per second are too high for a, b and c respectively.
At the end... look at the pipeline simulation. This one is hard to explain: search for stalled cycles and their causes; usually the cause is a long dependency chain.
Thank you Eduardo, that is the type of information I am looking for.
Is there a good reference for which numbers are normal and which are not? When I run the profiler I get a lot of different numbers, and it is hard to tell which are normal and which have rates that could be improved. You obviously know what to look for, so I wonder if you have suggestions for good references to correlate the data.
I am using floats (single precision).
As far as the events you mentioned, this is what I got on my top function:
CPU clocks: 134994
Ret inst: 1263650
Ret uops: 20241290
Ret fastpath double op: 3857773
There isn't a general rule for what is good and what is bad; everything depends on the source code. Those profiles just tell you what is happening; if it is different from what you were expecting, then something is wrong.
Also, when I wrote the post above I didn't have the numbers from CA; I completely forgot that they aren't so user-friendly...
I think a good estimate for those numbers, per clock, would be:
Instructions per clock: 0.47
uops per clock: 0.75
Fast path double: 0.29
In my opinion there is a good number of packed instructions in the code, so SSE could bring a big gain, but there is also something limiting it, like cache misses, memory bandwidth, etc.
BTW, what processor are you using?
Also, without the source code, or even knowing what the code does, I'm just guessing at what could be happening...