3 Replies Latest reply on Nov 15, 2007 8:43 PM by eduardoschardong

    SSE optimizations not giving me much

      I have been optimizing some code (kd-tree traversal) and decided to give SSE via intrinsics a try (a technique called packet tracing). The code is complete and correct, but the performance is disappointing (~12% speed improvement); I was expecting a much larger increase.

      Running the exact same code on Intel gives me a much better increase (>24%), so I guess there is something else I can do to improve the code in question.

      What I am looking for is some advice on how to better tune the code for my Opteron servers, and wisdom on how to better utilize SSE extensions on these procs. At a different forum somebody mentioned that the problem could be register pressure, and to be completely honest I am not sure what I should be looking at.

      I have run CodeAnalyst, but I am not clear on what I need to measure or where the highest penalties are. Retired instructions alone are not giving me enough information. What other data do people look at in situations like this?
        • SSE optimizations not giving me much
          First, note that on first- and second-generation Opterons the FPU is only 64 bits wide, so every packed SSE instruction is broken into at least two macro-ops. Don't expect the same performance gain as on processors with 128-bit-wide FPUs, such as Core 2 and third-generation Opteron. Also, if you are going from the classic (x87) FPU to SSE, keep in mind that Intel processors take a big penalty with the FPU stack registers, so switching to SSE shows a bigger improvement on them. And which precision are you using (single or double)?

          It is hard to explain how to use profiling to help... Usually I look at the source code first to see what could be happening. Anyway, let's try:

          I would start by looking at four events in the event-based profile: "Retired Instructions" (a), "Retired uops" (b), "Retired fastpath double op instructions" (c) and "CPU clocks not halted" (d).
          Use d as the reference. If c is greater than 1 per cycle, the FPU may be saturated. If a is close to 3 per cycle, the decoders may be saturated. If a, b and c are all small, the problem may be dependencies, cache misses, branch mispredictions or something else.
          If the execution units or decoders are saturated, the only way to solve it is to reduce the number of instructions.
          If a is close to c, then your code could see a big improvement from using SSE.
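
          The first-step checks above can be sketched in C as follows. This is only an illustration of the rules of thumb from this post: the struct and function names are hypothetical, the inputs are assumed to be real event totals (not raw sample counts), and the 2.7 cutoff for "close to 3 per cycle" is an assumption, not a documented limit.

```c
#include <assert.h>

/* Event totals for one function (hypothetical layout).
   Assumed to be real event counts, not raw profiler sample counts. */
typedef struct {
    double retired_instructions;  /* (a) */
    double retired_uops;          /* (b), kept for reporting only */
    double fastpath_double_ops;   /* (c) */
    double cpu_clocks;            /* (d), the reference */
} EventCounts;

typedef enum {
    BOTTLENECK_FPU,       /* c above 1 per cycle */
    BOTTLENECK_DECODERS,  /* a close to 3 per cycle */
    BOTTLENECK_OTHER      /* dependencies, cache misses, branches, ... */
} Bottleneck;

/* Normalize an event total by the number of unhalted CPU clocks. */
double per_cycle(double events, double cpu_clocks) {
    assert(cpu_clocks > 0.0);
    return events / cpu_clocks;
}

/* Apply the two saturation checks; anything else is "other". */
Bottleneck classify(const EventCounts *e) {
    const double near_three = 2.7;  /* assumed cutoff for "close to 3" */
    double ipc = per_cycle(e->retired_instructions, e->cpu_clocks);
    double fpd = per_cycle(e->fastpath_double_ops, e->cpu_clocks);

    if (fpd > 1.0) {
        return BOTTLENECK_FPU;
    }
    if (ipc >= near_three) {
        return BOTTLENECK_DECODERS;
    }
    return BOTTLENECK_OTHER;
}
```

          A result of BOTTLENECK_OTHER only says that neither saturation check fired; finding the actual cause still needs the further steps.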

          In the next step I would look at other events, like "Data Cache Misses" (a), "L2 Cache Misses" (b) and "Retired mispredicted branch instructions" (c).
          If a or b is too high, then there are too many cache misses: check your array accesses and consider using prefetch instructions. If c is too high, try reducing the number of "ifs" in your program.

          For a 2GHz Opteron core, about 50, 5 and 30 million events per second are too high for a, b and c respectively.
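
          As a rough sketch, the second-step thresholds can be checked like this. The function names are hypothetical, the counts are assumed to be real event totals, and the limits are just the approximate figures quoted above for a 2GHz Opteron core; they are rules of thumb, not hard limits.

```c
#include <assert.h>
#include <stdbool.h>

/* Approximate "too high" rates from the post, in events per second. */
#define DCACHE_MISS_LIMIT    50e6
#define L2_MISS_LIMIT         5e6
#define BRANCH_MISPRED_LIMIT 30e6

/* Convert an event total into events per second, using the number of
   unhalted CPU clocks and the core clock rate in Hz. */
double events_per_second(double events, double cpu_clocks, double clock_hz) {
    assert(cpu_clocks > 0.0);
    assert(clock_hz > 0.0);
    return events * clock_hz / cpu_clocks;
}

/* True when the measured rate is above the given rule-of-thumb limit. */
bool rate_too_high(double events, double cpu_clocks, double clock_hz,
                   double limit_per_second) {
    return events_per_second(events, cpu_clocks, clock_hz) > limit_per_second;
}
```

          For example, rate_too_high(misses, clocks, 2e9, DCACHE_MISS_LIMIT) flags a function whose data cache miss rate is above roughly 50 million per second on a 2GHz core.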

          Finally, look at the pipeline simulation. This one is hard to explain: search for stalled cycles and their causes. Usually the cause is long dependency chains.
            • SSE optimizations not giving me much
              Thank you Eduardo, that is the type of information I am looking for.
              Is there a good reference for which numbers are normal and which are not? When I run the profiler I get a lot of different numbers, and it is hard to tell which are normal and which have rates that could be improved. You obviously know what to look for, so I wonder if you can suggest any good references for correlating the data.

              I am using floats (single precision)

              As far as the events you mentioned, this is what I got on my top function:

              CPU clocks: 134994
              Ret inst: 1263650
              Ret uops: 20241290
              Ret fastpath double op: 3857773
                • SSE optimizations not giving me much
                  There isn't a general rule for what is good and what is bad; everything depends on the source code. Those profiles just tell you what is happening. If it is different from what you were expecting, then something is wrong.

                  Also, when I wrote the post above I didn't have the numbers from CA in front of me, and I completely forgot that they aren't so user-friendly...

                  I think a good estimate for those numbers, in events per clock, would be:
                  Instructions per clock: 0.47
                  uops per clock: 0.75
                  Fast path double: 0.29

                  In my opinion there is a good number of packed instructions in the code, so SSE could bring a big gain, but there is also something limiting it, like cache misses, memory bandwidth, etc.
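
                  The per-clock estimates above are not simple ratios of the raw figures posted earlier. A minimal sketch of one way to do such a conversion follows, assuming the profiler reports sample counts and that each event is sampled at its own interval. The function name is hypothetical, and the intervals actually used in this run are not given in the thread.

```c
#include <assert.h>

/* Estimate events per clock from sample counts.
   Assumption: each sample stands for `interval` occurrences of its event,
   so samples * interval approximates the real event total. */
double events_per_clock(double event_samples, double event_interval,
                        double clock_samples, double clock_interval) {
    assert(event_interval > 0.0);
    assert(clock_samples > 0.0);
    assert(clock_interval > 0.0);
    return (event_samples * event_interval) /
           (clock_samples * clock_interval);
}
```

                  If all events happen to use the same sampling interval, the intervals cancel and the plain ratio of sample counts is enough.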

                  BTW, what processor are you using?

                  Also, without the source code, or even knowing what the code does, I'm just guessing at what could be happening...