3 Replies Latest reply on Aug 11, 2009 12:21 AM by sarnath

    SSE performance mismatch

      I have been optimizing some code (kd-tree traversal) and decided to give SSE via intrinsics a try (a technique called packet tracing). The code is complete and correct, but performance is disappointing (~12% speed improvement); I was expecting a much larger increase.
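
      For readers unfamiliar with packet tracing, here is a minimal sketch of the idea, not the actual code from this post: four rays are tested against one kd-tree split plane at once. The function name, its parameters, and the one-axis layout are all assumptions made for illustration.

```c
#include <xmmintrin.h>  /* SSE single-precision intrinsics */

/* Hypothetical helper (not from the original code): for a packet of 4 rays,
   compute the distance t to a kd-tree split plane along one axis and return
   a 4-bit mask with bit i set when ray i reaches the plane within [0, t_max]. */
static int packet_plane_mask(const float origin[4], const float inv_dir[4],
                             float split, const float t_max[4])
{
    __m128 o  = _mm_loadu_ps(origin);   /* ray origins on the split axis */
    __m128 id = _mm_loadu_ps(inv_dir);  /* 1 / direction on the split axis */
    __m128 tm = _mm_loadu_ps(t_max);    /* current far distance per ray */

    /* t = (split - origin) * inv_dir, for all 4 rays in one go */
    __m128 t = _mm_mul_ps(_mm_sub_ps(_mm_set1_ps(split), o), id);

    /* a lane is all-ones when 0 <= t <= t_max */
    __m128 hit = _mm_and_ps(_mm_cmpge_ps(t, _mm_setzero_ps()),
                            _mm_cmple_ps(t, tm));

    /* collapse the 4 lane sign bits into an integer mask */
    return _mm_movemask_ps(hit);
}
```

      The mask then decides whether the packet visits the near child, the far child, or both. Every value in this sketch is live at the same time, so a real traversal loop with three axes and a stack can run short of the 8 XMM registers available in 32-bit mode (16 in 64-bit mode), which is one way to read the "register pressure" remark.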

      Running the exact same code on Intel gives me a much better increase (>24%), so I guess there is something else I can do to improve the code in question.

      What I am looking for is some advice on how to better tune the code for my Opteron servers, and wisdom on how to better utilize SSE extensions on these processors. On a different forum somebody mentioned that the problem could be register pressure, and to be completely honest I am not sure what I should be looking at.

      I have run CodeAnalyst, but I am not clear on what I need to measure or where the highest penalties are. Instructions retired alone does not give me enough information. What other data do people look at in situations like this?
        • SSE performance mismatch

          I am also new to SSE.

          I implemented a small matrix multiplication using SSE2. It runs twice as fast on my Intel CPUs. But on AMD, the implementation runs slower than the non-SSE one... :-(

          Yes, it is double-precision math (two doubles per SSE2 register, hence the 2x speedup).

          The matrices are 20x20, so both can be fully contained in the caches.

          The Intel L1 cache is 8-way associative, whereas the AMD one is 2-way associative (but with more sets).

          I use non-temporal writes to write out the result matrix so that the caches are not polluted. (http://lwn.net/Articles/255364/ - for info on non-temporal writes)
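
          To make the setup concrete, here is a rough sketch of such a kernel. It is not the original implementation: the function name, the row-major layout, and the choice of unaligned loads for B are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 double-precision intrinsics */

enum { N = 20 };  /* matrix dimension from the post */

/* Sketch: C = A * B for row-major N x N double matrices, two result
   columns per SSE2 register, result written with non-temporal stores.
   Assumption: C is 16-byte aligned, which _mm_stream_pd requires. */
static void matmul20_sse2(const double *A, const double *B, double *C)
{
    assert(((uintptr_t)C & 15) == 0);

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; j += 2) {
            __m128d acc = _mm_setzero_pd();
            for (int k = 0; k < N; ++k) {
                __m128d a = _mm_set1_pd(A[i * N + k]);    /* broadcast A[i][k] */
                __m128d b = _mm_loadu_pd(&B[k * N + j]);  /* B[k][j], B[k][j+1] */
                acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
            }
            /* non-temporal store: writes C[i][j..j+1] without caching it */
            _mm_stream_pd(&C[i * N + j], acc);
        }
    }
    _mm_sfence();  /* make the streaming stores visible before returning */
}
```

          One thing worth checking with a kernel like this: streaming stores bypass the cache, so if the 20x20 result is read again right away, ordinary stores (_mm_store_pd) may turn out to be the better choice.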

          But still, I wonder why the AMD Athlon X2 dual core is not enjoying SSE2.

          Any ideas?

            • SSE performance mismatch

              AFAIK the K8 family has a much slower SSE unit than K10 (Phenom etc.) and roughly half of the bandwidth to the L1.

              Phenom can do one operation on a whole 128-bit SSE register in 1 cycle, while the old K8 needs two cycles (it does its thing in 64-bit chunks)...

              Also, the merely 2-way associativity and the lack of fully functional PrefetchLevel instructions are a total bummer.

              If only L1 were 4-way associative (even if smaller), life would be a whole lot easier.

              As things are now, practically every task needs its data from at least two different locations, which means that both of the L1 ways are always busy.

              So, if you try to prefetch some data, it will be prefetched into L1 and almost certainly push out data that is actually needed and that will be requested almost immediately.
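
              The conflict described above can be checked with a little arithmetic. The sketch below assumes the K8 L1 data cache geometry (64 KB, 2-way, 64-byte lines) and a simple modulo index; the constant and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed K8 L1 data cache geometry: 64 KB, 2-way, 64-byte lines. */
enum {
    LINE_BYTES  = 64,
    WAYS        = 2,
    CACHE_BYTES = 64 * 1024,
    SETS        = CACHE_BYTES / (LINE_BYTES * WAYS)  /* 512 sets */
};

/* Hypothetical helper: which set an address maps to, assuming the index
   is simply the line number modulo the number of sets. */
static unsigned l1_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % SETS);
}
```

              With these numbers, addresses that are a multiple of 32 KB apart (SETS * LINE_BYTES) land in the same set. Two input streams at such a distance already fill both ways, so a third line brought in by a prefetch has to evict one of them.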

              Even with a 2-way associative L1, life would be much easier if one could prefetch to L2 or L3, but currently all PrefetchLevel instructions alias to L1...
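
              For reference, this is roughly how the prefetch levels are requested from C. It is only a sketch: the function name and the `ahead` parameter are invented here, and whether the T1 hint really keeps the line out of L1 depends on the CPU (on K8, per the above, it does not).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Sketch: sum an array while prefetching `ahead` elements in front of the
   current position. _MM_HINT_T1 asks for the line in L2 and outward;
   _MM_HINT_T0, _MM_HINT_T2 and _MM_HINT_NTA are the other hints. */
static double sum_with_prefetch(const double *data, size_t n, size_t ahead)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* one prefetch per 64-byte line (8 doubles) is enough */
        if ((i % 8) == 0 && i + ahead < n)
            _mm_prefetch((const char *)&data[i + ahead], _MM_HINT_T1);
        sum += data[i];
    }
    return sum;
}
```

              A prefetch is only a hint and never changes the result, so the one thing to tune is the distance; on a 2-way L1 a distance that is too generous can do more harm than good.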

              I know that L1 cache is expensive in silicon real estate terms, since it has to be fast and associativity doesn't come cheap, but really, 2-way is too low.

              4-way would be much better, especially with working prefetch instructions...




                • SSE performance mismatch


                  Thanks for your note. I did not know about the K8 and K10 thing. Good to know!

                  The Intel L1 is 16-way associative.. That's simply too good!

                  But I find the AMD desktop to be super-fast. Maybe the deeper (less associative) cache is good for word-like applications and the broader (more associative) caches are good for scientific apps. Just my musing...


                  Best Regards,