Discussion created on Nov 13, 2007
I have been optimizing some kd-tree traversal code and decided to give SSE via intrinsics a try (the technique known as packet tracing). The code is complete and correct, but the performance is disappointing: only about a 12% speed improvement, when I was expecting a much larger increase.

Running the exact same code on an Intel processor gives me a much better improvement (>24%), so I suspect there is more I can do to improve the code in question.

What I am looking for is some advice on how to better tune the code for my Opteron servers, and wisdom on how to better utilize the SSE extensions on these processors. On a different forum somebody mentioned that the problem could be register pressure, and to be completely honest I am not sure what I should be looking at.

I have run CodeAnalyst, but I am not clear on what I need to measure or where the highest penalties are. The instructions-retired counts alone are not giving me enough information. What other data do people look at in situations like this?