Running the exact same code on Intel gives me a much better increase (>24%), so I guess there is something else I can do to improve the code in question.

What I am looking for is some advice on how to better tune the code for my Opteron servers, and wisdom on how to better utilize SSE extensions on these procs. On a different forum somebody mentioned that the problem could be register pressure, and to be completely honest I am not sure what I should be looking at.

I have run CodeAnalyst, but I am not clear on what I need to measure or where the highest penalties are. Instructions retired are not giving me enough information. What other data do people look at in situations like this?

I am also new to SSE..

I implemented small matrix multiplication using SSE2. It runs twice as fast on my Intel CPUs. But on AMD, the implementation runs slower than the non-SSE one... :-(

Yes, it is double-precision math... (and hence the 2x speedup)

The matrices are 20x20 in dimension... Both matrices can be fully contained in the caches.
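For readers following along, here is a minimal sketch of the kind of kernel being described. It is not the code from this post: it assumes row-major storage, uses unaligned loads so it works on any buffer, and the names matmul_sse2 and matmul_scalar are hypothetical. Each SSE2 register holds two doubles, so the inner loop produces two output columns at a time.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

#define N 20  /* matrix dimension from the post; must be even for this sketch */

/* Hypothetical helper: C = A * B for row-major N x N double matrices. */
static void matmul_sse2(const double *a, const double *b, double *c)
{
    assert(N % 2 == 0);  /* two doubles per 128-bit register */
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; j += 2) {
            __m128d acc = _mm_setzero_pd();
            for (int k = 0; k < N; ++k) {
                /* Broadcast A[i][k] into both lanes. */
                __m128d aik = _mm_set1_pd(a[i * N + k]);
                /* Load B[k][j] and B[k][j+1] (unaligned load). */
                __m128d bkj = _mm_loadu_pd(&b[k * N + j]);
                acc = _mm_add_pd(acc, _mm_mul_pd(aik, bkj));
            }
            /* Write C[i][j] and C[i][j+1]. */
            _mm_storeu_pd(&c[i * N + j], acc);
        }
    }
}

/* Scalar reference with the same summation order, for comparison. */
static void matmul_scalar(const double *a, const double *b, double *c)
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
    }
}
```

Timing both functions on the same inputs is one way to reproduce the Intel vs. AMD comparison; the actual implementation in question may be organized quite differently.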

The Intel L1 cache is 8-way associative, whereas the AMD one is 2-way associative (but with more sets in the AMD case).
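The "fits in cache" and "more sets" remarks can be checked with a little arithmetic. The sketch below is only illustrative: the cache sizes and line size are not given in this thread, so the figures used in the note after the code are assumptions, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets in a set-associative cache: size / (ways * line size). */
static size_t cache_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    return size_bytes / (ways * line_bytes);
}

/* Bytes occupied by `count` square n x n matrices of doubles. */
static size_t matrices_bytes(size_t n, size_t count)
{
    return count * n * n * sizeof(double);
}
```

Assuming a 64 KB, 2-way L1 data cache with 64-byte lines on the AMD side, cache_sets(64 * 1024, 2, 64) gives 512 sets; assuming a 32 KB, 8-way cache with 64-byte lines on the Intel side, cache_sets(32 * 1024, 8, 64) gives 64 sets. Three 20x20 double matrices (two inputs and the result) take matrices_bytes(20, 3) = 9600 bytes, which is well under either assumed size.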

I use non-temporal writes to write out the result matrix so that the caches are not polluted. (http://lwn.net/Articles/255364/ - for info on non-temporal writes)
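As an illustration of what a non-temporal write looks like with SSE2 intrinsics, here is a minimal sketch. It is not the code from this post; store_row_nt is a hypothetical helper that assumes the destination is 16-byte aligned and the element count is even, since _mm_stream_pd writes two doubles at a time and requires an aligned address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_pd, _mm_loadu_pd */

/* Hypothetical helper: copy n doubles from src to dst using non-temporal
 * stores, so the written lines are not kept in the cache hierarchy.
 * Assumes dst is 16-byte aligned and n is even. */
static void store_row_nt(double *dst, const double *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);  /* _mm_stream_pd needs alignment */
    assert(n % 2 == 0);
    for (size_t j = 0; j < n; j += 2) {
        __m128d v = _mm_loadu_pd(&src[j]);
        _mm_stream_pd(&dst[j], v);        /* non-temporal 128-bit store */
    }
    _mm_sfence();  /* make the streamed stores globally visible in order */
}
```

In a multiplication kernel the value passed to _mm_stream_pd would typically be the accumulator register rather than data reloaded from memory; the copy form is used here only to keep the example small.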

But still, I wonder why the AMD Athlon X2 dual core is not enjoying SSE2.

Any ideas?