I am also new to SSE..
I implemented a small matrix multiplication using SSE2. It runs twice as fast on my Intel CPUs. But on AMD, the implementation runs slower than the non-SSE one... :-(
Yes, it is double-precision math... (hence the 2x speedup)
The matrices are 20x20, so both can be fully contained in the caches.
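For reference, the kernel I mean looks roughly like this (a sketch of the approach, not my exact code -- the name `matmul_sse2` and the row-major layout are just for illustration):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

#define N 20

/* Multiply two NxN row-major double matrices, two doubles per SSE2 op.
 * Loads/stores are unaligned (_mm_loadu_pd/_mm_storeu_pd) to keep the
 * sketch simple; 16-byte-aligned rows would allow the aligned variants. */
static void matmul_sse2(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 2) {     /* two output columns at once */
            __m128d acc = _mm_setzero_pd();
            for (int k = 0; k < N; k++) {
                /* broadcast a[i][k], multiply by b[k][j..j+1], accumulate */
                __m128d av = _mm_set1_pd(a[i][k]);
                __m128d bv = _mm_loadu_pd(&b[k][j]);
                acc = _mm_add_pd(acc, _mm_mul_pd(av, bv));
            }
            _mm_storeu_pd(&c[i][j], acc);
        }
    }
}
```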
The Intel L1 cache is 8-way associative, whereas AMD's is 2-way associative (but with more sets in AMD's case).
I use non-temporal writes to write out the result matrix so that the caches are not polluted. (http://lwn.net/Articles/255364/ - for info on non-temporal writes)
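A non-temporal row store looks something like this (again just a sketch, not my exact code; note that `_mm_stream_pd` requires a 16-byte-aligned destination):

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_pd, _mm_sfence */

/* Write one result row with non-temporal stores so it bypasses the cache
 * instead of evicting the working set. dst must be 16-byte aligned and n
 * even. The sfence makes the streamed data visible before later reads. */
static void store_row_nt(double *dst, const double *src, int n)
{
    for (int j = 0; j < n; j += 2)
        _mm_stream_pd(&dst[j], _mm_loadu_pd(&src[j]));
    _mm_sfence();
}
```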
But still, I wonder why the AMD Athlon X2 dual core is not enjoying SSE2.
AFAIK the K8 family has a much slower SSE unit than K10 (Phenoms etc.) and roughly half the bandwidth to the L1.
Phenom can do one operation on a whole 128-bit SSE register in 1 cycle, while the old K8 needs two cycles (it does its thing in 64-bit chunks)...
Also, the mere 2-way associativity and the lack of fully functional prefetcht0/t1/t2 instructions are a total bummer.
If only the L1 were 4-way associative (even if smaller), life would be a whole lot easier.
As things are now, practically every task needs its data from at least two different locations, which means that both L1 ways are always busy.
So, if you try to prefetch some data, it will go into L1 and almost certainly push out data that is actually needed and will be requested almost immediately.
Even with 2-way associativity of L1, life would be much easier if one could prefetch into L2 or L3, but currently all the prefetch hint levels alias to L1...
I know that L1 cache is expensive in silicon real-estate terms, since it has to be fast and associativity doesn't come cheap, but really, 2-way is too low.
4-way would be much better, especially with working prefetch instructions...
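For anyone following along, the hint levels are exposed via `_mm_prefetch` -- here is a rough sketch (the 16-element lookahead distance is an arbitrary illustration, and as noted above, K8 treats all the hints the same, so the level only matters on CPUs that honor it):

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_* */

/* Sum an array while prefetching a couple of cache lines ahead.
 * _MM_HINT_T0 targets all cache levels, _MM_HINT_T1/_MM_HINT_T2 target
 * levels further from the core, _MM_HINT_NTA is non-temporal. */
static double sum_with_prefetch(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T1);
        s += a[i];
    }
    return s;
}
```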
Thanks for your note.. I did not know about the K8 vs. K10 difference. Good to know!
The Intel L1 is 8-way associative.. That's simply too good!
But I find the AMD desktop to be super-fast. Maybe the deeper, less-associative cache is good for Word-like applications, and the broader, more-associative cache is good for scientific apps. Just my musing...