It sounds like you are doing some interesting optimization work. There is quite a bit of ongoing activity in this area; some of the citations and book links below may help you compare your design and implementation with other designs that have been implemented on our hardware.
It sounds, however, as though your main concern is the design of our memory interface, and that you are memory bound.
Some of the papers below may help you understand some of the tradeoffs you’ve made in blocking and packing your data structures. You may also want to consult AMD’s Family 10h BKDG (BIOS and Kernel Developer’s Guide) for the specifics of our memory hierarchy.
Regarding the implementation of our memory hierarchy: we do not currently support FB-DIMMs, for a number of business and customer-related reasons, but it is on our roadmap.
We appreciate your concern, and thank you for your input.
By Michael J. Voss
Also, I followed all of AMD's recommended cache-efficiency optimizations and techniques, but I could not see a boost of more than 2x! And when my code was tested on an Intel E8600 (only two cores), I noticed a 4.5x performance boost compared to the first version! How can that be? My code is memory intensive, and Core 2 has an off-chip memory controller with only one 128-bit channel! I tried both unganged and ganged modes on the Phenom, but got only 2x and that's all! It is very strange! Basically, I hate Intel processors and have decided to buy a Phenom II 940 instead of an i7 920! I hope AMD fixes some of the memory-to-cache transfer issues in the second generation of Phenom! But I strongly recommend that AMD use memory with a multichannel serial interface! Only one thing makes me happy: my code runs faster on a Phenom 9950 (@ 3 GHz) than on one node of an SGI Altix (Xeon Clovertown based)!
Debugging, testing, profiling. There is no other way to achieve maximum performance.
How do I profile an application when the loop is pure SSE2/SSE3 code parallelized with OpenMP? All data is aligned to 16 bytes and packed for efficient cache use!
This is a case where performance depends entirely on the architecture of the CPU!
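One rough way to approach the profiling question, before reaching for a full profiler, is to time the parallel loop directly and convert the result into effective memory bandwidth, which can then be compared against what the platform should deliver. The sketch below is only an illustration under stated assumptions: the kernel is a STREAM-style triad standing in for the real loop (it is not the poster's code), the names `triad` and `measure_triad_gbs` are hypothetical, and the byte count ignores write-allocate traffic, so the reported figure is an estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Wall-clock seconds: omp_get_wtime() under OpenMP, clock() otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Stand-in kernel (assumption): a[i] = b[i] + s * c[i].
   n must be even and all pointers 16-byte aligned. */
static void triad(double *a, const double *b, const double *c,
                  double s, long n)
{
    long i;
#ifdef __SSE2__
    const __m128d vs = _mm_set1_pd(s);
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i += 2) {
        __m128d vb = _mm_load_pd(b + i);   /* aligned 16-byte loads */
        __m128d vc = _mm_load_pd(c + i);
        _mm_store_pd(a + i, _mm_add_pd(vb, _mm_mul_pd(vs, vc)));
    }
#else
    /* Scalar fallback for builds without SSE2. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
#endif
}

/* Runs the kernel reps times over n doubles per array and returns the
   best observed bandwidth in GB/s (0.0 if the timer was too coarse,
   -1.0 if allocation failed). The sum of a[] is stored in *checksum
   so the result can be verified. */
double measure_triad_gbs(long n, int reps, double *checksum)
{
    long i;
    int r;
    double best = 0.0, sum = 0.0;
    double *a, *b, *c;

    assert(n > 0 && reps > 0);
    n += n & 1;  /* round up to even so the size is a multiple of 16 bytes */
    a = aligned_alloc(16, (size_t)n * sizeof *a);
    b = aligned_alloc(16, (size_t)n * sizeof *b);
    c = aligned_alloc(16, (size_t)n * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return -1.0; }

    /* Initialize with the same static schedule as the kernel. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    for (r = 0; r < reps; r++) {
        double t0 = now_seconds();
        triad(a, b, c, 3.0, n);
        double dt = now_seconds() - t0;
        if (r == 0 || dt < best) best = dt;  /* keep the fastest pass */
    }

    for (i = 0; i < n; i++) sum += a[i];
    if (checksum) *checksum = sum;
    free(a); free(b); free(c);

    /* Two arrays read and one written per pass. */
    return best > 0.0 ? 3.0 * n * sizeof(double) / best / 1e9 : 0.0;
}
```

Calling `measure_triad_gbs` with arrays much larger than the last-level cache (for example several million elements) and with different values of `OMP_NUM_THREADS`, after building with OpenMP enabled (`-fopenmp` on GCC), should show whether bandwidth stops scaling after one or two threads. If it does, the loop is limited by the memory system rather than by the SSE code itself.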