5 Replies Latest reply on Apr 19, 2009 7:15 AM by jack2009

    Multithreading performance


      Hi, I am seeing no performance boost when using OpenMP in my code!

      The first time I wrote it, I saw a x2 performance boost on a Phenom 9950 under Windows Vista x64. Since then I have modified the code a lot (added SSE3 support in critical sections such as the FFT, and more parallelization), but when I test it I still see only the same x2 boost! I develop the software on my laptop with an AMD Athlon X2 QL-62, and there I see a performance boost with each version. So why does this happen? Probably it is poor memory-cache organization! I use all the special techniques like data alignment and compact packing. So why not implement a SERIAL CONNECTION TO MEMORY in your CPUs? You could collaborate with memory manufacturers to develop, for example, a memory module with 4 or more serial channels (it could be faster than today's parallel interface), so that each core can access memory at the same time!

      WHY NOT???

        • Multithreading performance

          Hi Godsic,


          It sounds like you are doing some interesting optimization work.  There is quite a bit of ongoing activity in this area; the citations and book links below may help you compare your design and implementation with other designs that have been implemented on our hardware.


          It sounds, however, as though your main concern is the design of our memory interface, and that you are memory bound.
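
One rough way to check whether a code is memory bound is to time a kernel that does almost no arithmetic per byte moved and see how it scales with thread count. The following is a minimal sketch in the style of the STREAM triad, not code from this thread; the function names `triad`, `triad_bytes`, and `bandwidth_gbs` are made up for illustration.

```c
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + s * c[i].
   Each iteration does two loads and one store with very little arithmetic,
   so on large arrays its speed is set mainly by memory bandwidth. */
void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    /* The pragma is ignored if the compiler is not run with OpenMP enabled. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = b[i] + s * c[i];
}

/* Bytes read and written by one triad call
   (a simple count that ignores write-allocate traffic). */
double triad_bytes(size_t n)
{
    return 3.0 * sizeof(double) * (double)n;
}

/* Effective bandwidth in GB/s for one triad call that took `seconds`. */
double bandwidth_gbs(size_t n, double seconds)
{
    return triad_bytes(n) / seconds / 1e9;
}
```

To use it, allocate arrays much larger than the last-level cache, time `triad` (for example with `omp_get_wtime()`), and repeat with `OMP_NUM_THREADS` set to 1, 2, and 4. If the reported bandwidth stops growing after two threads, the extra cores are most likely waiting on memory, and a similar ceiling would be expected for other bandwidth-limited loops.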


          Some of the papers below may help you understand some of the tradeoffs you’ve made in blocking and packing your data structures.  You may also want to consult AMD’s Family 10h BKDG (BIOS and Kernel Developer's Guide) for the specifics of our memory hierarchy.
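
As an illustration of what "blocking" means here, below is a minimal sketch of a cache-blocked matrix transpose. It is an assumed example, not the poster's code: the function name `transpose_blocked` and the tile size `BLOCK` are hypothetical, and a good tile size depends on the cache sizes of the target CPU.

```c
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile edge; tune for the target cache */

/* Transpose an n x n row-major matrix one BLOCK x BLOCK tile at a time,
   so that the lines touched in src and dst stay in cache while a tile
   is being processed. */
void transpose_blocked(double *dst, const double *src, size_t n)
{
    /* Each thread handles whole tile rows; tiles write disjoint parts of dst. */
    #pragma omp parallel for schedule(static)
    for (long ii = 0; ii < (long)n; ii += BLOCK) {
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            /* Clamp tile bounds so n need not be a multiple of BLOCK. */
            size_t i_end = (size_t)ii + BLOCK < n ? (size_t)ii + BLOCK : n;
            size_t j_end = jj + BLOCK < n ? jj + BLOCK : n;
            for (size_t i = (size_t)ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

The tradeoff is that smaller tiles fit in cache more easily but add loop overhead, while larger tiles reduce overhead but may start evicting their own data. Comparing run times for a few values of `BLOCK` is usually the simplest way to pick one.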


          Regarding the implementation of our memory hierarchy: we do not currently support FB-DIMMs, for a number of business and customer-related reasons, but it is on our roadmap.


          We appreciate your concern, and thank you for your input.










          OpenMP Shared Memory Parallel Programming

           By Michael J. Voss

            • Multithreading performance

              Also, I follow all the optimizations AMD recommends for cache efficiency, but I could not get a boost of more than x2! Also, when my code was tested on an Intel E8600 (only two cores), I saw a x4.5 performance boost compared to the first version! How can that be? My code is memory intensive, and Core 2 has an off-chip memory controller with only one 128-bit channel! I tried both unganged and ganged modes on the Phenom, but got only x2 and that's all. It is very strange! Basically I hate Intel processors and decided to buy a Phenom II 940 instead of an i7 920. I hope AMD fixed some memory-cache transfer issues in the second generation of Phenom! But I strongly recommend that AMD use memory with a multichannel serial interface! Only one thing makes me happy: my code runs faster on the Phenom 9950 (@ 3 GHz) than on one node of an SGI Altix (Xeon Clovertown based)!