Archives Discussions

godsic · ‎04-03-2009

AMD?

Hi, I am seeing no further performance boost when using OpenMP in my code!

The first time I wrote it, I noticed a 2x performance boost on a Phenom 9950 under Windows Vista x64! After that I modified my code a lot (added SSE3 support in critical sections, FFT for example, and more parallelization), but when I test it I still see the same 2x boost as before! I develop the software on my laptop with an AMD Athlon X2 QL-62, and there I see a performance boost with each version! So why does this happen? Probably it is poor memory-cache organization! I use all the special techniques like data alignment and compact packing! So why not implement a SERIAL CONNECTION TO MEMORY IN YOUR CPU? You could collaborate with some memory manufacturers to develop, for example, a memory module with 4 or more serial channels (it could be faster than today's parallel interface!), so that each core can access memory at the same time!

WHY NOT???

stroia · ‎04-03-2009

Hi Godsic,

It sounds like you are doing some interesting optimization work. There is quite a bit of ongoing activity in this area; the citations and book links below may help you compare your design and implementation with other designs that have been implemented on our hardware.

It sounds, however, that your main concern is the design of our memory interface, and that you are memory bound.
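One way to check whether a loop is memory bound is to compare its throughput against a simple bandwidth kernel on the same machine. The following is a minimal sketch in the style of the STREAM triad, not code from this thread; the names triad_seconds and triad_gbs are made up for illustration, and the byte count assumes two reads and one write per element, which ignores write-allocate traffic.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds; falls back to clock() when built without OpenMP. */
double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Triad kernel a[i] = b[i] + s * c[i]; returns the elapsed seconds.
   The pragma is ignored when the compiler is run without OpenMP support. */
double triad_seconds(double *a, const double *b, const double *c,
                     size_t n, double s)
{
    assert(a != NULL && b != NULL && c != NULL);
    double t0 = now_seconds();
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = b[i] + s * c[i];
    return now_seconds() - t0;
}

/* Approximate bandwidth in GB/s, assuming 3 * 8 bytes moved per element
   (an assumption: real traffic can be higher because of write-allocate). */
double triad_gbs(size_t n, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return 3.0 * sizeof(double) * (double)n / seconds / 1e9;
}
```

Run it with arrays much larger than the last-level cache and with 1, 2, and 4 threads. If the reported GB/s stops growing after two threads, a loop that streams similar amounts of data is unlikely to scale further either.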

Some of the papers below may help you understand some of the tradeoffs you’ve made in blocking and packing of your data structures. You may also want to consult AMD’s Family 10h BKDG (BIOS and Kernel Developer's Guide) for the specifics of our memory hierarchy.

Regarding the implementation of our memory hierarchy: we do not currently support FB-DIMMs, for many business and customer-related reasons, but it is on our roadmap.

We appreciate your concern, and thank you for your input.

Regards,

Sharon

http://www.ece.cmu.edu/~franzf/papers/sc06.pdf

http://people.sc.fsu.edu/~burkardt/f_src/fft_open_mp/fft_open_mp.html

OpenMP Shared Memory Parallel Programming

By Michael J. Voss

godsic · ‎04-05-2009

I also follow all of AMD's recommended cache-efficiency optimizations and techniques, but I cannot get a boost of more than 2x! Also, when my code was tested on an Intel E8600 (only two cores), I saw a 4.5x performance boost compared to the first version! How can that be? My code is memory intensive, and Core 2 has an off-chip memory controller with only one 128-bit channel! I tried both unganged and ganged modes on the Phenom, but got only 2x and that's all! It is very strange! Basically I hate Intel processors and decided to buy a Phenom II 940 instead of an i7 920! I hope AMD fixed some memory-cache transfer issues in the second generation of Phenom! But I strongly recommend that AMD use memory with a multichannel serial interface! Only one thing makes me happy: my code runs faster on the Phenom 9950 (@ 3 GHz) than on one node of an SGI Altix (Xeon Clovertown based)!

avk · ‎04-07-2009

Debugging, testing, profiling. There is no other way to achieve maximum performance.

godsic · ‎04-07-2009

How do you profile an application when the loop is pure SSE2/SSE3 code parallelized with OpenMP? All data is aligned to 16 bytes and packed for efficient cache use!

This is a case where performance depends entirely on the CPU architecture!
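Even without a profiler, a hand-tuned loop can be timed at different thread counts to see where scaling stops. The sketch below is only an illustration under assumptions: kernel_fn, time_kernel, speedup, and the scale_kernel example are hypothetical names, and the real SSE loop would take the place of scale_kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

typedef void (*kernel_fn)(void *ctx);   /* hypothetical wrapper for the hot loop */

/* Wall-clock seconds; falls back to clock() when built without OpenMP. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Best-of-reps time of kernel(ctx) with the requested OpenMP thread count. */
double time_kernel(kernel_fn kernel, void *ctx, int threads, int reps)
{
    assert(kernel != NULL && threads > 0 && reps > 0);
#ifdef _OPENMP
    omp_set_num_threads(threads);
#else
    (void)threads;                      /* serial build: thread count is ignored */
#endif
    double best = -1.0;
    for (int r = 0; r < reps; ++r) {
        double t0 = wall_seconds();
        kernel(ctx);
        double dt = wall_seconds() - t0;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}

/* Speedup of an n-thread run over the 1-thread run. */
double speedup(double t1, double tn)
{
    return tn > 0.0 ? t1 / tn : 0.0;
}

/* Stand-in kernel: scales an array in place. Replace with the real loop. */
struct scale_ctx { float *x; size_t n; float s; };

void scale_kernel(void *ctx)
{
    struct scale_ctx *k = ctx;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)k->n; ++i)
        k->x[i] *= k->s;
}
```

Calling time_kernel with 1, 2, and 4 threads and passing the results to speedup gives a scaling curve. A curve that flattens at 2x while the per-thread work is pure arithmetic on streamed data points toward memory bandwidth rather than the SSE code itself.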

jack2009 · ‎04-19-2009

Among other things, you need to understand weak memory models.

Hereby incorporating by reference Brad Abrams' discussion of volatile and MemoryBarrier(). In particular, Vance Morrison's discussion of memory models is important reading.
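To make the memory-model point concrete, here is a minimal sketch of the usual publish/consume pattern written with C11 atomics. This is an assumption about language and API, since the thread does not say which one is in use, and the names publish and try_consume are hypothetical. The release store and acquire load play the role that an explicit barrier plays in the discussions referenced above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* ordinary data, written before publication */
static atomic_int ready;   /* flag; static storage starts at zero */

/* Producer side: write the data first, then set the flag with release
   ordering so the data write cannot be reordered after the flag store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer side: returns 1 and fills *out once the flag is visible,
   otherwise returns 0. The acquire load pairs with the release store,
   so a consumer that sees the flag also sees the payload. */
int try_consume(int *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

In a real program publish and try_consume would run on different threads, with the consumer polling or waiting until the flag is set. Using a plain int for the flag instead would be a data race in C11, and the compiler or a weakly ordered CPU would be free to reorder the two writes.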

Edit: Removed advertising from post.
