1950X multi-threaded performance
Hi,
I have a performance problem with a multi-threaded (CPU- and memory-intensive, not I/O-intensive) program on a 1950X (ASUS PRIME X399-A motherboard, Corsair Vengeance LPX DDR4 4x16GB@3000 MHz memory): the performance drops by 50% when going from one to four threads. After excluding semaphore locks and the like as the cause of the problem, I ran the same program on an Intel i7-7820HQ (Dell Precision 7520 motherboard, DDR4 4x16GB@2400MHz memory), where the performance drops by only 10%. The OS is Ubuntu 18.04, the kernel version is 4.15.0-38-generic, and the GCC version is 7.3.0.
Any ideas what could be causing this difference/how I can improve the multi-threaded performance on the 1950X?
Thanks,
Gijsbert
What program?
A C program developed by me that plays international draughts. The alpha-beta search (CPU intensive and memory intensive due to the hash tables) is multi-threaded: threads publish work (nodes to be searched) to which other threads can subscribe.
Gijsbert
Gijsbert, this might at least be interesting: Level1. It does have some interesting tools. Enjoy, John.
Hi,
I have done some more research into this. I have used different profilers (gprof and my own high-resolution code profiler), different compilers (gcc and aocc/clang), and different algorithms (no hash tables, semaphores replaced by crc32-protected memcpy), but the results are the same: the self-time of all functions (including simple functions that do not call other functions and are not often invoked) slows down by an average factor of 0.6 when going from 1 to 2 threads and by an average factor of 0.4 when going from 1 to 4 threads. The only thing I notice at the operating system level is that when I execute 1/2/4 threads, 'cat /proc/cpuinfo' shows 2/4/8 CPUs going from 2.1 to 3.7 GHz, whereas you would perhaps expect 1/2/4 CPUs. 'htop' shows the expected 1/2/4 threads and the corresponding 100/200/400% CPU usage.
Any suggestions?
Regards,
Gijsbert
Hi,
I have now run a couple of sysbench (version 1.0.11) benchmarks. 'sysbench --test=memory --num-threads=N run' shows that 'MiB transferred/sec' decreases from 5564 to 3024 to 2154 for 1/2/4 threads on my 1950X system, but increases from 5944 to 7010 to 9272 for 1/2/4 threads on my i7-7820HQ system.
How does this test scale on your 1950X system?
Regards,
Gijsbert
gwiesenekker, is all your testing on Linux? Can you suggest a similar Windows 10 test? Those upside-down results versus your i7 above may interest AMD. Enjoy, John.
Hi,
A comment in the memory benchmark from SiSoftware Sandra, 'We finally discover an issue – TR (just like Ryzen) memory latencies (in-page, random access pattern) are huge – almost 3x higher than Intel’s.', allowed me to find the root cause: you have to set the thread affinity on the 1950X! My first attempt (associating thread 0 with CPU 0, thread 1 with CPU 1, etc.) already greatly improved the multi-threaded performance of my program.
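For readers who want to try the same thing, here is a minimal sketch of per-thread pinning on Linux with glibc. It is not the code from the draughts program: the helper names `nth_allowed_cpu` and `pin_self_to_cpu` are hypothetical, and mapping thread i to the i-th CPU the process is allowed to use (rather than literally CPU i) is an assumption made so the sketch also works when some CPUs are off limits.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes the CPU_* macros and sched_setaffinity in glibc */
#endif
#include <sched.h>

/* Hypothetical helper: return the id of the slot-th CPU contained in
 * `allowed`, wrapping around when slot exceeds the number of CPUs in the set.
 * Returns -1 if the set is empty or slot is negative. */
int nth_allowed_cpu(const cpu_set_t *allowed, int slot)
{
    int n = CPU_COUNT(allowed);
    if (n <= 0 || slot < 0)
        return -1;

    int target = slot % n;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        /* target is only decremented for CPUs that are in the set */
        if (CPU_ISSET(cpu, allowed) && target-- == 0)
            return cpu;
    }
    return -1;
}

/* Hypothetical helper: pin the calling thread to a single CPU.
 * On Linux, pid 0 in sched_setaffinity means "the calling thread".
 * Returns 0 on success, -1 on error (errno is set by the kernel call). */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

One way to use it: the main thread calls `sched_getaffinity(0, sizeof(allowed), &allowed)` once before starting the workers, and worker i begins with `pin_self_to_cpu(nth_allowed_cpu(&allowed, i))`. `pthread_setaffinity_np` offers the same control when pinning a thread from outside via its `pthread_t` handle.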
Regards,
Gijsbert
FYI, here are 'sysbench --test=memory --num-threads=N run' results without and with setting the thread affinity. They speak for themselves:
$ sysbench --threads=1 --test=memory run | grep -i mib/sec
62991.16 MiB transferred (6297.75 MiB/sec)
$ sysbench --threads=2 --test=memory run | grep -i mib/sec
31019.36 MiB transferred (3101.29 MiB/sec)
$ taskset 0x3 sysbench --threads=1 --test=memory run | grep -i mib/sec
61560.52 MiB transferred (6154.75 MiB/sec)
$ taskset 0x3 sysbench --threads=2 --test=memory run | grep -i mib/sec
102400.00 MiB transferred (10305.26 MiB/sec)
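In the commands above, `taskset 0x3` restricts sysbench to CPUs 0 and 1 (bits 0 and 1 of the mask). A program that cannot be launched under taskset could apply roughly the same restriction itself; the following is a sketch under that assumption for Linux with glibc, and the names `mask_to_cpu_set` and `restrict_to_mask` are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* exposes the CPU_* macros and sched_setaffinity in glibc */
#endif
#include <limits.h>
#include <sched.h>

/* Hypothetical helper: convert a taskset-style bitmask (bit i set means
 * CPU i is allowed) into a cpu_set_t. */
void mask_to_cpu_set(unsigned long mask, cpu_set_t *set)
{
    CPU_ZERO(set);
    for (int cpu = 0; cpu < (int)(sizeof(mask) * CHAR_BIT); cpu++) {
        if (mask & (1UL << cpu))
            CPU_SET(cpu, set);
    }
}

/* Hypothetical helper: restrict the calling thread to the CPUs in `mask`.
 * Threads created afterwards inherit the mask from their creator, so
 * calling this early in main() approximates `taskset <mask> ./program`.
 * Returns 0 on success, -1 on error (errno is set by the kernel call). */
int restrict_to_mask(unsigned long mask)
{
    cpu_set_t set;
    mask_to_cpu_set(mask, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Unlike pinning each thread to its own CPU, this leaves the scheduler free to move threads between the CPUs in the mask, which is what the taskset runs above do.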
Regards,
Gijsbert
Thanks much, gwiesenekker. I will ask again: do you know of a Windows test that will expose this? Thanks and enjoy, John.