Hi mbaker_amd,
Thanks for the reply. I intentionally kept the original post short because I didn't want to bore people, so here is a bit more information. Our application is a Java 8 app that doesn't do anything particularly aggressive, other than running a lot of threads in one process. That isn't ideal, but it's what we have to work with.
We also have a 3rd-party process that consumes lots of data over a 25G Solarflare card using Onload. Their process is written in C++ and consumes 15 cores (pinned). We pin another 4 threads that act as clients to this process: two execute Java code and the other two execute C++ code.
We have two sets of metrics that we look at. The first set is provided by the vendor and covers various parts of their pipeline. There, P50 is about 10% worse than on our Intel machine, while P99 can be up to 2x worse.
Finally, we have our own metrics, which attempt to show end-to-end transaction time in our Java code. Here we're seeing times that are 10-20% worse at both P50 and P99.
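For anyone who wants to compare numbers the same way, here is a rough sketch of how the P50/P99 ratios between two machines could be computed from raw latency samples. It assumes a simple nearest-rank percentile, which may not match how the vendor's tooling or our own metrics calculate theirs. The function names and the sample values are invented for illustration and are not our real data.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compare(baseline, candidate, percentiles=(50, 99)):
    """Return {p: candidate / baseline} per percentile; above 1.0 means the candidate is slower."""
    return {p: percentile(candidate, p) / percentile(baseline, p) for p in percentiles}


# Hypothetical latencies in microseconds, made up purely for illustration.
intel_us = [100] * 98 + [150, 200]
amd_us = [110] * 98 + [300, 400]

ratios = compare(intel_us, amd_us)
for p, r in sorted(ratios.items()):
    print(f"P{p}: AMD/Intel = {r:.2f}x")
```

Looking at the ratio per percentile rather than an average makes it easier to see when the tail degrades much more than the median, which is the pattern described above.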
I've run various benchmarks and am getting conflicting results. The following tests show our Intel machine outpacing our AMD machine:
perf bench sched/messaging
perf bench memcpy movsq/movsb
perf bench memset movsq/movsb
perf bench futex/requeue
perf bench futex/lock-pi
sysbench memory --memory-access-mode=rnd --threads=64
sysbench mutex --mutex-num=1 --threads=512
Other tests, like perf bench numa and the default/unrolled versions of memcpy/memset, show the AMD machine being much faster.
I installed the Phoronix Test Suite on each machine last night and ran the stress-ng set of tests. The only test on which the Intel machine did appreciably better was the forking test.
I will look at those guides you mentioned and see if we can't find the set of parameters that helps us make our AMD machine what we had hoped it would be when we bought it. Thanks again for the look.