I've asked this question to AMD, but they've suggested that I ask here instead...so here goes.
I have an AMD Epyc 7320P processor on an ASRock Rack ROMED8-2T motherboard, with 64GB of RAM (4x16GB DDR4-3200 ECC) and a 750W PSU. I have connected the extra power connector from the PSU to the MB. The O/S installed is Linux/Ubuntu 20.04.
The motherboard has 7 PCIe slots and in these slots are 7x Mellanox ConnectX-5 dual port network adapters. The cards are Gen3 x8. 6 cards have 25G SFPs and the remaining card has 10G SFPs installed, giving a total of 12x 25G network ports and 2x 10G network ports. The latest Mellanox OFED driver package is installed, and I have customised the kernel to remove Multicast (which was causing packet counting issues). IPv6 has also been turned off.
To test the performance of the machine I am using the Linux RDMA perftest package. Each network adapter port has been placed into its own network namespace (to force traffic onto the wire, or rather the fibre), and the ports are externally looped back in pairs: 1-2, 3-4, 5-6, etc. The test uses RDMA to transfer a block of data from RAM back to RAM.
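For anyone wanting to reproduce the namespace setup, here is a minimal sketch that generates the `ip` commands for one looped-back pair. The namespace names, interface names and addresses are placeholders I have made up for illustration, not the ones on my machine, and the sketch only prints the commands rather than running them.

```python
def loopback_pair_commands(ns_a, if_a, addr_a, ns_b, if_b, addr_b):
    """Return the `ip` commands that put two looped-back ports in separate namespaces."""
    cmds = []
    for ns, iface, addr in ((ns_a, if_a, addr_a), (ns_b, if_b, addr_b)):
        cmds += [
            # Create the namespace and move the port into it
            ["ip", "netns", "add", ns],
            ["ip", "link", "set", "dev", iface, "netns", ns],
            # Address the port and bring it up inside the namespace
            ["ip", "netns", "exec", ns, "ip", "addr", "add", addr, "dev", iface],
            ["ip", "netns", "exec", ns, "ip", "link", "set", "dev", iface, "up"],
        ]
    return cmds


if __name__ == "__main__":
    # Hypothetical names and addresses, for illustration only
    for cmd in loopback_pair_commands("ns1", "enp1s0f0", "10.0.1.1/24",
                                      "ns2", "enp1s0f1", "10.0.1.2/24"):
        print(" ".join(cmd))
```

With the two ports in different namespaces the kernel cannot short-circuit the traffic locally, so it has to leave one port and come back in on the other.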
Now, when running with 4 network cards installed (8 ports), I am able to achieve the full 25G throughput on all 8 ports in both directions. When I add 2 more 25G cards to the machine, throughput drops to around 22Gb/s, even though the new cards are simply installed and not being used. Once I include them in the performance test, the network throughput drops significantly: 4 network ports run at ~10Gb/s, 4 at ~11Gb/s, 2 at ~20Gb/s and 2 at ~23Gb/s. In all tests each network port maxes out a CPU core/thread. This is purely from polling for RDMA work completions; inserting a tiny sleep reduces the core/thread usage to around 2-4% without any impact on the network throughput.
I have gone through many performance tuning guides, particularly the ones from Mellanox. I've reviewed the BIOS settings. I have experimented with various Linux kernel command line parameters (such as disabling the vulnerability mitigations, which actually impacted performance) to no avail. I've also experimented with most of the Linux tunables and, as far as I can see, I've eked out as much performance as I can.
I have spoken to the motherboard manufacturer regarding a specific BIOS setting that appears to be missing: PCIe Max Payload Size. In my setup it is fixed at 512 bytes, but according to the PCIe specification it should be possible to set it as high as 4KB. The option is not present in the BIOS, and the motherboard manufacturer responded by saying that 'AMD have not exposed this option to BIOS writers.'
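In case it helps anyone check their own system, the payload size can be read from the `DevCap` and `DevCtl` lines of `lspci -vv` output. Below is a minimal sketch of a parser for that text. It assumes the usual `MaxPayload N bytes` wording and that the text passed in covers a single device; the function name is my own.

```python
import re


def parse_max_payload(lspci_vv_text):
    """Return (supported, configured) Max Payload Size in bytes, or None where absent.

    DevCap reports what the device supports; DevCtl reports what is programmed.
    """
    def first_match(section):
        # Non-greedy match: the first MaxPayload that follows the section label
        m = re.search(section + r":.*?MaxPayload (\d+) bytes",
                      lspci_vv_text, re.DOTALL)
        return int(m.group(1)) if m else None

    return first_match("DevCap"), first_match("DevCtl")


if __name__ == "__main__":
    # Illustrative excerpt in the style of `lspci -vv`, not captured from my machine
    sample = (
        "\t\tDevCap:\tMaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited\n"
        "\t\tDevCtl:\tCorrErr+ NonFatalErr+ FatalErr+ UnsupReq+\n"
        "\t\t\tMaxPayload 512 bytes, MaxReadReq 512 bytes\n"
    )
    print(parse_max_payload(sample))
```

The text for one card can be obtained with `sudo lspci -vv -s <bus:dev.fn>`; running it as root is needed for the capability fields to be shown.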
It seems that the CPU is unable to support all slots running at full speed. Each Gen3 x8 slot should be able to achieve a theoretical bandwidth of around 64Gb/s (8 GT/s x 8 lanes, before encoding and protocol overhead), which should be plenty to support 50Gb/s of network traffic per PCIe slot.
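As a rough check of the per-slot figure, here is a back-of-the-envelope sketch. The 128b/130b encoding is standard for Gen3, but the 24 bytes of per-packet overhead is a round number I have assumed for illustration, and the estimate ignores flow control, read requests and completions, so treat the results as upper bounds rather than measurements.

```python
def pcie_gen3_raw_gbps(lanes):
    """Raw Gen3 link rate in Gb/s: 8 GT/s per lane with 128b/130b encoding."""
    return 8.0 * lanes * 128 / 130


def tlp_efficiency(max_payload_bytes, overhead_bytes=24):
    """Fraction of link bytes that are payload when every packet is full-sized.

    overhead_bytes is an ASSUMED per-packet cost (header, framing, CRC);
    the real figure depends on header format and optional features.
    """
    return max_payload_bytes / (max_payload_bytes + overhead_bytes)


def usable_gbps(lanes, max_payload_bytes, overhead_bytes=24):
    """Estimated payload throughput in one direction, in Gb/s."""
    return pcie_gen3_raw_gbps(lanes) * tlp_efficiency(max_payload_bytes, overhead_bytes)


if __name__ == "__main__":
    needed = 2 * 25  # two 25G ports per card, in each direction
    for mps in (128, 256, 512, 4096):
        est = usable_gbps(8, mps)
        print(f"MPS {mps:>4} B: ~{est:.1f} Gb/s usable vs {needed} Gb/s needed")
```

Under these assumptions, going from a 512-byte to a 4KB payload changes the estimate by only a few percent, and even the 512-byte case leaves headroom above 50Gb/s per slot.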
The questions I have are:
1) What is the maximum PCIe throughput I can expect to achieve?
2) Is it correct that AMD have not exposed the Max Payload Size option?
3) How can I determine where the bottleneck is? Is it memory? Is it PCIe bandwidth? Is it something else? I have not found any tools to monitor memory bus/bandwidth utilisation or PCIe utilisation...
Any thoughts or ideas would be gratefully accepted.