Using C code, I mmap a BAR on a PCIe Gen2 x4 endpoint and write to and read from it. Each access is typically only a few bytes.
The endpoint enumerates identically on a comparable Intel system (also Gen2 x4), but read latency on the AMD is far worse: on the Intel a read completes within 290 ns, while on the AMD it takes over 15 us, roughly 50 times longer. Writes are also about 2x slower. We believe this is because the AMD splits the PCIe transactions into more TLPs, though we do not know why it does so.
This is consistent across multiple processors in the AMD EPYC 3xxx lineup. It looks as if the AMD processor waits for more data before issuing the command, then gives up after 15+ us and simply sends the PCIe packets.
During CPU PCIe reads, a PCIe analyzer shows that the bus is idle for this 15+ us period, and we even see a SKPR set being inserted, which doesn't happen on the Intel because the bus is not idle there. One core is pegged at 100% while the test runs, and the system is otherwise idle when the code is not running, so it's not a loading issue.
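
For anyone trying to reproduce this, here is a minimal sketch of the kind of access and timing loop described above. It is not the exact test code. It assumes Linux and that the BAR is reachable through a memory-mappable file such as the sysfs `resource0` file of the device. The helper names `map_region` and `avg_read_ns` are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of `path` read/write and shared.
 * `path` is assumed to be a mappable file, e.g. a sysfs BAR resource file.
 * Returns NULL on failure. */
static volatile uint32_t *map_region(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Hypothetical helper: average nanoseconds per 32-bit read of reg[0],
 * measured over `iters` back-to-back reads. */
static double avg_read_ns(volatile uint32_t *reg, unsigned iters)
{
    struct timespec t0, t1;
    uint32_t sink = 0;

    assert(iters > 0); /* avoid dividing by zero below */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 0; i < iters; i++)
        sink ^= reg[0]; /* volatile access: one load per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

A caller would do something like `map_region("/sys/bus/pci/devices/0000:01:00.0/resource0", 4096)` (the device address and length here are placeholders, not the real ones) and then call `avg_read_ns(reg, 100000)`. Because each read of an uncached BAR mapping should stall the core until the completion returns, the average is expected to approximate the round-trip read latency, but the analyzer trace remains the authoritative measurement.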