I came here looking for low level details...
A slide deck on the subject got leaked a while ago. The executive summary, as far as I can remember it:
- Don't use non-temporal accesses (unless you REALLY know what you're doing, and you probably don't).
- Don't use manual prefetching. The automatic prefetchers work better, and don't consume decode bandwidth or op-cache space.
- Organise your data in memory so that the automatic prefetchers are maximally effective. This may involve using structs-of-arrays instead of arrays-of-structs, or vice versa, depending on access patterns.
- Minimise data movement between CCXes, as the bandwidth available between them is significantly less than within them. This may involve careful choice of worker-thread count and affinity.
- SMT is new to AMD, but works similarly to Intel's HT and has similar tradeoffs. Ensure any thread affinity settings account for this.
Aside from the above, the deck implied that Ryzen mostly responds well to code optimised for Intel CPUs. Code optimised for older AMD CPUs should also run well, provided the older AMD-specific ISA extensions (e.g. XOP and FMA4, which Zen dropped) are avoided and the above guidelines are followed.
Interestingly, adjusting existing code for the above guidelines seems to have a small net positive effect on Intel CPUs as well. This may obviate the need to have separate Intel and AMD code paths.
Agner Fog says he's nearly finished adding his analysis of Ryzen to his own famous optimisation manuals. This will no doubt be very illuminating.
Nevertheless, an official optimisation guide would be better than relying on leaks and random forum posts.
Some requests for documentation in the Processor support forum also have related information, including links to InstLatx64 instruction and memory latency tables, the Optimizing for Ryzen presentation, and performance counter changes.