Archives Discussions

Adept I

AMD Software Optimization Guide for Ryzen

The new Ryzen processors only recently came out, but it would be great to have some estimate of when the SOG (Software Optimization Guide) for the Ryzen family will be published. Given that the architecture is brand new, essentially from the ground up, I would imagine that many of the optimization techniques in the current manuals are no longer valid.

10 Replies
Adept I


I came here looking for low level details...

Adept I

A slide deck on the subject got leaked a while ago.  The executive summary, as far as I can remember it:

  • Don't use non-temporal accesses (unless you REALLY know what you're doing, and you probably don't).
  • Don't use manual prefetching.  The automatic prefetchers work better, and don't consume decode bandwidth or op-cache space.
  • Organise your data in memory so that the automatic prefetchers are maximally effective.  This may involve using structs-of-arrays instead of arrays-of-structs, or vice versa, depending on access patterns.
  • Minimise data movement between CCXes, as the bandwidth available between them is significantly less than within them.  This may involve careful choice of worker-thread count and affinity.
  • SMT is new to AMD, but works similarly to Intel's HT and has similar tradeoffs.  Ensure any thread affinity settings account for this.

Aside from the above, it is implied that Ryzen mostly responds well to code optimised for Intel CPUs.  If the older AMD-specific ISA extensions are avoided, code optimised for older AMD CPUs should also run well, as long as the above guidelines are also accounted for.

Interestingly, adjusting existing code for the above guidelines seems to have a small net positive effect on Intel CPUs as well.  This may obviate the need to have separate Intel and AMD code paths.

Agner Fog says he's nearly finished adding his analysis of Ryzen to his own famous optimisation manuals.  This will no doubt be very illuminating.

Nevertheless, an official optimisation guide would be better than relying on leaks and random forum posts.

I'm currently writing some test code (C/C++) to find out where the performance drawbacks of Ryzen (here a Ryzen 7 2700) are. Unfortunately I don't know to whom I should report such findings.

Here is a small example (loop bodies reconstructed to match the disassembly below; GETR/GETG/GETB extract one colour byte from a 32-bit pixel):

Loop #1, one byte load per colour component:

for(int i=0; i<bufferSize; ++i)
{
    ++histo_r[buffer[i*4 + 0]];
    ++histo_g[buffer[i*4 + 1]];
    ++histo_b[buffer[i*4 + 2]];
}

Loop #2, one 32-bit load per pixel, components extracted with shifts:

for(int i=0; i<bufferSize; ++i)
{
    auto p = ((const uint32_t*)buffer)[i];
    auto r = GETR(p);
    auto g = GETG(p);
    auto b = GETB(p);
    ++histo_r[r];
    ++histo_g[g];
    ++histo_b[b];
}
Both loops give nearly the same performance on Intel Skylake. On Ryzen the second loop is more than twice as fast. Going a few steps further, using more registers and unrolling the loop by 4, gives a 6x speedup compared with loop #1. This optimization does NOT work on Intel Skylake.


Did you check the disassembly? Can you please post the disassembly for all your test cases, for both Intel and AMD?
Also, is everything compiled with the same flags and compiler? If so, which one?



Thanks for your fast answer. I used the same executable on both processors: no recompilation, no flag changes, optimized for speed (/O2). I'm using MSVC 2017, but additionally LLVM 6.0. Both give nearly the same results.

Here is the disassembly for MSVC2017 for all 4 loops.

This is the slowest one, takes 23s (but just 6s on Skylake):

movzx eax, BYTE PTR [rcx]
inc DWORD PTR [rbx+rax*4]
movzx eax, BYTE PTR [rcx+1]
inc DWORD PTR [rsi+rax*4]
movzx eax, BYTE PTR [rcx+2]
inc DWORD PTR [r14+rax*4]

add rcx, 4
cmp rcx, rdx
jne SHORT $LL10@main

Speedup, 10s to go (but just 6s on Skylake):
mov edx, DWORD PTR [rdi]
movzx eax, dl
inc DWORD PTR [rbx+rax*4]
mov eax, edx
shr eax, 8
movzx ecx, al
inc DWORD PTR [rsi+rcx*4]
shr edx, 16
movzx eax, dl
inc DWORD PTR [r14+rax*4]

add rdi, 4
cmp rdi, r8
jne SHORT $LL10@main

Additional speedup, 6s to go (yeah we hit Skylake now):

mov edx, DWORD PTR [rax]
mov r10d, DWORD PTR [rax+4]
movzx edi, dl
mov ecx, edx
shr ecx, 8
movzx r8d, cl
shr edx, 16
movzx r9d, dl
movzx edx, r10b
mov ecx, r10d
shr ecx, 8
movzx r11d, cl
shr r10d, 16
movzx r10d, r10b
inc DWORD PTR [rbx+rdi*4]
inc DWORD PTR [rsi+r8*4]
inc DWORD PTR [r14+r9*4]
inc DWORD PTR [rbx+rdx*4]
inc DWORD PTR [rsi+r11*4]
inc DWORD PTR [r14+r10*4]

add rax, 8
cmp rax, r12
jne SHORT $LL10@main

And finally the fastest one, 4.2s (the 3GHz Ryzen 7 2700 is now 1.5x faster than the 3.4GHz Skylake):


mov ecx, DWORD PTR [r10-8]
mov edx, DWORD PTR [r10-4]
mov edi, DWORD PTR [r10]
mov r11d, DWORD PTR [r10+4]
movzx r8d, cl
mov eax, ecx
shr eax, 8
movzx r9d, al
shr ecx, 16
movzx r10d, cl
movzx ecx, dl
mov eax, edx
shr eax, 8
movzx ebx, al
shr edx, 16
movzx edx, dl
movzx esi, dil
mov eax, edi
shr eax, 8
movzx r14d, al
shr edi, 16
movzx edi, dil
movzx r15d, r11b
mov eax, r11d
shr eax, 8
movzx r12d, al
shr r11d, 16
movzx r11d, r11b
inc DWORD PTR [r13+r8*4]
mov r8, QWORD PTR histo_cpu_g$[rbp-169]
inc DWORD PTR [r8+r9*4]
mov r9, QWORD PTR histo_cpu_b$[rbp-169]
inc DWORD PTR [r9+r10*4]
inc DWORD PTR [r13+rcx*4]
inc DWORD PTR [r8+rbx*4]
inc DWORD PTR [r9+rdx*4]
inc DWORD PTR [r13+rsi*4]
inc DWORD PTR [r8+r14*4]
inc DWORD PTR [r9+rdi*4]
inc DWORD PTR [r13+r15*4]
inc DWORD PTR [r8+r12*4]
inc DWORD PTR [r9+r11*4]

mov r10, QWORD PTR tv2412[rsp]
add r10, 16
mov QWORD PTR tv2412[rsp], r10
lea rax, QWORD PTR [r10-8]
mov r11, QWORD PTR vData$[rbp-161]
cmp rax, r11
jne $LL10@main


With LLVM I also tested the 2nd loop with pragma-based unrolling (#pragma clang loop unroll_count(4)). The loop then also executes in 6s, but doesn't reach the performance of the manually unrolled 4th loop.


This is actually quite interesting.

The "slow" version involves the maximum number of memory accesses per pixel, since each colour component is loaded separately.  This is because the compiler can't assume that the histogram array and the pixel array don't overlap, and hence incrementing the histogram could have side-effects on the data read in subsequent lines of code.  The net effect is that the bottleneck is in the memory subsystem, which AMD hasn't optimised as aggressively as Intel has.  NB: Intel's extremely aggressive memory optimisations were the root cause of Spectre and Meltdown.

The next version significantly reduces the number of memory accesses in favour of some shifts and register moves.  This is clearly a win for AMD, since there are plenty of ALU resources to spare.  But also notice that it's a 56% reduction in running time from only a 33% reduction in memory accesses; clearly there's some particular pain with either "dependent reads" or "write-to-read hazards" that's being relieved here.

In the third version, the compiler has put two "independent reads" of consecutive pixels together, and this further reduces the hazards.

In the final version, we have four pixels being processed per loop.  The total number of memory accesses per pixel is actually larger than in the second version, because it's running short of registers and needs to reload the pointers to the pixel array and two of the histogram arrays.  However, there is now enough going on between the pixel reads and the histogram updates for Ryzen's wider (6-way versus 4-way) op-cache front-end and move-elimination engine to shine.

Yeah, when I read the papers before buying the Ryzen, I already guessed that Ryzen isn't as bad as all the comparisons show. I think all these algorithms are just not Zen-optimized.

So far, if I understood everything you wrote correctly, I imagine there won't be a compiler update that could improve performance that much.


Interesting indeed. I would be curious to see whether the restrict keyword would help in this case; that should guarantee to the compiler that those pointers are not aliasing.

Adept III

Some requests for documentation in the Processor support forum also have related information, including links to InstLatx64 instruction and memory latency tables, the Optimizing for Ryzen presentation, and performance counter changes.

Ryzen Platform Optimization

Performance Monitoring documentation for AMD Ryzen (Zen)?

Need BIOS and Kernel Developer’s Guide Family 17h