
    AMD Software Optimization Guide for Ryzen

    tagoo

      The new Ryzen processors only recently came out, but it would be great to have some estimate of when the SOG for the Ryzen family of processors will be available. Given that the architecture is brand new, essentially built from the ground up, I would imagine that many of the optimization techniques in the current manuals are no longer valid.

        • Re: AMD Software Optimization Guide for Ryzen
          escapeclause

          +1

           

          I came here looking for low level details...

          • Re: AMD Software Optimization Guide for Ryzen
            chromatix

            A slide deck on the subject got leaked a while ago.  The executive summary, as far as I can remember it:

             

            • Don't use non-temporal accesses (unless you REALLY know what you're doing, and you probably don't).
            • Don't use manual prefetching.  The automatic prefetchers work better, and don't consume decode bandwidth or op-cache space.
            • Organise your data in memory so that the automatic prefetchers are maximally effective.  This may involve using structs-of-arrays instead of arrays-of-structs, or vice versa, depending on access patterns (see the sketch after this list).
            • Minimise data movement between CCXes, as the bandwidth available between them is significantly less than within them.  This may involve careful choice of worker-thread count and affinity.
            • SMT is new to AMD, but works similarly to Intel's HT and has similar tradeoffs.  Ensure any thread affinity settings account for this.
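
            On the data-layout bullet above, here's a quick sketch of my own (not from the slide deck; the types and function are made up purely for illustration). When a hot loop touches only one field of many elements, storing that field contiguously gives the hardware prefetchers a simple sequential stream to follow:

            #include <vector>

            // Array-of-structs: x, y and mass are interleaved, so a loop that only
            // touches x strides through memory and wastes most of each cache line.
            struct ParticleAoS { float x, y, mass; };

            // Struct-of-arrays: each field is contiguous, so a loop over xs is a
            // plain sequential stream the automatic prefetchers handle well.
            struct ParticlesSoA
            {
                std::vector<float> xs, ys, masses;
            };

            void advance_x(ParticlesSoA& p, float vx, float dt)
            {
                for (float& x : p.xs)
                    x += vx * dt;   // one contiguous array: an ideal prefetch pattern
            }

            If, on the other hand, every field of an element is used together in the loop, the array-of-structs form keeps them in the same cache line, which is why the bullet says "or vice versa".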

             

            Aside from the above, it is implied that Ryzen mostly responds well to code optimised for Intel CPUs.  If the older AMD-specific ISA extensions are avoided, code optimised for older AMD CPUs should also run well, provided the above guidelines are accounted for.

             

            Interestingly, adjusting existing code for the above guidelines seems to have a small net positive effect on Intel CPUs as well.  This may obviate the need to have separate Intel and AMD code paths.

             

            Agner Fog says he's nearly finished adding his analysis of Ryzen to his own famous optimisation manuals.  This will no doubt be very illuminating.

             

            Nevertheless, an official optimisation guide would be better than relying on leaks and random forum posts.

              • Re: AMD Software Optimization Guide for Ryzen
                reneg

                I'm currently writing some test code (C/C++) to find out where the performance drawbacks of Ryzen (here a Ryzen 7 2700) are. Unfortunately I don't know whom I should report such findings to.

                Here is a small example:

                // Loop 1: index the histograms directly from the packed pixel
                for(int i=0; i<bufferSize; ++i)
                {
                  ++histo_r[GETR(buffer[i])];
                  ++histo_g[GETG(buffer[i])];
                  ++histo_b[GETB(buffer[i])];
                }

                // Loop 2: extract the channels into locals first, then update
                for(int i=0; i<bufferSize; ++i)
                {
                  auto r = GETR(buffer[i]);
                  auto g = GETG(buffer[i]);
                  auto b = GETB(buffer[i]);
                  ++histo_r[r];
                  ++histo_g[g];
                  ++histo_b[b];
                }

                 

                Both loops give nearly the same performance on Intel Skylake. On Ryzen the second loop is more than twice as fast. Going a few steps further, using more registers and unrolling the loop by 4, gives a 6x speedup compared with loop #1. This optimization does NOT work on Intel Skylake.
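
                For reference, a rough sketch of that manually unrolled variant (assuming GETR/GETG/GETB are simple shift-and-mask extractors on a 32-bit pixel, consistent with the disassembly below, and that bufferSize is a multiple of 4):

                // Assumed channel extractors (not shown in the code above):
                // #define GETR(p) ( (p)        & 0xFF)
                // #define GETG(p) (((p) >>  8) & 0xFF)
                // #define GETB(p) (((p) >> 16) & 0xFF)

                // Loop 4: four pixels per iteration, each channel held in its own
                // local so the histogram updates are independent of one another.
                for(int i=0; i<bufferSize; i+=4)
                {
                  auto r0 = GETR(buffer[i]),   g0 = GETG(buffer[i]),   b0 = GETB(buffer[i]);
                  auto r1 = GETR(buffer[i+1]), g1 = GETG(buffer[i+1]), b1 = GETB(buffer[i+1]);
                  auto r2 = GETR(buffer[i+2]), g2 = GETG(buffer[i+2]), b2 = GETB(buffer[i+2]);
                  auto r3 = GETR(buffer[i+3]), g3 = GETG(buffer[i+3]), b3 = GETB(buffer[i+3]);
                  ++histo_r[r0]; ++histo_g[g0]; ++histo_b[b0];
                  ++histo_r[r1]; ++histo_g[g1]; ++histo_b[b1];
                  ++histo_r[r2]; ++histo_g[g2]; ++histo_b[b2];
                  ++histo_r[r3]; ++histo_g[g3]; ++histo_b[b3];
                }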

                  • Re: AMD Software Optimization Guide for Ryzen
                    giordi91

                    Did you check the disassembly? Can you please post the disassembly for all your test cases, for both Intel and AMD?
                    Also, is everything compiled with the same flags and the same compiler? If so, which one?

                     

                    M.

                      • Re: AMD Software Optimization Guide for Ryzen
                        reneg

                        Thx for your fast answer. I've used the same executable on both processors: no recompilation, no flag changes, optimized for speed (/O2). I'm using MSVC 2017, and additionally LLVM 6.0; both give nearly the same results.

                         

                        Here is the disassembly for MSVC2017 for all 4 loops.

                        This is the slowest one, takes 23s (but just 6s on Skylake):

                        $LL10@main:
                        movzx eax, BYTE PTR [rcx]
                        inc DWORD PTR [rbx+rax*4]
                        movzx eax, BYTE PTR [rcx+1]
                        inc DWORD PTR [rsi+rax*4]
                        movzx eax, BYTE PTR [rcx+2]
                        inc DWORD PTR [r14+rax*4]

                        add rcx, 4
                        cmp rcx, rdx
                        jne SHORT $LL10@main


                        Speedup: down to 10s (still 6s on Skylake):
                        $LL10@main:
                        mov edx, DWORD PTR [rdi]
                        movzx eax, dl
                        inc DWORD PTR [rbx+rax*4]
                        mov eax, edx
                        shr eax, 8
                        movzx ecx, al
                        inc DWORD PTR [rsi+rcx*4]
                        shr edx, 16
                        movzx eax, dl
                        inc DWORD PTR [r14+rax*4]

                        add rdi, 4
                        cmp rdi, r8
                        jne SHORT $LL10@main


                        Additional speedup: down to 6s (we've caught up with Skylake now):
                        $LL10@main:

                        mov edx, DWORD PTR [rax]
                        mov r10d, DWORD PTR [rax+4]
                        movzx edi, dl
                        mov ecx, edx
                        shr ecx, 8
                        movzx r8d, cl
                        shr edx, 16
                        movzx r9d, dl
                        movzx edx, r10b
                        mov ecx, r10d
                        shr ecx, 8
                        movzx r11d, cl
                        shr r10d, 16
                        movzx r10d, r10b
                        inc DWORD PTR [rbx+rdi*4]
                        inc DWORD PTR [rsi+r8*4]
                        inc DWORD PTR [r14+r9*4]
                        inc DWORD PTR [rbx+rdx*4]
                        inc DWORD PTR [rsi+r11*4]
                        inc DWORD PTR [r14+r10*4]

                        add rax, 8
                        cmp rax, r12
                        jne SHORT $LL10@main

                         

                        And finally the fastest one, 4.2s (OMG, the 3 GHz Ryzen 7 2700 is now ~1.5x faster than the 3.4 GHz Skylake):

                        $LL10@main:

                        mov ecx, DWORD PTR [r10-8]
                        mov edx, DWORD PTR [r10-4]
                        mov edi, DWORD PTR [r10]
                        mov r11d, DWORD PTR [r10+4]
                        movzx r8d, cl
                        mov eax, ecx
                        shr eax, 8
                        movzx r9d, al
                        shr ecx, 16
                        movzx r10d, cl
                        movzx ecx, dl
                        mov eax, edx
                        shr eax, 8
                        movzx ebx, al
                        shr edx, 16
                        movzx edx, dl
                        movzx esi, dil
                        mov eax, edi
                        shr eax, 8
                        movzx r14d, al
                        shr edi, 16
                        movzx edi, dil
                        movzx r15d, r11b
                        mov eax, r11d
                        shr eax, 8
                        movzx r12d, al
                        shr r11d, 16
                        movzx r11d, r11b
                        inc DWORD PTR [r13+r8*4]
                        mov r8, QWORD PTR histo_cpu_g$[rbp-169]
                        inc DWORD PTR [r8+r9*4]
                        mov r9, QWORD PTR histo_cpu_b$[rbp-169]
                        inc DWORD PTR [r9+r10*4]
                        inc DWORD PTR [r13+rcx*4]
                        inc DWORD PTR [r8+rbx*4]
                        inc DWORD PTR [r9+rdx*4]
                        inc DWORD PTR [r13+rsi*4]
                        inc DWORD PTR [r8+r14*4]
                        inc DWORD PTR [r9+rdi*4]
                        inc DWORD PTR [r13+r15*4]
                        inc DWORD PTR [r8+r12*4]
                        inc DWORD PTR [r9+r11*4]

                        mov r10, QWORD PTR tv2412[rsp]
                        add r10, 16
                        mov QWORD PTR tv2412[rsp], r10
                        lea rax, QWORD PTR [r10-8]
                        mov r11, QWORD PTR vData$[rbp-161]
                        cmp rax, r11
                        jne $LL10@main

                          • Re: AMD Software Optimization Guide for Ryzen
                            reneg

                            With the 2nd loop and LLVM I also tested compiler-directed unrolling (#pragma clang loop unroll_count(4)). The loop then also executes in 6s, but doesn't reach the performance of the manually unrolled 4th loop.
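
                            For anyone reproducing this, a minimal sketch of how that looks applied to loop 2 (same assumptions about GETR/GETG/GETB as in the earlier posts; the pragma goes directly above the loop):

                            // Clang-only hint: ask the compiler to unroll this loop 4x.
                            #pragma clang loop unroll_count(4)
                            for(int i=0; i<bufferSize; ++i)
                            {
                              auto r = GETR(buffer[i]);
                              auto g = GETG(buffer[i]);
                              auto b = GETB(buffer[i]);
                              ++histo_r[r];
                              ++histo_g[g];
                              ++histo_b[b];
                            }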

                            • Re: AMD Software Optimization Guide for Ryzen
                              chromatix

                              This is actually quite interesting.

                               

                              The "slow" version involves the maximum number of memory accesses per pixel, since each colour component is loaded separately.  This is because the compiler can't assume that the histogram array and the pixel array don't overlap, and hence incrementing the histogram could have side-effects on the data read in subsequent lines of code.  The net effect is that the bottleneck is in the memory subsystem, which AMD hasn't optimised as aggressively as Intel has.  NB: Intel's extremely aggressive memory optimisations were the root cause of Spectre and Meltdown.
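
                              Incidentally, one way to hand the compiler that no-overlap guarantee without restructuring the loop is to pass the buffers through pointers marked __restrict (supported by MSVC, Clang and GCC), so it may keep buffer[i] in a register across the three increments. A minimal sketch, reusing the GETR/GETG/GETB macros from the posts above:

                              // __restrict promises the compiler these arrays never alias,
                              // so buffer[i] can be loaded once per iteration instead of
                              // being re-read after every histogram store.
                              void histogram(const unsigned* __restrict buffer, int bufferSize,
                                             unsigned* __restrict histo_r,
                                             unsigned* __restrict histo_g,
                                             unsigned* __restrict histo_b)
                              {
                                  for (int i = 0; i < bufferSize; ++i)
                                  {
                                      ++histo_r[GETR(buffer[i])];
                                      ++histo_g[GETG(buffer[i])];
                                      ++histo_b[GETB(buffer[i])];
                                  }
                              }

                              Whether a given compiler actually exploits that in this case would need to be confirmed in the disassembly.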

                               

                              The next version significantly reduces the number of memory accesses in favour of some shifts and register moves.  This is clearly a win for AMD, since there are plenty of ALU resources to spare.  But also notice that it's a 56% reduction in running time from only a 33% reduction in memory accesses; clearly there's some particular pain with either "dependent reads" or "write-to-read hazards" that's being relieved here.

                               

                              In the third version, the compiler has put two "independent reads" of consecutive pixels together, and this further reduces the hazards.

                               

                              In the final version, we have four pixels being processed per loop.  The total number of memory accesses per pixel is actually larger than in the second version, because it's running short of registers and needs to reload the pointers to the pixel array and two of the histogram arrays.  However, there is now enough going on between the pixel reads and the histogram updates for Ryzen's wider (6-way versus 4-way) op-cache front-end and move-elimination engine to shine.

                      • Re: AMD Software Optimization Guide for Ryzen
                        gc9

                        Some requests for documentation in the Processor support forum also have related information, including links to the InstLatx64 instruction and memory latency tables, the Optimizing for Ryzen presentation, and performance counter changes.

                        Ryzen Platform Optimization

                        Performance Monitoring documentation for AMD Ryzen (Zen)?

                        Need BIOS and Kernel Developer’s Guide Family 17h