Archives Discussions

giordi91 · ‎08-02-2017

Hey guys, I am writing today because I had some surprising findings after optimizing some hot loops for my Ryzen 1700. The code in question is a deformer for Autodesk Maya. The deformer takes a polygonal geometry as input and, as an intermediate step, performs 20-40 Laplacian smoothing iterations, which take roughly 80% of the whole computation, so that is where I focused my optimization.

In case you want to know, the Laplacian smooth is a simple smooth that computes the position of a vertex by averaging the positions of its neighbors (see image below):

If I want to compute the smoothed position of the green vertex, I simply add up the positions of the circled neighbors and divide by 4. Pretty straightforward; the rest of the computation deals with the loss of volume.
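For readers who want the averaging step spelled out, here is a minimal scalar sketch of one smoothing pass. It is not the deformer's actual code (that one is AVX2 and also handles volume loss); `Vec3`, `laplacianSmoothPass`, and the offset-based neighbor table are hypothetical names and layout chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// One Laplacian smoothing pass: every vertex moves to the average of its neighbors.
// Assumed layout: neighborStart has positions.size() + 1 entries, and the neighbors
// of vertex v are neighborIds[neighborStart[v]] .. neighborIds[neighborStart[v + 1] - 1].
std::vector<Vec3> laplacianSmoothPass(const std::vector<Vec3>& positions,
                                      const std::vector<int>& neighborStart,
                                      const std::vector<int>& neighborIds) {
    assert(neighborStart.size() == positions.size() + 1);
    std::vector<Vec3> out(positions.size());
    for (std::size_t v = 0; v < positions.size(); ++v) {
        const int begin = neighborStart[v];
        const int end = neighborStart[v + 1];
        if (begin == end) {          // isolated vertex: leave it where it is
            out[v] = positions[v];
            continue;
        }
        Vec3 sum{0.0f, 0.0f, 0.0f};
        for (int i = begin; i < end; ++i) {
            const Vec3& p = positions[neighborIds[i]];  // scattered read per neighbor
            sum.x += p.x;
            sum.y += p.y;
            sum.z += p.z;
        }
        const float inv = 1.0f / static_cast<float>(end - begin);
        out[v] = Vec3{sum.x * inv, sum.y * inv, sum.z * inv};
    }
    return out;
}
```

Running 20-40 iterations just means calling this repeatedly, feeding each output back in as the next input. The neighbor reads are the part that jumps around in memory.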

The code is all AVX2 based and the data is in SoA layout. To fetch the data it uses gather instructions. The first implementation took 7.8 milliseconds on the Ryzen and 5 milliseconds on the 4790K. At that point I was trying to reconcile the 40% difference with only a 15% difference in core speed (3.7 vs 4.4 GHz). After some timings I realized the issue was my cache usage: it was too aggressive, since doing several gathers meant touching 8 cache lines at a time. Ryzen has much smaller caches, and that is why it was suffering. I changed the code, still AVX2, to work on an AoS data layout. Now I get far fewer cache misses, and I got a massive boost of around 50% on Ryzen and a few percent on the Intel. After all those specific optimizations, and with the Ryzen overclocked to 4.0 GHz, the final timings were the following:

Ryzen 1700: 3.85 milliseconds

Intel 4790K: ~4 milliseconds
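To make the cache-line argument behind the SoA to AoS change a bit more concrete, here is a rough counting sketch. It is only a model, not the deformer code. It assumes 64-byte cache lines, arrays that start on a cache-line boundary, three separate float arrays for SoA, and a padded 16-byte xyzw vertex for AoS. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::size_t kCacheLine = 64;     // assumed cache line size in bytes
constexpr std::size_t kVertexStride = 16;  // assumed AoS stride: x, y, z plus padding

// SoA: x, y and z live in three separate float arrays, so fetching one vertex
// touches one line in each array. Counts distinct lines over all three arrays.
std::size_t linesTouchedSoA(const std::vector<std::uint32_t>& ids) {
    std::set<std::size_t> lines;
    for (std::uint32_t id : ids) {
        lines.insert(id * sizeof(float) / kCacheLine);
    }
    return 3 * lines.size();  // the same line pattern repeats in the x, y and z arrays
}

// AoS: the whole vertex sits in one 16-byte slot, so one line per vertex at most.
std::size_t linesTouchedAoS(const std::vector<std::uint32_t>& ids) {
    std::set<std::size_t> lines;
    for (std::uint32_t id : ids) {
        lines.insert(id * kVertexStride / kCacheLine);
    }
    return lines.size();
}
```

With eight widely scattered neighbor indices, each SoA gather lands on 8 different lines and three gathers (x, y, z) are needed, while the AoS version touches 8 lines in total for the same vertices. With contiguous indices the picture flips in favor of SoA, so the benefit depends on how random the neighbor access is.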

According to this post: Ryzen's halved 256bit AVX2 throughput - AnandTech Forums

Ryzen issues two micro-ops for every 256-bit AVX2 instruction, executing it as 2x128. That made me curious, so I went to check Agner Fog's instruction timings for both my Haswell and the Ryzen architecture: http://www.agner.org/optimize/instruction_tables.pdf

Overall, the latencies and reciprocal throughputs listed for Ryzen are higher than for Haswell, which left me even more puzzled by the results I was seeing.

TLDR

I am trying to figure out how a core with a lower frequency, supposedly slower AVX2 instructions, and smaller caches can beat a processor with a higher frequency, faster instructions, and bigger caches.

Don't get me wrong, I am not here to bash the 1700. On the contrary, I am quite amazed. I am trying to figure out what other architectural factors might explain the timings I am seeing.

The code is single threaded, so the Infinity Fabric should not be a factor. The memory access is quite random due to the nature of the algorithm, so in this specific case I am not sure the new prefetcher in Ryzen can do miracles. The Ryzen system has faster RAM, 2930 MHz compared to 1666 MHz on my Intel, but I also checked the performance with the RAM at 1333 MHz on the Ryzen: before any software optimization, the RAM speed bump gave me only around a 400 microsecond speedup, from 8200 to 7800 microseconds. Finally, I checked the generated assembly on both platforms and it was pretty much identical.

I would love to discuss this topic if anyone is interested. I can also provide the code snippet, since it is fairly self-contained.

bridgman · ‎07-05-2018

Sorry for the late response - I just saw your post now. Take all this with a grain of salt because I'm from the GPU side rather than the CPU side, but...

The quick answer is that the rumors floating around at Ryzen launch about halved floating point performance were wrong. The Haswell microarchitecture allows 2 256-bit floating point instructions to be executed per clock, while Ryzen allows 4 128-bit instructions per clock... so basically the same throughput. My impression was that Ryzen latency was actually lower than Haswell's (3 clocks vs 5 clocks for floating point), but remember that latency does not affect throughput, since the execution units are fully pipelined and can issue/retire one instruction per clock even if it takes three clocks to burp through the pipeline.
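The arithmetic behind that paragraph can be sketched as follows, using only the figures quoted above (2 x 256-bit vs 4 x 128-bit per clock, 3 vs 5 clocks of latency) and assuming a stream of fully independent instructions on an ideal pipeline. The helper names are made up for illustration, and real code with dependency chains will not reach these numbers.

```cpp
#include <cstdint>

// Peak 32-bit float lanes processed per clock for a given issue width and vector size.
constexpr int floatsPerClock(int instructionsPerClock, int vectorBits) {
    return instructionsPerClock * vectorBits / 32;
}

// Clocks needed to push n independent instructions through one fully pipelined unit:
// the first result appears after `latency` clocks, then one more result every clock.
constexpr std::int64_t pipelinedClocks(std::int64_t n, int latency) {
    return n == 0 ? 0 : latency + (n - 1);
}

// 2 x 256-bit and 4 x 128-bit both come out at 16 floats per clock.
static_assert(floatsPerClock(2, 256) == floatsPerClock(4, 128),
              "same peak throughput under the quoted figures");
```

For 1000 independent instructions, a 3-clock latency gives 1002 clocks and a 5-clock latency gives 1004, which is why latency barely matters once the pipeline is kept full.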

Anyways, the one place where I would expect Ryzen to lag a bit behind Haswell is when you make heavy use of MAC/FMA-type instructions. IIRC that is the one place where your 4790K has more execution hardware than Ryzen. I believe Ryzen is provisioned with separate multiply and add execution units (2 of each), which need to be combined for a MAC instruction, while Haswell has 2 execution units that are both capable of MAC/FMA instructions.

This is pretty much hitting the limits of my CPU knowledge but hopefully will be useful.


How can my Ryzen 1700 be faster than an Intel 4790K in single thread work