Hi guys, I am writing today because I had some surprising findings after optimizing some hot loops for my Ryzen 1700. The code in question is a deformer for Autodesk Maya. The deformer takes a polygonal geometry as input and, as an intermediate step of the computation, performs 20-40 iterations of Laplacian smoothing. That step takes roughly 80% of the whole computation, so that is where I focused my optimization.
In case you are wondering, Laplacian smoothing is a simple smooth that computes the position of a vertex by averaging the positions of its neighbors (see image below):
If I want to compute the smoothed position of the green vertex, I simply add up the positions of the circled neighbors and divide by 4. Pretty straightforward. The rest of the computation deals with the loss of volume.
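For anyone who prefers code to pictures, here is a minimal scalar sketch of the averaging step described above. It is not the deformer's actual code: `Vec3`, `laplacianSmoothOnce`, `laplacianSmooth` and the per-vertex neighbor lists are hypothetical names chosen for illustration, and the volume-preservation part is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex type, for illustration only.
struct Vec3 { float x, y, z; };

// One Laplacian smoothing pass: each vertex moves to the average position
// of its neighbors. neighbors[i] holds the indices of the vertices
// connected to vertex i.
std::vector<Vec3> laplacianSmoothOnce(
    const std::vector<Vec3>& positions,
    const std::vector<std::vector<std::size_t>>& neighbors)
{
    assert(positions.size() == neighbors.size());
    std::vector<Vec3> out(positions.size());
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const std::vector<std::size_t>& ring = neighbors[i];
        if (ring.empty()) {          // isolated vertex: leave it where it is
            out[i] = positions[i];
            continue;
        }
        Vec3 sum{0.0f, 0.0f, 0.0f};
        for (std::size_t n : ring) { // add up the neighbor positions
            sum.x += positions[n].x;
            sum.y += positions[n].y;
            sum.z += positions[n].z;
        }
        const float inv = 1.0f / static_cast<float>(ring.size());
        out[i] = Vec3{sum.x * inv, sum.y * inv, sum.z * inv};
    }
    return out;
}

// Repeat the pass a fixed number of times (the post mentions 20-40).
std::vector<Vec3> laplacianSmooth(
    std::vector<Vec3> positions,
    const std::vector<std::vector<std::size_t>>& neighbors,
    int iterations)
{
    for (int it = 0; it < iterations; ++it) {
        positions = laplacianSmoothOnce(positions, neighbors);
    }
    return positions;
}
```

Each pass reads from the old positions and writes to a new buffer, so the result does not depend on the order in which vertices are visited. The neighbor reads `positions[n]` are the scattered memory accesses that the rest of the post is about.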
The code is all AVX2 based and the data is in SoA layout. To fetch the data it uses gather instructions. The first implementation took 7.8 milliseconds on Ryzen and 5 milliseconds on the 4790K. At that point I was trying to explain a roughly 40% difference in timing with only a roughly 15% difference in core speed (3.7 vs 4.4 GHz). After some timings I realized the issue was my cache usage: it was too aggressive, since each gather could touch up to 8 cache lines at a time. As far as I know Ryzen has much smaller caches, and I figured that was why it was suffering. I changed the code back to an AoS data layout, still using AVX2. Now I get far fewer cache misses, and I got a massive boost of around 50% on Ryzen and a few percent on the Intel. After all those specific optimizations, and overclocking the Ryzen to 4.0 GHz, the final timings were the following:
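To make the cache-line argument concrete, here is a rough model that counts how many distinct 64-byte cache lines a set of vertex indices touches under each layout. It is only a sketch under stated assumptions: the SoA case uses three separate `float` arrays (one gather each for x, y and z), the AoS case assumes a hypothetical vertex padded to 16 bytes (x, y, z, pad), every array starts on a cache-line boundary, and all function names are made up for this example. It ignores cache capacity and the index table itself.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kCacheLineBytes = 64;

// Count the distinct cache lines touched when reading elemBytes bytes at
// each index of an array whose elements are strideBytes apart.
std::size_t linesTouched(const std::vector<std::size_t>& indices,
                         std::size_t strideBytes,
                         std::size_t elemBytes)
{
    assert(elemBytes > 0 && elemBytes <= strideBytes);
    std::set<std::size_t> lines;
    for (std::size_t i : indices) {
        const std::size_t first = (i * strideBytes) / kCacheLineBytes;
        const std::size_t last = (i * strideBytes + elemBytes - 1) / kCacheLineBytes;
        for (std::size_t l = first; l <= last; ++l) {
            lines.insert(l);
        }
    }
    return lines.size();
}

// SoA: x, y and z live in three separate float arrays, so the same index
// set is fetched three times from three different memory regions.
std::size_t linesTouchedSoA(const std::vector<std::size_t>& indices)
{
    return 3 * linesTouched(indices, sizeof(float), sizeof(float));
}

// AoS (assumed 16-byte padded vertex): one 16-byte read per index brings
// in x, y and z together.
std::size_t linesTouchedAoS(const std::vector<std::size_t>& indices)
{
    return linesTouched(indices, 16, 16);
}
```

With eight widely scattered indices such as {0, 100, 200, 300, 400, 500, 600, 700}, this model reports 24 lines for SoA and 8 for AoS. With eight consecutive indices it reports 3 for SoA and 2 for AoS, so the AoS advantage in this model comes from the random access pattern rather than from the layout alone.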
Ryzen 1700: 3.85 milliseconds
Intel 4790K: ~4 milliseconds
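Since the two chips run at different clocks, it can help to convert these timings into cycle counts. The helper below is just that arithmetic, with a hypothetical name, and it assumes each core held its stated clock for the whole run (4.0 GHz on the Ryzen, 4.4 GHz on the 4790K, no boost or throttling) and treats the Intel figure as exactly 4 ms.

```cpp
#include <cassert>

// Millions of core cycles spent on a run of the given duration.
// ms * GHz = (1e-3 s) * (1e9 cycles/s) = 1e6 cycles.
double megacycles(double milliseconds, double gigahertz)
{
    assert(milliseconds >= 0.0 && gigahertz > 0.0);
    return milliseconds * gigahertz;
}
```

Under those assumptions, `megacycles(3.85, 4.0)` gives about 15.4 million cycles for the Ryzen and `megacycles(4.0, 4.4)` about 17.6 million for the Intel, so the Ryzen finishes the same work in roughly 12.5% fewer cycles.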
According to this post: Ryzen's halved 256bit AVX2 throughput - AnandTech Forums
Ryzen issues two micro-ops for every 256-bit AVX2 instruction, executing it as 2x128 bits. That got me interested, so I went to check Agner Fog's instruction timings for both my Haswell and Ryzen architectures: http://www.agner.org/optimize/instruction_tables.pdf
Overall, the latencies and reciprocal throughputs on Ryzen are higher than on Haswell, which left me even more puzzled by the results I was seeing.
TLDR
I am trying to figure out how a core with a lower frequency, supposedly slower AVX2 instructions, and smaller caches can beat a processor with a higher frequency, faster instructions, and bigger caches.
Don't get me wrong, I am not here to bash the 1700. On the contrary, I am quite amazed. I am trying to figure out what other architectural factors might explain the timings I am seeing.
The code is single-threaded, so the Infinity Fabric should not have an influence. The memory access is quite random due to the nature of the algorithm, so in this specific case I am not sure the new prefetcher in Ryzen can do miracles. The Ryzen system has faster RAM, 2930 MHz compared to 1666 MHz on my Intel, but I also checked the performance with the RAM at 1333 MHz on Ryzen: before any software optimization, the bump in RAM speed gave me only around 400 microseconds of speedup, from 8200 to 7800 microseconds. Finally, I checked the generated assembly on both platforms and it was pretty much identical.
I would love to discuss this topic if anyone is interested. I can also provide the snippet of code, since it is fairly contained.