Does anyone know how to contact someone within AMD to see if AMD would be willing to donate a Ryzen-based machine to be part of the GCC compile farm?
I recently did some work on optimizing the matrix multiplication routines for gfortran. We found that some of the older AMD APUs did not handle AVX instructions well, so we had to disable the use of some of these instructions. I want to make sure we can include the Ryzen series chips in the optimizations, especially if they support 256-bit AVX and AVX-512 instructions. We cannot do much if we do not have a machine to test on. Most of us are volunteer developers, so it is not like we can just go and buy a new machine.
Any help/suggestions would be appreciated.
(PS: Yes, Fortran is still used for a whole bunch of advanced scientific and engineering computation. gfortran supports most features of the standards up to and including F77, F90, F95, F2003, F2008, and F2015 (the latest standard, not yet final).)
I am not an AMD employee, but it is great to hear that you are doing this development.
I am a big user of Fortran 90 for turbulence simulation with spectral methods.
This uses loads of matrix multiplications and AVX instructions in fp64.
Most scientific problems that involve solving partial differential equations are done in fp64,
and there are tens (if not hundreds) of billions' worth of legacy Fortran code around.
So for governments and academia, Fortran fp64 processing is a must.
A couple of years ago I tried my big Fortran code on an Intel chip with both
gfortran and ifort, and the ifort executable was 4.5 times faster. gfortran
seems in need of some work.
I am here because I use AMD GPUs with OpenCL as accelerators to great effect,
but I am limited by the front-end execution rates and the limited number of
PCIe lanes. So I am looking for a replacement CPU with decent Fortran-based fp64
rates, which at the moment appears to be an Intel monopoly.
I rely on the euler3D code benchmarks published on various sites (like Tom's Hardware)
as a guide, and at the moment Ryzen is not doing well in this. Please use this
code as well to guide you in your optimization. Code info is found in
I hope AMD responds favourably to your request.
Our latest MATMUL improvements are in the 7.0.1 experimental trunk. You should see a very significant speedup on large arrays. For what it is worth, we are inlining MATMUL for the smaller arrays, and you can adjust the size parameter that decides where inlining is done versus calling the intrinsic routines. As I recall, the crossover is set at about size 30.
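To illustrate the crossover idea, here is a minimal Python sketch. It is not the gfortran implementation: the names `INLINE_LIMIT`, `matmul_inline`, `matmul_library` and `matmul` are made up for the example, the limit of 30 is just the figure recalled above, and comparing the largest dimension against the limit is an arbitrary choice for the sketch. In gfortran itself, as far as I know, the corresponding knob is the `-finline-matmul-limit=n` option.

```python
# Sketch only: models "inline the small cases, call a routine for the large ones".
# INLINE_LIMIT and all function names are hypothetical, not gfortran internals.

INLINE_LIMIT = 30  # crossover size as recalled in the post above


def matmul_inline(a, b):
    """Plain triple loop, standing in for code the compiler would inline."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            aip = a[i][p]
            for j in range(m):
                c[i][j] += aip * b[p][j]
    return c


def matmul_library(a, b):
    """Stand-in for a call to a tuned library routine (same result here)."""
    cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matmul(a, b, limit=INLINE_LIMIT):
    """Return (result, path), where path names the routine that was chosen."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    largest = max(len(a), len(b), len(b[0]))
    if largest <= limit:
        return matmul_inline(a, b), "inline"
    return matmul_library(a, b), "library"


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul(a, b))           # small case takes the inline path
    print(matmul(a, b, limit=1))  # lowering the limit forces the library path
```

Both paths give the same numbers; only the route changes, which is the point of having a tunable crossover.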
On the AMD processors we had trouble with, one has to set AVX to use 128-bit vectors, which seemed to be a non-standard setting, so we did not do it. So even though 256-bit AVX was supported, the implementation was poor. I am sure this issue does not affect all AMD processors; it seemed to be limited to the earlier APU chips.
I started to investigate how to offload to a GPU with OpenCL but did not have time to explore it further. gcc/gfortran now supports OpenACC, which can do some GPU offloading, but I am not familiar enough with the workings of these things. As I understand it, it uses compiler directives similar in style to OpenMP's. OpenMP is well supported now in gcc and gfortran.
One advantage the Intel compilers have is well-developed computational libraries that work well with their front-end compilers. We did some tests with gfortran using the Intel libraries and found that gfortran was on par with Intel as far as the front-end compilers go. So, considering that one can use many different computational libraries, gfortran is quite viable as a tool. I would like to learn more about using GPUs to further improve gfortran.
Regarding AVX, I know there was a nasty bug in Piledriver which made 256-bit stores very slow. Piledriver was used in the Trinity APU which succeeded Llano, as well as in the (only just retired) Vishera CPU. Newer APUs use Steamroller or Excavator, both of which have that bug fixed.
Common to all AMD implementations of AVX so far is that each 256-bit instruction is split into a pair of ops, each directed to a 128-bit SIMD FPU. If both FPU pipelines happen to be available on the same cycle, you still get single-cycle throughput of AVX instructions, with the same latency as the equivalent SSE instructions. Otherwise, the two ops go through on successive cycles, either both through the same FPU pipeline or simply staggered through both. This is true of Piledriver, Steamroller, Excavator and Ryzen alike - though Ryzen has twice the FPU bandwidth per core, compared to its predecessors.
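As a rough picture of that split, here is a toy Python model. It is only a sketch under simplifying assumptions, not a description of any real scheduler: the function names and the idea of passing in "free 128-bit pipes per cycle" are invented for illustration.

```python
# Toy model only: a 256-bit AVX op is treated as two independent 128-bit halves.
# Function names and the pipe-availability input are illustrative assumptions.


def split_256(lanes):
    """Split four fp64 lanes (256 bits) into two 128-bit halves."""
    if len(lanes) != 4:
        raise ValueError("expected four fp64 lanes")
    return lanes[:2], lanes[2:]


def add_256_via_128(a, b):
    """Add two 256-bit vectors as two separate 128-bit adds."""
    (a_lo, a_hi), (b_lo, b_hi) = split_256(a), split_256(b)
    lo = [x + y for x, y in zip(a_lo, b_lo)]
    hi = [x + y for x, y in zip(a_hi, b_hi)]
    return lo + hi


def issue_cycles(free_pipes_per_cycle, n_ops):
    """Cycles needed to issue n_ops 256-bit ops, two halves each.

    free_pipes_per_cycle[i] is the number of 128-bit pipes free in cycle i;
    the last entry is reused for all later cycles.
    """
    if not free_pipes_per_cycle or free_pipes_per_cycle[-1] < 1:
        raise ValueError("need at least one free pipe in the steady state")
    halves_left = 2 * n_ops
    cycle = 0
    while halves_left > 0:
        idx = min(cycle, len(free_pipes_per_cycle) - 1)
        halves_left -= min(free_pipes_per_cycle[idx], halves_left)
        cycle += 1
    return cycle


if __name__ == "__main__":
    print(add_256_via_128([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
    print(issue_cycles([2], 4))  # both pipes free every cycle
    print(issue_cycles([1], 4))  # only one pipe free every cycle
```

With both pipes free every cycle, the model issues one 256-bit op per cycle; with only one pipe free, each op takes two cycles, which is the staggered case described above.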
The philosophy might be that on the one hand, highly intensive AVX workloads are uncommon, and on the other hand, such workloads are often better suited to GPUs (which AMD also makes, conveniently enough). Fortran falls awkwardly in the middle of that, unless someone finds a way to compile Fortran for a GPU target.