I am trying to use cache-aware Roofline models for Zen2 and, if possible, Zen3 processors. Roofline curves provide great guidance on the quality of parallelization of shared-memory codes. They can reveal how close the performance of a particular parallelization is to the hardware limits and which direction to follow in order to improve it further (e.g. cache blocking, SIMD instructions, etc.).
Does AMD offer system-level Roofline curves for their models? The information needed per core is:
single- and double-precision scalar and vector arithmetic limits (e.g. add or FMA) at max clock
scalar and vector read (or read+write) BW for L1, L2, L3 and DRAM
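For illustration, the per-core arithmetic limit in the list above can be estimated from the clock, the SIMD width and the number of arithmetic pipes. This is a minimal sketch: the function name `peak_gflops_per_core` is hypothetical, and the clock, vector width and pipe count used below are placeholder values, not AMD-published figures; they should be replaced with numbers from AMD's documentation for the model in question.

```python
def peak_gflops_per_core(clock_ghz, simd_bits, element_bits, pipes, fma=True):
    """Theoretical peak GFLOP/s of one core.

    clock_ghz    : sustained clock in GHz (assumption: supplied by the user)
    simd_bits    : vector register width in bits; set equal to element_bits
                   to model scalar instructions
    element_bits : 32 for single precision, 64 for double precision
    pipes        : number of pipes that can issue the instruction each cycle
    fma          : an FMA counts as 2 FLOPs per lane, an add as 1
    """
    lanes = simd_bits // element_bits
    flops_per_lane = 2 if fma else 1
    return clock_ghz * lanes * pipes * flops_per_lane


if __name__ == "__main__":
    # Placeholder inputs: 3.0 GHz, 256-bit vectors, 2 pipes (all assumed).
    for name, simd, elem in [("DP vector", 256, 64), ("SP vector", 256, 32),
                             ("DP scalar", 64, 64), ("SP scalar", 32, 32)]:
        gflops = peak_gflops_per_core(3.0, simd, elem, pipes=2)
        print(f"{name} FMA: {gflops:.1f} GFLOP/s")
```

These are upper bounds only; measured limits at the actual sustained clock are what the Roofline should ultimately use.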
Are the cache hierarchy BW figures published somewhere, or is there an AMD binary that we can use to gauge them?
Ideally, for a particular model I care about (e.g. 7542), I would gauge the L1, L2, L3 and DRAM BW and the vector/scalar arithmetic limits per core or per n cores, and then generate the Roofline curves myself.
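Once those figures are measured, generating the curves by hand is straightforward: in a cache-aware Roofline, each memory level contributes its own roof, and the attainable performance at arithmetic intensity AI (FLOPs per byte) is min(peak, AI x BW). The sketch below assumes this standard formulation; the function names are hypothetical, and the peak and bandwidth numbers are placeholders, not measurements of any AMD part.

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline bound for one memory level: min(compute roof, AI * bandwidth)."""
    return min(peak_gflops, ai * bandwidth_gbs)


def cache_aware_roofline(ai_values, peak_gflops, bandwidths_gbs):
    """Return {level: [bound at each AI]} with one roof per memory level."""
    return {
        level: [attainable_gflops(ai, peak_gflops, bw) for ai in ai_values]
        for level, bw in bandwidths_gbs.items()
    }


def ridge_point(peak_gflops, bandwidth_gbs):
    """AI (FLOPs/byte) above which a level stops being bandwidth bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    peak = 48.0  # placeholder per-core peak, GFLOP/s
    bw = {"L1": 200.0, "L2": 100.0, "L3": 50.0, "DRAM": 20.0}  # placeholder GB/s
    ai_values = [2.0 ** k for k in range(-4, 5)]  # 1/16 ... 16 FLOPs/byte
    for level, roof in cache_aware_roofline(ai_values, peak, bw).items():
        print(level, f"ridge at AI={ridge_point(peak, bw[level]):.2f}", roof)
```

Plotting each list against `ai_values` on log-log axes gives the familiar set of slanted bandwidth roofs capped by the flat compute roof.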
It would be great if AMD uProf generated the Roofline curves for the particular system on which the code profiling takes place.
At the very minimum, validated AMD binaries should be able to gauge and report the L1, L2, L3 and DRAM BW and the vector/scalar arithmetic limits per core or per n cores for the system we are interested in. Are any of these cache hierarchy BW codes available to us?
Hi @drmichaelt7777 ,
Thank you for your suggestion regarding Roofline model support. We plan to add this in a future release.
For L1/L2/LLC/DRAM analysis, you can try the AMDuProfPcm tool, which comes with uProf. More details about AMDuProfPcm can be found in the User Guide document.
Thanks, Swarup, that would be a much-needed piece of information that we can use to optimize our stencil codes on AMD processors. Our workflows rely heavily on stencils, and we deploy codes that burn 100s of 1000s of CPU cycles.
In the meantime, can AMD furnish low-level codes to measure BW per core for the L1, L2 and L3 cache levels? We are analyzing the performance of different AMD models on-prem and in the cloud, and having codes that accurately gauge BW at all levels of the cache hierarchy will allow us to automate this process and apply autotuning to our codes.
We do have NDAs with AMD. I hope that communicating directly with AMD engineers can kick-start the process.
Can we access these cache BW micro-benchmark codes?
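To make the request concrete, the kind of measurement meant here is a working-set sweep: copy a buffer repeatedly, grow the buffer size, and watch the throughput drop as the data falls out of L1, then L2, then L3. The following is only a rough sketch of that idea using the Python standard library (the function names are hypothetical); interpreter overhead distorts the small sizes, so it is no substitute for a validated, vectorized micro-benchmark.

```python
import time


def copy_bandwidth_gbs(size_bytes, min_seconds=0.05):
    """Estimate copy throughput in GB/s (read + write) for one buffer size.

    The working set is 2 * size_bytes (source plus destination).
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    dst[:] = src  # warm-up pass so both buffers are touched and cached
    reps = 0
    start = time.perf_counter()
    while True:
        dst[:] = src  # same-length slice assignment is a bulk memory copy
        reps += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    # Each repetition reads size_bytes and writes size_bytes.
    return 2 * size_bytes * reps / elapsed / 1e9


def sweep(sizes, min_seconds=0.05):
    """Return {buffer size in bytes: estimated GB/s} for each size."""
    return {size: copy_bandwidth_gbs(size, min_seconds) for size in sizes}


if __name__ == "__main__":
    # 8 KiB to 32 MiB per buffer; extend the range to reach DRAM on large L3s.
    for size, gbs in sweep([2 ** k for k in range(13, 26)]).items():
        print(f"{size / 1024:10.0f} KiB  {gbs:8.2f} GB/s")
```

For per-core numbers the process should be pinned to a single core (for example with `taskset` on Linux), and the plateaus in the output should be compared against the known cache sizes of the part being tested.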