I'm working on a project to add AMD AOCL (particularly BLIS) to a product that currently uses MKL and Intel OpenMP. This is all on 64-bit Linux only. I've also switched to GCC's OpenMP runtime (libgomp).
I am starting to see some improvements, but some benchmarks show very serious performance degradations. From what I can tell, this is mostly due to barriers and thread locks (mostly in our code, not in BLIS).
For analysis I've been using Linux perf and Hotspot. I'll also be taking a look at uProf: https://www.amd.com/en/developer/uprof.html
Are there any other AMD-specific tools for performance analysis?