It would be very helpful if we could analyze the cost of each instruction or each line of OpenCL code.
Either the ROCm or the AMDGPU driver is fine.
Thanks in advance.
In static analyzer mode, CodeXL lets you navigate through the ISA code and see the estimated cost of each instruction in clock cycles. I'm not sure whether this information helps you.
Currently, CodeXL supports only two levels of profiling: 1) API timeline trace and 2) kernel-level performance counters.