I am trying to use CodeXL to collect hardware performance counts (total cycles, instruction count, branches, branch misses, etc.). It appears to work, but the numbers are far lower than I expected, so perhaps I am interpreting them incorrectly. I am running the SPEC 2006 floating-point benchmark 410.bwaves. Below is a sample of my CodeXL results, along with the results I obtained using PerfSuite, which uses PAPI counters.
CodeXL:
Ret Inst: 9,645,523
CPU Clocks: 15,513,606
Ret Branch: 3,907,189

PerfSuite (PAPI):
Total Instructions: 2,180,016,132,220
Total Cycles: 2,459,547,852,840
Branch Instr: 127,643,611,345
I understand the difference between 'total' instructions and 'retired' instructions, but I would expect these numbers to be much closer together. Moreover, a SPEC 2006 application retires far more than 9M instructions in a full reference run. The total execution time is similar in both cases (PerfSuite: 651s vs. CodeXL: 713s), which is another reason I'm confused.
Is there some calculation that I am missing, such as multiplying the counter statistics by some formula like samples per second x reference interval?
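To make the gap concrete, here is a small Python sketch that divides each PerfSuite total by the corresponding CodeXL value. The pairing of counters (Ret Inst with Total Instructions, CPU Clocks with Total Cycles, Ret Branch with Branch Instr) is my own assumption, and the `scale_samples` helper with its `interval` parameter is hypothetical: it only illustrates the kind of formula I am asking about, not a confirmed CodeXL calculation.

```python
# Values copied from the results above.
codexl = {
    "instructions": 9_645_523,   # Ret Inst
    "cycles": 15_513_606,        # CPU Clocks
    "branches": 3_907_189,       # Ret Branch
}
perfsuite = {
    "instructions": 2_180_016_132_220,  # Total Instructions
    "cycles": 2_459_547_852_840,        # Total Cycles
    "branches": 127_643_611_345,        # Branch Instr
}


def ratios(small, large):
    """Return large[k] / small[k] for every counter k."""
    return {k: large[k] / small[k] for k in small}


def scale_samples(samples, interval):
    """Hypothetical scaling: treat the reported value as a sample count
    and multiply by an assumed number of events per sample."""
    return samples * interval


if __name__ == "__main__":
    for name, r in ratios(codexl, perfsuite).items():
        print(f"{name}: PerfSuite / CodeXL = {r:,.0f}")
```

The three ratios come out differently (roughly 226,000 for instructions, 158,500 for cycles, and 32,700 for branches), so if scaling is the answer, a single multiplier would not be enough; each event would need its own interval.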