2 Replies Latest reply on Sep 14, 2010 12:29 PM by pdrongowski

    Can someone explain something?

      Why is: add rax,rcx so costly?

      I have seen the following in the CodeAnalyst breakdown of a small function; each line of machine code is executed the same number of times.

      mov rax,[rdx]

      mov rax,[rcx+rax+00000150h]

      add rax,rcx


      Now the first 'mov' line takes 4 CPU clocks, the second 'mov' line takes 2 CPU clocks, but the 'add' line takes 63 CPU clocks (this is what appears in the source code stats in CodeAnalyst).

      I'm seeing this sort of unexpected disparity in many places as I profile a large API and test programs: innocent-looking machine instructions appear to take far longer than similar ones nearby.

      Is the displayed 'CPU clocks' figure reliable? (The 'hot' instructions don't seem to change, so I guess it is.)

      This is in optimized x64 C code running under Vista x64 with an Athlon X2 6000+ and 8GB RAM and benchmarked using Visual Studio 2008 with the Code Analyst addin.

      The profiling used is simply "Assess Performance".

      Many of the 'hot functions' we are zooming in on often end up having this kind of bizarre cause: isolated little instructions that seem to be consuming lots of cycles.

      I'm no guru on the internals of the x64 processors or the timings of the x64 instruction set, but these numbers do look suspicious.

      I'd appreciate any insights into what is going on here.










        • Can someone explain something?

          I'm not a CodeAnalyst expert, but perhaps it's taking into account the pipeline stall due to the memory fetch. The data read by the second 'mov' isn't actually needed until the 'add', so perhaps the fetch time is attributed to the 'add' instead of the 'mov'.
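To make the stall idea concrete, here is a minimal Python sketch of a toy in-order machine that charges waiting time to the instruction that needs an operand rather than to the one that produces it. Everything in it is an assumption for illustration: the function name `stall_per_instruction`, the one-issue-per-cycle model, and especially the latencies (3 cycles for a cache hit, 60 for a hypothetical cache miss on the second load). It is not how CodeAnalyst or the Athlon actually accounts for cycles.

```python
def stall_per_instruction(program, latency):
    """Return {instruction: cycles spent waiting for its source operands}.

    program: list of (text, destination register, source registers).
    latency: {text: cycles until the result is available} (hypothetical values).
    Toy model: at most one instruction issues per cycle, strictly in order.
    """
    ready = {}    # register -> cycle at which its value becomes available
    clock = 0     # earliest cycle at which the next instruction could issue
    stalls = {}
    for text, dest, sources in program:
        operands_ready = max((ready.get(reg, 0) for reg in sources), default=0)
        start = max(clock, operands_ready)
        stalls[text] = start - clock          # waiting time lands on the consumer
        ready[dest] = start + latency[text]
        clock = start + 1
    return stalls


PROGRAM = [
    ("mov rax,[rdx]", "rax", ["rdx"]),
    ("mov rax,[rcx+rax+150h]", "rax", ["rcx", "rax"]),
    ("add rax,rcx", "rax", ["rax", "rcx"]),
]
# Assumed latencies: first load hits the cache, second load misses it.
LATENCY = {"mov rax,[rdx]": 3, "mov rax,[rcx+rax+150h]": 60, "add rax,rcx": 1}

STALLS = stall_per_instruction(PROGRAM, LATENCY)
```

Under these assumed numbers the 'add' accumulates almost all of the waiting, even though the slow operation is the load before it.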

          • Can someone explain something?

            Hello Hugh --

            CodeAnalyst uses statistical sampling to collect the data for its profiles. The "Assess Performance" configuration samples the hardware performance counters. One of the performance counters is configured to measure 0x076 CPU Clocks Not Halted. When a predetermined number of CPU clocks have occurred, the counter hardware causes an interrupt and the CodeAnalyst driver collects a sample. The "predetermined number" is the sampling period and is 250,000 for the stock "Assess Performance" configuration.
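As a rough illustration of the arithmetic behind sampling, the Python sketch below converts a sample count into an approximate number of unhalted clocks using the 250,000 period mentioned above. The function names are hypothetical, and treating a per-line figure such as 63 as a sample count is only an assumption made for the example.

```python
SAMPLING_PERIOD = 250_000  # clocks per sample in the stock "Assess Performance" configuration


def estimated_clocks(samples, period=SAMPLING_PERIOD):
    """Approximate unhalted CPU clocks represented by `samples` samples."""
    return samples * period


def share_of_profile(samples, total_samples):
    """Fraction of all collected samples that landed on one code location."""
    return samples / total_samples


# Assumption for illustration only: the per-line figures are sample counts.
LINE_SAMPLES = {"mov rax,[rdx]": 4, "mov rax,[rcx+rax+150h]": 2, "add rax,rcx": 63}
TOTAL = sum(LINE_SAMPLES.values())
ADD_CLOCKS = estimated_clocks(LINE_SAMPLES["add rax,rcx"])
ADD_SHARE = share_of_profile(LINE_SAMPLES["add rax,rcx"], TOTAL)
```

Read this way, each sample stands for roughly a quarter of a million clocks spent somewhere near the sampled address, not for the cost of a single execution of that instruction.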

            The CPU hardware does not retain the exact instruction address (IP) for the instruction that crossed the event threshold. The driver must use the restart address (i.e., the return address for the interrupt) to associate the sample with a code region. The restart address is "near" the actual culprit instruction. In an out-of-order execution machine, the skid from the culprit instruction to the point of interrupt can vary and accumulate.
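The effect of skid on a per-instruction histogram can be sketched with a small deterministic Python model. It is a simplification under stated assumptions: the per-instruction costs are made up (a hypothetical 60-cycle load), the skid is a fixed number of instructions rather than the variable skid of real hardware, and the `profile` function is invented for this example.

```python
from collections import Counter


def profile(costs, period, skid=0, iterations=1000):
    """Sample a repeatedly executed straight-line sequence every `period` cycles.

    costs: list of (text, cycles) with hypothetical per-instruction costs.
    skid: how many instructions past the culprit the sample is attributed to.
    Returns a Counter mapping instruction text to number of samples.
    """
    names = [text for text, _ in costs]
    histogram = Counter()
    clock = 0
    next_sample = period
    for _ in range(iterations):
        for index, (_, cycles) in enumerate(costs):
            clock += cycles
            # Every threshold crossed while this instruction ran yields one sample.
            while clock >= next_sample:
                histogram[names[(index + skid) % len(names)]] += 1
                next_sample += period
    return histogram


# Assumed costs: the second load is the real culprit in this toy example.
COSTS = [("mov rax,[rdx]", 4), ("mov rax,[rcx+rax+150h]", 60), ("add rax,rcx", 1)]

EXACT = profile(COSTS, period=7, skid=0, iterations=700)
SKIDDED = profile(COSTS, period=7, skid=1, iterations=700)
```

With no skid the samples pile up on the slow load; with a skid of one instruction the same pile moves to the 'add'. The total per region is unchanged, which is why the profile is trustworthy for hot regions but not for individual instructions.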

            So, in practical terms, the CPU clocks profile identifies hot code regions, but not individual hot instructions.

            -- pj