I'm not a CodeAnalyst expert, but the profile may be reflecting the pipeline stall caused by the memory fetch. The data read by the second 'mov' isn't actually needed until the 'add', so the fetch time may be attributed to the 'add' instead of the 'mov'.
Hello Hugh --
CodeAnalyst uses statistical sampling to collect the data for its profiles. The "Assess Performance" configuration samples the hardware performance counters. One of the performance counters is configured to measure event 0x076, CPU Clocks Not Halted. When a predetermined number of CPU clocks have occurred, the counter hardware raises an interrupt and the CodeAnalyst driver collects a sample. The "predetermined number" is the sampling period, which is 250,000 for the stock "Assess Performance" configuration.
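To make the sampling period concrete, here is a rough sketch of how a sample count maps back to an approximate clock count. The function names, the 1,200-sample figure, and the 2.0 GHz clock rate are made up for illustration; only the 250,000 period comes from the stock configuration described above.

```python
# Sampling period of the stock "Assess Performance" configuration (from the text above).
SAMPLING_PERIOD = 250_000  # unhalted CPU clocks per sample


def estimated_clocks(sample_count, period=SAMPLING_PERIOD):
    """Approximate number of unhalted clocks represented by sample_count samples."""
    return sample_count * period


def estimated_seconds(sample_count, clock_hz, period=SAMPLING_PERIOD):
    """Approximate wall time for those clocks at an assumed core frequency."""
    return estimated_clocks(sample_count, period) / clock_hz


# Hypothetical example: 1,200 samples in one code region on an assumed 2.0 GHz core.
print(estimated_clocks(1200))
print(estimated_seconds(1200, 2.0e9))
```

Each sample stands for roughly one sampling period's worth of clocks, so these numbers are statistical estimates, not exact counts.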
The CPU hardware does not retain the exact instruction address (IP) of the instruction that crossed the event threshold. The driver must use the restart address (i.e., the return address for the interrupt) to associate the sample with a code region. The restart address is "near" the actual culprit instruction. In an out-of-order execution machine, the skid from the culprit instruction to the point of interrupt varies and can accumulate.
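The effect of skid can be illustrated with a toy model. This is only a sketch under strong assumptions: it executes instructions in order, gives each a fixed cost in clocks, and uses a fixed skid of one instruction, none of which matches real out-of-order hardware, where the skid varies. The instruction names and costs are hypothetical.

```python
from collections import Counter


def profile(loop, period, total_clocks, skid=1):
    """Toy sampler: `loop` is a list of (name, cost_in_clocks) pairs run repeatedly.

    Every `period` clocks a sample is taken and charged to the instruction
    `skid` slots after the one that was executing when the threshold was crossed.
    """
    samples = Counter()
    clock = 0
    next_sample = period
    i = 0
    while clock < total_clocks:
        _, cost = loop[i % len(loop)]
        clock += cost
        # The threshold may be crossed more than once during a long instruction.
        while next_sample <= clock:
            blamed = loop[(i + skid) % len(loop)][0]
            samples[blamed] += 1
            next_sample += period
        i += 1
    return samples


# Hypothetical loop body: the second load is slow (e.g., a cache miss).
loop = [("load_a", 1), ("load_b", 100), ("add", 1), ("branch", 1)]

exact = profile(loop, period=1000, total_clocks=1_000_000, skid=0)
skidded = profile(loop, period=1000, total_clocks=1_000_000, skid=1)
print(exact.most_common(1)[0][0])    # instruction charged with no skid
print(skidded.most_common(1)[0][0])  # instruction charged with a one-slot skid
```

In this model the slow load consumes almost all of the clocks, yet with a one-instruction skid most samples are charged to the instruction that follows it.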
So, in practical terms, the CPU clocks profile identifies hot code regions, but not individual hot instructions.