1 Reply Latest reply on Apr 19, 2009 9:51 AM by avk

    L3 cache performance Barcelona

    moebiusband

      I am trying to understand L3 cache performance on the AMD Barcelona processor.

      I run simple data transfer micro-benchmarks, such as load, store, and copy, for streaming loop-based code kernels.

      Load: I use a negative index to save the cmp instruction.

      .align 32
      1:
      movaps xmm0, [rdi + rax * 8]          # four 16-byte aligned loads
      movaps xmm1, [rdi + rax * 8 + 16]     # = one 64-byte cache line
      movaps xmm2, [rdi + rax * 8 + 32]
      movaps xmm3, [rdi + rax * 8 + 48]
      add rax, 8                            # index runs from -n up to 0
      js 1b                                 # loop while the index is still negative

      For a vector length of 1024 doubles I measure on average 2.2 cycles per cache line update, which is reasonable if one takes into account the two 16-byte loads per cycle: 32 bytes per cycle means 2 cycles for the 64-byte cache line. With a data set fitting into L2 (16384 doubles) I measure nearly 8 cycles per cache line update. This also makes sense: miss in L1 -> check L2 -> hit. Loading the cache line from L2 to L1 and copying the evicted cache line back from L1 to L2 moves 128 bytes, which at 32 bytes/cycle L2-to-L1 bandwidth takes 4 transfer cycles. That gives 2 cycles for the update of the cache line in L1 plus 4 cycles for the data transfer between L2 and L1, plus some possible latency cost, which could be hidden for my sequential load stream.

      Now with L3 things get more complicated. The data set fits into L3: miss in L1 -> miss in L2 -> hit in L3 -> load the cache line directly into L1 -> copy the evicted cache line back from L1 to L2 -> copy the evicted cache line back from L2 to L3. As I read it, the L3 does not run at core clock. If I nevertheless assume it does, I get 2 cycles for the load from L3 to L1 plus 4 cycles for the cache line evicts, plus of course the 2 cycles I still need to update the cache line in L1. That results in 8 cycles for the load stream if neglecting any latency. If I add the overhead cycles I get 2 cycles more, i.e. 10 cycles. But I measure over 16 cycles per cache line update. That means I pay 6 cycles for L3 access per cache line, which is not good for a pure streaming data load. On my machine this results in an effective bandwidth of only 7481 MB/s from L3 cache. Is this reasonable, or am I missing something?

      Store:

      The code is similar to the above: I load some data into registers and subsequently store this data to a vector. I get a store miss in L1 -> store miss in L2 -> hit in L3. The cache line is allocated from L3 into L1, and the cache line can then be deleted from L3. But a cache line has to be evicted from L1 to L2, and from L2 to L3.
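      The store kernel itself is not shown; a minimal C sketch with SSE2 intrinsics (my reconstruction, not the original assembly) that produces the analogous four aligned 16-byte stores per cache line could look like this:

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_store_pd */

/* Fill n doubles with a constant; v must be 16-byte aligned and n a
   multiple of 8. Each iteration issues four 16-byte aligned stores,
   i.e. one 64-byte cache line, mirroring the load kernel above.     */
void store_kernel(double *v, long n)
{
    const __m128d x = _mm_set1_pd(1.0);
    for (long i = 0; i < n; i += 8) {
        _mm_store_pd(v + i,     x);
        _mm_store_pd(v + i + 2, x);
        _mm_store_pd(v + i + 4, x);
        _mm_store_pd(v + i + 6, x);
    }
}
```

      Compiled with optimization, this becomes essentially the same loop shape as the load kernel, with aligned stores in place of the loads.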

      A store to L2 should need 8 cycles: 4 cycles for the L1 update (only 16 bytes per cycle on the store side) and 4 cycles for transferring the cache lines between L1 and L2. I measure 12 cycles for a store to L2 and 18 cycles for a store to L3. For completeness, I measure 38 cycles for copy, slightly more than the sum of the pure load and store durations, but close.

      Are my assumptions about the data paths correct? Are the additional cycles beyond the pure data transfer cycles reasonable? I know this is a synthetic case with no cache reuse at all, but data streaming is common, and the raw L3 bandwidth available to streaming is important.

      Thanks for your help!