2 Replies · Latest reply on Feb 14, 2012 7:34 PM by fabien.gaud

    Troubleshooting with HyperTransport link hardware counters

    fabien.gaud

      Hi,

       

      We are trying to evaluate the usage of the HyperTransport links. For that purpose, we are using the following hardware counters: 0x0F6, 0x0F7 and 0x0F8 (our processors have three HT links). To obtain the usage, we divide the count obtained with unit mask 0x37 (data) by the count obtained with unit mask 0x3f (nop+data). We use perf to periodically gather the values of these counters on each processor (we also tried with an older kernel and perfmon).
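
      We actually drive this with perf, but a minimal perf_event_open sketch of what one measurement boils down to looks like this (assuming the usual AMD raw encoding config = (umask << 8) | event for these sub-0x100 event selects; link 0 only, read from one core):

      /* Minimal sketch (assumption: raw config = (umask << 8) | event). It reads
       * the link 0 transmit bandwidth event (0x0F6) with the two unit masks we
       * use and prints the resulting link usage. Links 1 and 2 would use
       * events 0x0F7 / 0x0F8. */
      #define _GNU_SOURCE
      #include <linux/perf_event.h>
      #include <sys/ioctl.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      static int open_raw_counter(uint64_t config, int cpu)
      {
          struct perf_event_attr attr;
          memset(&attr, 0, sizeof(attr));
          attr.size     = sizeof(attr);
          attr.type     = PERF_TYPE_RAW;
          attr.config   = config;
          attr.disabled = 1;                 /* started explicitly below */
          /* pid = -1, cpu = target core: count system-wide on that core
           * (needs root or a permissive perf_event_paranoid). */
          return (int)syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
      }

      int main(void)
      {
          int cpu = 0;   /* one core per node is enough: the event counts for the whole node */
          int fd_data = open_raw_counter(0x37F6, cpu);   /* unit mask 0x37: data     */
          int fd_all  = open_raw_counter(0x3FF6, cpu);   /* unit mask 0x3f: nop+data */
          if (fd_data < 0 || fd_all < 0) { perror("perf_event_open"); return 1; }

          ioctl(fd_data, PERF_EVENT_IOC_ENABLE, 0);
          ioctl(fd_all,  PERF_EVENT_IOC_ENABLE, 0);
          sleep(1);                                      /* measurement interval */
          ioctl(fd_data, PERF_EVENT_IOC_DISABLE, 0);
          ioctl(fd_all,  PERF_EVENT_IOC_DISABLE, 0);

          uint64_t data = 0, all = 0;
          read(fd_data, &data, sizeof(data));
          read(fd_all,  &all,  sizeof(all));
          printf("link 0 usage (node of cpu %d): %.1f%%\n",
                 cpu, all ? 100.0 * (double)data / (double)all : 0.0);
          return 0;
      }

      Under that encoding, the equivalent perf raw events are r37f6 and r3ff6 (and r37f7/r3ff7, r37f8/r3ff8 for the other links).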

       

      We want to benchmark two architectures: one composed of four AMD Opteron 8356 processors (Barcelona, 4 cores per processor) and one composed of four AMD Opteron 8435 processors (Istanbul, 6 cores per processor). The processors are interconnected through HT links (version 1.0 on the 16-core machine and 3.0 on the 24-core machine), and the interconnect topology is the same on both machines (P = processor):

             ------     ------
      I/O -- | P0 |-----| P1 |
             ------     ------
                |     /    |
                |    /     |
                |   /      |
             ------     ------
             | P2 |-----| P3 | -- I/O
             ------     ------


      We ran a CPU burn benchmark (one thread per core, each basically spinning on a register) and we expected to see very low link usage. On the 16-core machine, we measured an average usage lower than 1%. However, on the 24-core machine, we obtained very surprising measurements: the links of all processors are reported as 50% used (except the I/O links).
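
      For completeness, the CPU burn workload is essentially the following (a simplified sketch, not the exact code we ran):

      /* Simplified CPU burn sketch: one thread pinned to each online core,
       * spinning on a thread-local counter (no shared data, no remote memory). */
      #define _GNU_SOURCE
      #include <pthread.h>
      #include <sched.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      static void *burn(void *arg)
      {
          int cpu = (int)(long)arg;
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  /* pin to one core */

          volatile unsigned long x = 0;    /* local to this core's stack/L1 */
          for (;;)
              x++;
          return NULL;
      }

      int main(void)
      {
          int ncpus = (int)sysconf(_SC_NPROCESSORS_ONLN);
          pthread_t *tids = malloc((size_t)ncpus * sizeof(*tids));
          for (int i = 0; i < ncpus; i++)
              pthread_create(&tids[i], NULL, burn, (void *)(long)i);
          pause();                         /* run until interrupted */
          return 0;
      }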

       

      In other words, these hardware counter values do not seem consistent with what the workload actually does.

      Note that we also tried a microbenchmark performing memory accesses to different memory nodes. The results are consistent on the 16-core machine but, again, look wrong on the 24-core machine.

       

      Has anybody already experienced such weird results on the Istanbul architecture, or did I misunderstand something?

       

      Thanks in advance for your help,

        • Re: Troubleshooting with HyperTransport link hardware counters

          Hello,

           

          I spoke with some other engineers.  I hope this is useful.  You might consider moving to a G34-based platform.

           

          The Barcelona parts do not have HT Assist, while the Istanbul parts do.  When HT Assist is on, probe traffic is reduced significantly, so the coherent HT links are freed from a lot of that traffic.  So, if you do small amounts of work or the system is idle, the traffic on the Barcelona system should be higher than on the Istanbul system.

           

          One way to verify your results would be to do a remote bandwidth test, like STREAM.  You can compare the bandwidth measured on the coherent HT link with the STREAM result; they should roughly match.

           

          Some things to consider with your bandwidth measurement (a rough numeric sketch follows the list):

          • The counters on a coherent HT link only measure transmitted data, not both transmitted and received.
          • The northbridge counters for a particular node are shared among all cores for that node, so you will probably want to separate the data by node and divide by the number of cores in that node.
          • To get bi-directional bandwidth, you also need to consider the data that other nodes are transmitting.
          • For NUMA node 0, there are also some split coherent HT links and the non-coherent HT link for the PCI devices.
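
          As a rough illustration of that accounting (just a sketch with made-up numbers, assuming the link events count 4-byte DWORDs and only the transmit direction of the measured node):

          /* Sketch: turning per-node link counter deltas into bandwidth figures.
           * Assumes 4-byte DWORDs and uses hypothetical counter values. */
          #include <stdio.h>

          int main(void)
          {
              double secs            = 1.0;     /* measurement interval */
              double tx_dwords_node0 = 7.0e8;   /* node 0 -> node 1, from node 0's counter */
              double tx_dwords_node1 = 2.0e8;   /* node 1 -> node 0, from node 1's counter */
              int    cores_per_node  = 6;       /* Istanbul */

              double tx_gbs = tx_dwords_node0 * 4.0 / 1e9 / secs;                    /* one direction   */
              double bi_gbs = (tx_dwords_node0 + tx_dwords_node1) * 4.0 / 1e9 / secs; /* both directions */

              printf("node 0 transmit    : %.2f GB/s\n", tx_gbs);
              printf("bi-directional     : %.2f GB/s\n", bi_gbs);
              printf("per-core share (tx): %.2f GB/s\n", tx_gbs / cores_per_node);
              return 0;
          }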

           

          You could apply a principle of conservation of data volume: what the memory controller is processing from the memory banks is what flows through the coherent HT links and is consumed by the cores.  We recommend the Family 10h BKDG (which covers both Barcelona and Istanbul) for more information about the performance events.

           

          From memory, the maximum traffic through coherent HT links on Barcelona was around 2.5 GB/s bidirectional and would be around 5 GB/s on Istanbul.  What results do you get from the STREAM benchmark?

           

          I hope this helps,

          -=Frank

            • Re: Troubleshooting with HyperTransport link hardware counters
              fabien.gaud

              Hi Frank,

               

              Thank you very much for your answer. I agree with you: with HT Assist, the traffic on the HT links should be lower on the Istanbul architecture than on the Barcelona one.

              However, in practice, the Barcelona machine gives consistent results (no traffic on the links with a CPU burn), whereas the Istanbul machine always reports at least 50% HT link usage...

               

              I did the following experiment: I pinned the memory on node 1 and accessed it from node 0 (read accesses are performed by all the cores of node 0). It's a home-made benchmark, but STREAM gives very similar results.
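
              The core of the experiment is roughly the following (a simplified, single-threaded sketch of the idea using libnuma, not the exact code; the real benchmark runs one such loop on every core of node 0):

              /* Simplified sketch of the remote-read experiment: allocate a buffer
               * on node 1 with libnuma, run on node 0, stream through the buffer
               * and report the achieved read throughput.  Build with -lnuma. */
              #include <numa.h>
              #include <stdio.h>
              #include <stdint.h>
              #include <time.h>

              #define BUF_SIZE (512UL * 1024 * 1024)   /* large enough to defeat the caches */
              #define ITER     10

              int main(void)
              {
                  if (numa_available() < 0) { fprintf(stderr, "no NUMA\n"); return 1; }

                  numa_run_on_node(0);                                      /* execute on node 0  */
                  volatile uint64_t *buf = numa_alloc_onnode(BUF_SIZE, 1);  /* memory on node 1   */
                  if (!buf) return 1;
                  size_t n = BUF_SIZE / sizeof(uint64_t);
                  for (size_t i = 0; i < n; i++) buf[i] = i;                /* fault the pages in */

                  struct timespec t0, t1;
                  clock_gettime(CLOCK_MONOTONIC, &t0);
                  uint64_t sink = 0;
                  for (int it = 0; it < ITER; it++)
                      for (size_t i = 0; i < n; i++)
                          sink += buf[i];                                   /* remote reads       */
                  clock_gettime(CLOCK_MONOTONIC, &t1);

                  double secs = (double)(t1.tv_sec - t0.tv_sec)
                              + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
                  printf("read throughput: %.2f GB/s (sink=%llu)\n",
                         (double)BUF_SIZE * ITER / secs / 1e9, (unsigned long long)sink);
                  numa_free((void *)buf, BUF_SIZE);
                  return 0;
              }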

               

              * On the Barcelona architecture, I measured a throughput of 2.8GB/s. The link between node 0 and node 1 is used at 28% and the link between node 1 and node 0 is used at 99%. Other links are slightly used for the cache coherency protocol (less than 10%).

               

              * On the Istanbul architecture, I measured a throughput of 4.2GB/s. The link between node 0 and node 1 is used at 73% and the link between node 1 and node 0 is used at 87%. The other links are reported at 50% usage, which I cannot believe; it is probably the same problem as the one observed with the CPU burn.

               

              Note that if I perform write accesses instead of read accesses, the link usages are inverted (i.e., the link between node 0 and node 1 is the most used), but the conclusions are the same.

               

               

              Thanks for your help,

              Fabien