5 Replies Latest reply on Jul 23, 2009 3:08 AM by prako

    Hardware Benchmarking tools - Confusing results on Code Analyst

    prako

      Hi,
      I'm using a 4-core AMD Opteron PC. When I parallelise a serial legacy application (using OpenMP) and run it on the 4 cores, I get very inconsistent results compared with the results on a single or dual core. I want to benchmark my PC to find out how well the 4 cores are being utilised and to check the load on each core. I also want to know whether the program is being run on four threads as expected (I've used the default OpenMP settings to spawn four threads). Could someone please suggest some good tools that would give the above information, as well as information on memory usage, etc.?

       

      I've used CodeAnalyst, but the results are not clear.
      It shows four threads running on a single core, which is pretty confusing.

      Thanks

        • Hardware Benchmarking tools - Confusing results on Code Analyst
          pdrongowski

          Hi --

          I couldn't tell whether you're working on Windows or Linux.

          On Windows, I took an old serial program of mine (matrix multiply) and used OpenMP to parallelize it. I created a new CodeAnalyst project and used the thread profiling configuration to launch and collect a profile for the program. On a dual core Turion, the thread chart shows two threads with one thread scheduled to core 0 and the other thread scheduled to core 1.

          Whether you're on Windows or Linux, the overall execution time of a properly parallelised program should be shorter than that of the single-threaded version of the program. (A lot hinges on that word "properly"!) In the case of the matrix multiplication program, the single-threaded program runs in 17 seconds and the dual-thread (OpenMP) program runs in 9.7 seconds.
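
          To put those two timings in perspective, here is a minimal Python sketch that turns them into a speedup and a parallel efficiency. The function names are only illustrative (they are not part of CodeAnalyst or OpenMP), and the thread count of 2 assumes one thread per core on the dual-core Turion.

```python
def speedup(serial_seconds, parallel_seconds):
    """Return how many times faster the parallel run is than the serial run."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, threads):
    """Return the speedup per thread (1.0 would be perfect scaling)."""
    return speedup(serial_seconds, parallel_seconds) / threads


# Timings quoted in the post: 17 s single-threaded, 9.7 s with two OpenMP threads.
serial, parallel, threads = 17.0, 9.7, 2  # threads=2 is an assumption (one per core)

s = speedup(serial, parallel)
e = efficiency(serial, parallel, threads)
print(f"speedup: {s:.2f}x, efficiency: {e:.0%}")  # speedup: 1.75x, efficiency: 88%
```

          If you plug in your own single-core and 4-core timings, an efficiency far below 1.0 (or a speedup below 1) suggests the program is not parallelised "properly" yet.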

          Here's another experiment to try on either Windows or Linux. On Windows, I configured CodeAnalyst for Time-Based Profiling. I collected profile data for both the single-threaded and dual-thread (OpenMP) versions of the program. I used the "Separate CPUs" option in the view configuration dialog box (click "Manage" to get there) in order to separate the timer samples by CPU. For the single-threaded program, I got the following timer samples on core 0 and core 1:

              Module                 Core 0    Core 1
              amdk8.sys               11038      5640
              matrix_interchange       5476     11043

          Windows rescheduled the matrix multiply program between core 0 and core 1, which produced the uneven distribution of samples between the two cores. On Windows, amdk8.sys is the idle loop, so you can see that each core was idle part of the time.

          For the dual-thread program, I got the following distribution of timer samples between core 0 (column 1) and core 1 (column 2):

              Module                 Core 0    Core 1
              matrix_omp               9199      9256
              amdk8.sys                 307       182

          This is a pretty even split between the two cores since there were two threads that kept both cores busy. Further, the idle loop (amdk8.sys) didn't get very many timer samples at all!
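
          As a rough check on those numbers, the sketch below computes the share of timer samples that landed in the idle loop on each core for both runs. It is only an approximation: it counts just the two modules listed in each table and assumes no other module collected a significant number of samples. The dictionary layout and function name are made up for this example.

```python
def idle_fraction(idle_samples, program_samples):
    """Return the fraction of counted timer samples that fell in the idle loop."""
    total = idle_samples + program_samples
    return idle_samples / total if total else 0.0


# Timer samples from the post, as (core 0, core 1).
# "idle" is amdk8.sys; "program" is matrix_interchange or matrix_omp.
runs = {
    "single thread": {"idle": (11038, 5640), "program": (5476, 11043)},
    "two threads (OpenMP)": {"idle": (307, 182), "program": (9199, 9256)},
}

for name, samples in runs.items():
    pairs = zip(samples["idle"], samples["program"])
    for core, (idle, program) in enumerate(pairs):
        print(f"{name}, core {core}: {idle_fraction(idle, program):.1%} idle")
```

          With these counts, the single-threaded run leaves core 0 about 67% idle and core 1 about 34% idle, while the OpenMP run leaves each core roughly 2-3% idle. A 4-core run with four busy threads should show a similarly low idle share on all four cores.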

          I hope this helps you to troubleshoot your program.

          -- pj