3 Replies Latest reply on Apr 29, 2008 3:01 PM by avk

    Transpose/copy slow on AMD Opteron?

      Performance comparison between Xeon/Opteron


      The transpose code below was compiled with gcc (using either -O0 or -O3).  It ran 3 to 5 times slower on a Dual-Core Opteron 2220 than on a Dual-Core Xeon 5160.  This was observed on both 64-bit Linux and 32-bit XP. What could be the reason for this, and how can the code be sped up?

      ======= Sample Code ===============

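      #include <stdio.h>
      #include <stdlib.h>     /* malloc, free */
      #include <sys/time.h>   /* gettimeofday */
      #include <memory.h>

      /* Wall-clock time in seconds */
      double timing(void)
      {
          struct timeval tv;      /* Structure for storing time    */
          gettimeofday(&tv, NULL);
          return((double)tv.tv_sec + (double)tv.tv_usec*1e-6);
      }

      int main(void)
      {
          double t;
          float *a,*b;
          float c, d;
          int i,j,size;

          size = 10000;
          a = (float *)malloc(size*size*sizeof(float));
          b = (float *)malloc(size*size*sizeof(float));
          if (a == NULL || b == NULL) {
              fprintf(stderr, "malloc failed\n");
              return 1;
          }

          t = timing();

          /* a is written row by row; b is read with a stride of size floats.
             Note: b is never initialized, so the copied values are indeterminate. */
          for(i = 0; i < size; i++){
           for(j = 0; j < size; j++){
            a[i*size + j] = b[i + j*size];
           }
          }

          printf("Time taken for size %i is: %lf\n",size,timing()-t);

          free(a);
          free(b);
          return 0;
      }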

      ============End sample code==================




        • Transpose/copy slow on AMD Opteron?
          Well, I can see that your code sample uses quite a lot of memory: two arrays of about 400 MB each, 800 MB in total. I think this is hard work for the Opterons, because they have only 1 MB of L2 cache per core, whereas the Xeons have 4 MB of L2 cache shared between two cores. In the latter case a single-threaded process can use the full 4 MB for itself. Moreover, the Xeons have a more advanced hardware prefetch mechanism and a 128-bit wide FPU (Opteron 22xx has a 64-bit wide one). All of this, apparently, leads to the results you got. Nevertheless, I believe it is possible to improve the overall performance of your code sample by using a technique named "Block Prefetch", which is described in the AMD Optimization Manual.
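          To illustrate the cache argument, here is a minimal sketch of a cache-blocked (tiled) transpose. Note that this is plain loop tiling, not the "Block Prefetch" technique from the AMD manual, and the names transpose_naive, transpose_blocked and the block parameter are made up for this example. In the original loop each read of b jumps size*4 bytes (40,000 bytes for size = 10000), so almost every read lands on a different cache line; working on small tiles lets those lines be reused while they are still cached. Whether and how much this helps on the Opteron 2220 would have to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Naive transpose with the same access pattern as the sample code:
   dst is written row by row, src is read with a stride of n floats. */
void transpose_naive(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[i * n + j] = src[i + j * n];
}

/* Tiled transpose (hypothetical helper): walks the matrix in
   block x block tiles so the cache lines touched by the strided
   reads can be reused before they are evicted. Handles n that is
   not a multiple of block by clamping the tile edges. */
void transpose_blocked(float *dst, const float *src, size_t n, size_t block)
{
    assert(block > 0);
    for (size_t ii = 0; ii < n; ii += block) {
        size_t i_end = (ii + block < n) ? ii + block : n;
        for (size_t jj = 0; jj < n; jj += block) {
            size_t j_end = (jj + block < n) ? jj + block : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[i * n + j] = src[i + j * n];
        }
    }
}
```

          A call such as transpose_blocked(a, b, size, 32) would replace the double loop in the sample. The value 32 is only a starting guess; the best tile size depends on the cache and should be found by timing a few values.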
            • Transpose/copy slow on AMD Opteron?

              Thank you for the analysis, AVK.  I will look up the block prefetch in the AMD manual.  Incidentally, I also noticed that the AMD is slower by a factor of 2 or more even when the matrices become small enough (on the order of bytes to KB) to reside in the L2 cache.  Can I safely conclude that, in these cases (when the entire matrix is cache-resident), the two other reasons you mention, viz., (i) advanced prefetch and (ii) 128-bit wide FPU, are the reasons for the Xeon outperforming the Opteron?
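              For cache-resident sizes a single pass is very short compared to the microsecond resolution of gettimeofday, so a measurement along the following lines may be more reliable. This is only a sketch: time_small_transpose and its reps parameter are hypothetical names, timing() is the helper from the sample code, and the volatile sink is just an attempt to discourage the optimizer from discarding passes, not a guarantee.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/time.h>

/* Wall-clock time in seconds (same helper as in the sample code). */
double timing(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Hypothetical helper: average seconds per transpose of an n x n
   float matrix over reps passes, or -1.0 if allocation fails. */
double time_small_transpose(int n, int reps)
{
    assert(n > 0 && reps > 0);
    size_t count = (size_t)n * (size_t)n;
    float *a = malloc(count * sizeof *a);
    float *b = malloc(count * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }

    /* Initialize b so the reads are defined and its pages are touched
       before the timed region starts. */
    for (size_t k = 0; k < count; k++)
        b[k] = (float)k;

    volatile float sink = 0.0f;
    double t = timing();
    for (int r = 0; r < reps; r++) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[(size_t)i * n + j] = b[i + (size_t)j * n];
        sink += a[(size_t)r % count];   /* read back one element per pass */
    }
    t = timing() - t;

    free(a);
    free(b);
    return t / reps;
}
```

              For example, time_small_transpose(64, 100000) covers two 16 KB matrices; comparing a few sizes on both machines would show whether the gap really persists once everything fits in cache.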



            • Transpose/copy slow on AMD Opteron?
              It's just my theory, but I think the answer is yes. You see, the CPU models you use, Opteron 2220 and Xeon 5160, belong to different CPU generations, K8 (2003) and Conroe (2006). There are newer Opteron models (23xx and 83xx, the K10 family (2007)); compared to K8 they have a more advanced hardware prefetch mechanism, lower cache latencies, a shared L3 cache and a 128-bit FPU. Are you able to test these new Opterons? I think they can drastically improve performance in any application (including yours) and maybe even outperform the Xeons.
              BTW, did you try other compilers? I heard that the Intel compiler for Linux is free of charge, and many people claim that it produces very fast code. Moreover, there are other very good compilers: PathScale EKO, Portland Group. I mean that gcc is not the only one.