2 Replies Latest reply on Dec 17, 2008 11:39 PM by brockp

    xCOPY() poor performance vs pgcc

      xCOPY() demonstrates poor performance compared to code generated by PGI's pgcc 7.2

      I was curious about the performance of memcpy() vs xCOPY() from ACML.

      My test case is stream.c from http://www.cs.virginia.edu/stream/; I wrote a version of it that uses either ACML or memcpy(). I am using the PGI 7.2 compiler and the ACML it ships with (4.1.0 according to acmlversion()). The hardware platform is an Opteron 2218.

      copy results:

      pgcc stream.c -fastsse  (C code only)

      5737.7932 MB/s

      pgcc stream.c -fastsse -lacml -DTUNED (dcopy())

      5727.4169 MB/s

      pgcc stream.c -fastsse -lacml -DTUNED -DMEMCPY  (use memcpy())

      3056.8649 MB/s
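
      For reference, STREAM counts one read and one write per element for its Copy kernel and reports MB/s with 1 MB = 10^6 bytes. A minimal sketch of that calculation follows; the function name copy_bandwidth_mbs() is hypothetical and not taken from stream.c.

```c
#include <stddef.h>

/* STREAM-style Copy bandwidth in MB/s (1 MB = 1e6 bytes).
 * Each of the n elements is read once and written once,
 * so the traffic is 2 * sizeof(double) * n bytes.
 * copy_bandwidth_mbs is a hypothetical name, not from stream.c. */
double copy_bandwidth_mbs(size_t n, double seconds)
{
    double bytes = 2.0 * (double)sizeof(double) * (double)n;
    return 1.0e-6 * bytes / seconds;
}
```

      Here seconds would be the best (minimum) wall-clock time measured for one pass of the copy over n doubles.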


      So the compiler generates code as good as ACML's, which is great! But the kicker comes when you enable threading. The compiler does a better job of effectively using both memory controllers, which increases memory copy performance from C code, while ACML does not appear to do anything:


      pgcc stream.c -fastsse -mp  (C code with OpenMP)

      10934.6766 MB/s 2 threads

      11822.7677 MB/s 4 threads


      pgcc stream.c -fastsse -mp -lacml_mp -DTUNED (dcopy() with OpenMP)

      4525.8578 MB/s 2 threads

      5177.1047 MB/s 4 threads


      I should point out the C code is just:


      #pragma omp parallel for
              for (j=0; j<N; j++)
                  c[j] = a[j];

      Is there a reason why ACML does not support a threaded dcopy() that uses multiple memory controllers? I realize that at small array sizes threading overhead is a problem, but the OpenMP functions allow userspace code to modify the number of threads, like:


      if (N < 1000)
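
      The snippet above is cut off in the post. As a minimal sketch of the idea, the OpenMP if() clause can keep small copies on a single thread; calling omp_set_num_threads() before the loop would be another way to get a similar effect. The helper copy_doubles() and the cutoff PARALLEL_THRESHOLD (borrowed from the N < 1000 test) are hypothetical and come from neither stream.c nor ACML.

```c
#include <stddef.h>

/* Hypothetical cutoff, borrowed from the "N < 1000" test in the post.
 * The right value depends on the machine and should be measured. */
#define PARALLEL_THRESHOLD 1000

/* Copy n doubles from src to dst.
 * With OpenMP enabled, the if() clause runs the loop on one thread
 * when n is below the cutoff, avoiding threading overhead for small
 * arrays. Without OpenMP the pragma is ignored and the loop is serial. */
void copy_doubles(double *dst, const double *src, size_t n)
{
    long i;
    long len = (long)n;

    #pragma omp parallel for if(len >= PARALLEL_THRESHOLD)
    for (i = 0; i < len; i++)
        dst[i] = src[i];
}
```

      Calling copy_doubles(c, a, N) then behaves like the plain loop for small N and like the threaded loop for large N.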






        • xCOPY() poor performance vs pgcc
          Do you have an example of an application where multithreaded L1 and L2 BLAS functions will provide a significant performance benefit?

          One question is where the threading should occur in an application. For the L1 and L2 functions, are threading choices better left to the application?

          If we can demonstrate the benefit, this is something that could be added to the list of future enhancements.
            • xCOPY() poor performance vs pgcc

              I do not have any off the top of my head. My only example was the benchmark stream quoted above.

              Actually I was surprised. I expect this effect comes from where the data is located in memory on the NUMA system. Maybe this is really a case of memory location mattering.

              For large data it would be kind of nice to have threaded L1 routines (mostly xCOPY and xSCAL); I would not thread when the sizes are small. The problem is, how does ACML know that the data is spread across controllers, and which threads should work on which data, so that the performance of the multiple memory controllers can be exploited?
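
              One way application code commonly deals with this is to initialize and copy with the same static schedule, so each thread mostly works on memory local to it. The sketch below assumes that the OS places a page on the NUMA node of the thread that first writes it (first-touch, the usual Linux default) and that threads stay on their cores. The helper names alloc_first_touch() and copy_static() are hypothetical, not part of ACML or stream.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate n doubles and fill them with value.
 * With OpenMP enabled, each thread writes its own static chunk first,
 * so under a first-touch policy (an assumption about the OS) those
 * pages should end up near that thread. Returns NULL on failure. */
double *alloc_first_touch(size_t n, double value)
{
    long i;
    long len = (long)n;
    double *p = malloc(n * sizeof *p);

    if (p == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (i = 0; i < len; i++)
        p[i] = value;

    return p;
}

/* Copy with the same static schedule used for initialization, so each
 * thread reads and writes the chunk it touched first. */
void copy_static(double *dst, const double *src, size_t n)
{
    long i;
    long len = (long)n;

    #pragma omp parallel for schedule(static)
    for (i = 0; i < len; i++)
        dst[i] = src[i];
}
```

              A library routine such as dcopy() receives only pointers and lengths, so it cannot see how the caller placed the pages, which is the difficulty raised above.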