xCOPY() poor performance vs pgcc

Discussion created by brockp on Sep 29, 2008
Latest reply on Dec 17, 2008 by brockp
xCOPY() demonstrates poor performance compared to code generated by PGI's pgcc 7.2

I was curious about the performance of memcpy() vs xCOPY() from ACML.

My test case is stream.c from http://www.cs.virginia.edu/stream/; I wrote a version of it that uses either ACML or memcpy(). I am using the PGI 7.2 compiler and the ACML it ships with (4.1.0 according to acmlversion()). The hardware platform is an Opteron 2218.
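For readers who want to reproduce this, here is a rough sketch of how the three copy variants could be selected at compile time. It is not the actual modified stream.c: the function name stream_copy is made up for illustration, the TUNED and MEMCPY macros mirror the -D flags used in the results, and the dcopy() prototype is assumed to be the one from ACML's acml.h C interface (n, x, incx, y, incy).

```c
#include <stddef.h>
#include <string.h>

#if defined(TUNED) && !defined(MEMCPY)
#include <acml.h>   /* assumed to declare dcopy(n, x, incx, y, incy) */
#endif

/* Hypothetical helper: copy n doubles from src to dst using the
 * variant selected by the -DTUNED / -DMEMCPY compile flags. */
void stream_copy(double *dst, const double *src, size_t n)
{
#if defined(TUNED) && defined(MEMCPY)
    /* -DTUNED -DMEMCPY: plain libc memcpy() */
    memcpy(dst, src, n * sizeof(double));
#elif defined(TUNED)
    /* -DTUNED: ACML BLAS level 1 copy with unit strides (assumed signature) */
    dcopy((int)n, (double *)src, 1, dst, 1);
#else
    /* no flags: the plain C loop that the compiler optimizes itself */
    size_t j;
    for (j = 0; j < n; j++)
        dst[j] = src[j];
#endif
}
```

Built with no flags this is the C-only case, with -DTUNED -lacml the dcopy() case, and with -DTUNED -DMEMCPY the memcpy() case.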

copy results:

pgcc stream.c -fastsse  (C code only)

5737.7932 MB/s

pgcc stream.c -fastsse -lacml -DTUNED (dcopy())

5727.4169 MB/s

pgcc stream.c -fastsse -lacml -DTUNED -DMEMCPY  (use memcpy())

3056.8649 MB/s
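For anyone comparing against their own timings: STREAM counts a copy as one read plus one write per element and reports MB/s with 1 MB = 10^6 bytes. A small sketch of that arithmetic follows; the helper names copy_bytes and copy_mb_per_s are made up for illustration, and the array element type is assumed to be double.

```c
#include <stddef.h>

/* Bytes moved by one copy pass over n doubles:
 * every element is read once and written once. */
double copy_bytes(size_t n)
{
    return 2.0 * (double)sizeof(double) * (double)n;
}

/* Bandwidth in MB/s (1 MB = 10^6 bytes) for a copy of n doubles
 * that took the given number of seconds. */
double copy_mb_per_s(size_t n, double seconds)
{
    return 1.0e-6 * copy_bytes(n) / seconds;
}
```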

 

So the compiler generates code as good as ACML's, which is great! But the kicker is when you enable threading. The compiler does a better job of effectively using both memory controllers and increasing memory copy performance from C code, while ACML does not appear to do anything:

 

pgcc stream.c -fastsse -mp  (C code with OpenMP)

10934.6766 MB/s 2 threads

11822.7677 MB/s 4 threads

 

pgcc stream.c -fastsse -mp -lacml_mp -DTUNED (dcopy() with OpenMP)

4525.8578 MB/s 2 threads

5177.1047 MB/s 4 threads

 

I should point out the C code is just:

 

#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];

Is there a reason why ACML does not support a threaded dcopy() that uses multiple memory controllers? I realize that at small array sizes threading overhead is a problem, but the OpenMP functions allow user-space code to modify the number of threads, like:

 

if (N < 1000)
  omp_set_num_threads(1)
  copy
else
  copy
end
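The same size cutoff can also be written directly on the loop with OpenMP's if clause, which runs the loop on a single thread when the condition is false. This is a different mechanism from calling omp_set_num_threads(), and the sketch below is only an illustration: the name threshold_copy and the macro COPY_THREAD_THRESHOLD are hypothetical, and 1000 is just the example value from the pseudocode, not a tuned number.

```c
#include <stddef.h>

/* Hypothetical cutoff below which threading overhead is assumed
 * to outweigh the gain; 1000 is only an example value. */
#define COPY_THREAD_THRESHOLD 1000

/* Copy n doubles from src to dst. The loop is split across threads
 * only when n reaches the threshold; smaller copies stay serial. */
void threshold_copy(double *dst, const double *src, size_t n)
{
    long j;
    long len = (long)n;   /* signed index for older OpenMP versions */

    /* Without OpenMP enabled the pragma is ignored and this is a plain loop. */
    #pragma omp parallel for if (len >= COPY_THREAD_THRESHOLD)
    for (j = 0; j < len; j++)
        dst[j] = src[j];
}
```

Compiled with -mp (PGI) or the equivalent OpenMP flag, large copies are threaded and small ones are not, without touching the global thread count.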
