xCOPY() poor performance vs pgcc

xCOPY() demonstrates poor performance compared to code generated by PGI's pgcc 7.2

I was curious about the performance of memcpy() vs xCOPY() from ACML.

My test case is stream.c from: and I wrote a version that uses ACML or memcpy(). I am using the PGI 7.2 compiler and the ACML it ships with (4.1.0, according to acmlversion()). The hardware platform is an Opteron 2218.

copy results:

pgcc stream.c -fastsse  (C code only)

5737.7932 MB/s

pgcc stream.c -fastsse -lacml -DTUNED (dcopy())

5727.4169 MB/s

pgcc stream.c -fastsse -lacml -DTUNED -DMEMCPY  (use memcpy())

3056.8649 MB/s


So the compiler generates code as good as ACML's, which is great! But the kicker is when you enable threading. The compiler does a better job of effectively using both memory controllers and increasing memory copy performance from C code, while ACML does not appear to do anything:


pgcc stream.c -fastsse -mp  (C code with OpenMP)

10934.6766 MB/s 2 threads

11822.7677 MB/s 4 threads


pgcc stream.c -fastsse -mp -lacml_mp -DTUNED (dcopy() with OpenMP)

4525.8578 MB/s 2 threads

5177.1047 MB/s 4 threads


I should point out the C code is just:


#pragma omp parallel for
        for (j=0; j<N; j++)
            c[j] = a[j];
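
For anyone who wants to reproduce this outside of stream.c, here is a minimal sketch of that loop as a standalone function, plus the bandwidth arithmetic as I understand STREAM does it for the copy test (bytes read plus bytes written, divided by time). The names copy_kernel and copy_mb_per_s are my own for illustration and do not appear in stream.c.

```c
#include <stddef.h>

/* Hypothetical helper (not from stream.c): copies n doubles from a to c.
 * Built with OpenMP enabled (e.g. pgcc -mp) the iterations are split
 * across threads; without it the pragma is ignored and the loop runs
 * serially. */
void copy_kernel(double *c, const double *a, long n)
{
    long j;
#pragma omp parallel for
    for (j = 0; j < n; j++)
        c[j] = a[j];
}

/* Copy bandwidth in MB/s (1 MB = 1e6 bytes), counting n doubles read
 * plus n doubles written over the measured time in seconds. */
double copy_mb_per_s(long n, double seconds)
{
    return (2.0 * sizeof(double) * (double)n) / seconds / 1.0e6;
}
```

Time a call to copy_kernel() with whatever timer you trust and pass the elapsed seconds to copy_mb_per_s() to get a number comparable to the ones above.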

Is there a reason why ACML does not thread dcopy() to use multiple memory controllers? I realize that at small array sizes threading overhead is a problem, but the OpenMP functions allow userspace code to adjust the number of threads, like:


if (N < 1000)