Performance regression with 11.6

Discussion created by Kerilk on Jun 16, 2011
Latest reply on Jun 21, 2011 by Kerilk
Some kernels show performance losses of up to 10% when using Catalyst 11.6 compared to 11.5.

We recently acquired a Radeon HD 6970 to run portability tests of BigDFT, an electronic structure simulation code.

Porting the code was rather straightforward, and with minor optimizations it performs similarly on NVIDIA and AMD GPUs. As the optimizations for the Cayman architecture have not been thorough, we hope to have the Radeon HD 6970 outperform the Tesla C2070 we have available. (The whole OpenCL part is double precision only.)

Nonetheless, we saw a serious performance regression using the 11.6 Catalyst drivers (Linux x86_64). Some kernels suffer a 10% performance loss compared to the 11.5 version. It seems related to memory accesses, as the problem appears worse for certain problem sizes. Some kernels seem to benefit slightly from the upgrade, though (about 1-5%).

Do you have better insight into what is causing this regression? Is there a way I could help find a solution?

Thanks in advance

Brice Videau



Edit: my first assumption was apparently wrong; the problem seems related to the auto-vectorizer in the compiler:

Previously, a sequence of:

double tt = 0.0;
tt = fma( *buff++, CONST0, tt);
tt = fma( *buff++, CONST1, tt);
/* ... */
tt = fma( *buff++, CONST15, tt);

was automatically vectorized (I presume). Using the following instead:

double2 tt2 = (double2)(0.0, 0.0);
__local double2 *buff2 = (__local double2 *)buff;
tt2 = mad(*buff2++, (double2)(CONST0, CONST1), tt2);
tt2 = mad(*buff2++, (double2)(CONST2, CONST3), tt2);
/* ... */
tt2 = mad(*buff2++, (double2)(CONST14, CONST15), tt2);
double tt = tt2.x + tt2.y;

gave the performance back. I liked the previous behavior better.