We recently acquired a Radeon HD 6970 to run portability tests of BigDFT (http://inac.cea.fr/L_Sim/BigDFT/), an electronic structure simulation code.
Porting the code was rather straightforward, and with minor optimizations it performs similarly on NVIDIA and AMD GPUs. Since our optimizations for the Cayman architecture have not yet been thorough, we hope to eventually have the Radeon HD 6970 outperform the Tesla C2070 we have available. (The whole OpenCL part is double precision only.)
Nonetheless, we saw a serious performance regression with the Catalyst 11.6 drivers (Linux x86_64). Some kernels suffer a 10% performance loss compared to the 11.5 version. It seems related to memory accesses, as the problem is worse at certain problem sizes. A few kernels do benefit slightly from the upgrade, though (about 1-5%).
Do you have better insight into what is causing this regression? Is there a way I could help find a solution?
Thanks in advance
Edit: my first assumption was apparently false; the problem seems related to the compiler's auto-vectorizer.
Previously, a sequence of:
double tt = 0.0;
tt = fma( *buff++, CONST0, tt);
tt = fma( *buff++, CONST1, tt);
...
tt = fma( *buff++, CONST15, tt);
was automatically vectorized (I presume). Rewriting it with explicit double2 vectors:
double2 tt2 = (double2)(0.0, 0.0);
__local double2 *buff2 = (__local double2 *)buff;
tt2 = mad(*buff2++, (double2)(CONST0,CONST1), tt2);
tt2 = mad(*buff2++, (double2)(CONST2,CONST3), tt2);
...
tt2 = mad(*buff2++, (double2)(CONST14,CONST15), tt2);
double tt = tt2.x + tt2.y;
gave the performance back. I liked the previous behavior better.