Archives Discussions

tractatus · ‎09-06-2010

Seemingly minor kernel change results in 6X kernel slowdown

I have a kernel that runs 6X slower after a seemingly minor code change. The original code is something like:

kernel method1(int *A, int *B, int flag) {
    int *locX;
    int *locY;

    if (flag) {
        locX = A;
        locY = B;
    }
    else {
        locX = B;
        locY = A;
    }

    /* major loop over locX and locY */
}

I tried changing this to take the flag outside of the kernel, and it runs 6X slower:

kernel method2(int *A, int *B) {
    int *locX;
    int *locY;

    locX = A;
    locY = B;

    /* major loop over locX and locY */
}

and I just vary the call to method2 on the host:

if (flag) {
    method2(A, B);
}
else {
    method2(B, A);
}
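To make the intended equivalence concrete, here is a minimal CPU-side sketch in plain C (not OpenCL). It assumes a made-up loop body, since the post does not show the real "major loop"; `major_loop` and `run_method2` are hypothetical names used only for illustration. It shows that choosing the pointers inside the function and swapping the arguments at the call site should produce the same result.

```c
#include <stddef.h>

/* Hypothetical stand-in for the elided "major loop": the post does not
   show the loop body, so this one is invented for illustration only. */
static long major_loop(const int *locX, const int *locY, size_t n) {
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (long)locX[i] * 2 - locY[i];
    return acc;
}

/* method1 analog: the flag is tested once, before the loop. */
long method1(const int *A, const int *B, size_t n, int flag) {
    const int *locX;
    const int *locY;
    if (flag) {
        locX = A;
        locY = B;
    } else {
        locX = B;
        locY = A;
    }
    return major_loop(locX, locY, n);
}

/* method2 analog: no flag; the caller decides the argument order. */
long method2(const int *A, const int *B, size_t n) {
    return major_loop(A, B, n);
}

/* Host-side dispatch: swap the arguments instead of passing the flag. */
long run_method2(const int *A, const int *B, size_t n, int flag) {
    return flag ? method2(A, B, n) : method2(B, A, n);
}
```

Since both forms do the same work on the same data, any timing difference between them would have to come from how the compiler or the hardware handles each form, not from the algorithm itself.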

I am not sure why this should hurt performance: this was the last conditional, and with method2 there are now no conditionals. I am running this on Linux x86_64 using an ATI 5870, and I don't know of any profiler tools that would let me see what is going on. I have tried all sorts of global counts (1280, 2560, 10240, 20480, 40960) with item counts of 64 and 256; a global count of 10240 with an item count of 256 appears to work best.
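As a small worked example of the launch sizes above, the sketch below computes how many work-groups a 1-D launch produces. It assumes that "global count" means the global work size and "item count" means the work-group size, which the post does not state explicitly; `work_group_count` is a hypothetical helper name. In OpenCL 1.x the global work size has to be a whole multiple of the work-group size, so the helper returns 0 when it is not.

```c
#include <stddef.h>

/* Number of work-groups for a 1-D launch, or 0 if the global size is not
   a whole multiple of the work-group size (or the work-group size is 0). */
size_t work_group_count(size_t global_size, size_t local_size) {
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}
```

Under that reading, the best-performing configuration (10240 global, 256 per group) corresponds to 40 work-groups.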

Any insight into the factors affecting performance between method1 and method2, or tools available for Linux x86_64 would be great.

tractatus · ‎09-07-2010

First, to be accurate, there is still the loop conditional in method2. Also, I tried putting an arbitrary conditional in method2, and the timing went back to normal. Conversely, using the ternary operator to set locX and locY in method1 degraded its performance by 6X.

Raistmer · ‎09-07-2010

It looks like the AMD OpenCL compiler goes crazy when there are more than 1-2 branches in a kernel. Sometimes it generates code several times longer when just a single return; is added or commented out.

zeland · ‎09-07-2010

Sometimes better performance can be achieved with

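uintptr_t a = (flag) ? 1 : 0;
uintptr_t b = (flag) ? 0 : 1;
/* pointers cannot be multiplied directly, so go through uintptr_t */
locX = (int *)(a * (uintptr_t)A + b * (uintptr_t)B);
locY = (int *)(b * (uintptr_t)A + a * (uintptr_t)B);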

or

locX=(flag)?A:B;

locY=(flag)?B:A;
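For reference, here is a minimal sketch of the two selection styles in plain C (not OpenCL), using hypothetical helper names `select_ptr_masked` and `select_ptr_ternary`. The masked version goes through `uintptr_t` because C does not allow arithmetic that multiplies pointers. Whether either form compiles to a branch, a conditional move, or something else is up to the compiler, so this only shows that both pick the same pointer, not that one is faster.

```c
#include <stdint.h>

/* Select between two pointers using a bit mask instead of a source-level
   branch: mask is all ones when flag is nonzero, all zeros otherwise. */
const int *select_ptr_masked(int flag, const int *if_true, const int *if_false) {
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)(flag != 0);
    uintptr_t t = (uintptr_t)if_true;
    uintptr_t f = (uintptr_t)if_false;
    /* The result is exactly t or exactly f, never a mix of the two. */
    return (const int *)((t & mask) | (f & ~mask));
}

/* The same selection written with the ternary operator. */
const int *select_ptr_ternary(int flag, const int *if_true, const int *if_false) {
    return flag ? if_true : if_false;
}
```

With these helpers, the assignments above would read `locX = select_ptr_ternary(flag, A, B);` and `locY = select_ptr_ternary(flag, B, A);`.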

tractatus · ‎09-08-2010

So, to summarize: method1 takes 24 seconds to run in my test harness, and method2 takes 173 seconds. When I tried using the CPU instead of the GPU, both methods took 10 seconds.

himanshu_gautam · ‎09-08-2010

tractatus,

It is really hard to say anything without your code.

Please send in your code or a test case.

tractatus · ‎09-08-2010

Hi Himanshu,

I found something that made both forms of my method run at the same speed. I had 3 arrays that were not int: 2 were unsigned char and 1 was unsigned short. I changed all of them to int, and the time for both methods dropped to 16 sec in my test harness; they now run at the same rate.
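The post does not show how the arrays were converted or explain why it helped, so the following is only a sketch of one way to do it on the host in plain C: copy each narrow array into an int array before handing it to the kernel. The helper names `widen_u8_to_int` and `widen_u16_to_int` are hypothetical.

```c
#include <stddef.h>

/* Copy an unsigned char array into an int array, one element each. */
void widen_u8_to_int(const unsigned char *src, int *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = (int)src[i];
}

/* Copy an unsigned short array into an int array, one element each. */
void widen_u16_to_int(const unsigned short *src, int *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = (int)src[i];
}
```

The trade-off is memory: each widened element takes sizeof(int) bytes instead of 1 or 2, so the buffers and the host-to-device transfers grow accordingly.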

himanshu_gautam · ‎09-08-2010

Hi tractatus,

It's good to hear that you debugged it yourself.

All the best
