I have a kernel that runs 6X slower after a seemingly minor code change. The original code is something like:
__kernel void method1(__global int *A, __global int *B, int flag) {
    __global int *locX;
    __global int *locY;
    if (flag) {
        locX = A;
        locY = B;
    }
    else {
        locX = B;
        locY = A;
    }
    /* major loop over locX and locY */
}
I tried changing this to take the flag outside of the kernel, and it runs 6X slower:
__kernel void method2(__global int *A, __global int *B) {
    __global int *locX;
    __global int *locY;
    locX = A;
    locY = B;
    /* major loop over locX and locY */
}
and I just vary the call to method2 on the host:
if (flag) {
    method2(A, B);
}
else {
    method2(B, A);
}
I am not sure why this should hurt performance: this was the last conditional, so method2 now has none. I am running this on Linux x86_64 with an ATI 5870, and I don't know of any profiling tools that would let me see what is going on. I have tried a range of global counts (1280, 2560, 10240, 20480, 40960) with item counts of 64 and 256; a global count of 10240 with an item count of 256 appears to work best.
Any insight into the factors affecting performance between method1 and method2, or into tools available for Linux x86_64, would be great.
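For reference, the arithmetic behind those launch configurations can be sketched in plain C. This assumes that "global count" means the global work size and "item count" means the work-group (local) size, which the post does not state explicitly; work_group_count is a hypothetical helper, not part of any OpenCL API. The divisibility check reflects the OpenCL 1.x requirement that the global size be an exact multiple of the local size.

```c
#include <stddef.h>

/* Hypothetical helper: number of work-groups launched for a given
 * global work size and work-group (local) size.
 * Returns 0 when the local size is zero or does not evenly divide
 * the global size, which OpenCL 1.x would reject at enqueue time. */
size_t work_group_count(size_t global, size_t local) {
    if (local == 0 || global % local != 0) {
        return 0;
    }
    return global / local;
}
```

Under these assumptions, the best-performing configuration (10240 global, 256 per group) corresponds to 40 work-groups, while 1280 global with 64 per group gives 20.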
First, to be accurate, there is still the loop conditional in method2. Also, I tried putting an arbitrary conditional in method2, and the timing went back to normal. Conversely, using the ternary operator to set locX and locY in method1 degraded its performance by 6X.
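Sometimes better performance can be achieved with
    int a=(flag)?1:0;
    int b=(flag)?0:1;
    /* pointers cannot be multiplied directly, so blend the addresses as integers */
    locX=(__global int *)(a*(uintptr_t)A+b*(uintptr_t)B);
    locY=(__global int *)(b*(uintptr_t)A+a*(uintptr_t)B);
or
    locX=(flag)?A:B;
    locY=(flag)?B:A;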
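The two selection forms above can be compared side by side in plain host C. This is only a sketch: select_arith and select_ternary are hypothetical names, and because C does not allow a pointer to be multiplied, the arithmetic form goes through uintptr_t. Whether either form is faster on a given GPU compiler is something to measure, not something this example shows.

```c
#include <stdint.h>

/* Arithmetic form: blend the two addresses as integers.
 * Exactly one of a and b is 1, so the sum equals one of the
 * original addresses and converts back to a valid pointer. */
int *select_arith(int flag, int *A, int *B) {
    uintptr_t a = flag ? 1u : 0u;
    uintptr_t b = flag ? 0u : 1u;
    return (int *)(a * (uintptr_t)A + b * (uintptr_t)B);
}

/* Ternary form: let the compiler pick the select strategy. */
int *select_ternary(int flag, int *A, int *B) {
    return flag ? A : B;
}
```

Both functions return A when flag is nonzero and B otherwise, so locY is obtained by calling either one with the pointer arguments swapped.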
So, to summarize: method1 takes 24 seconds to run in my test harness, and method2 takes 173 seconds. When I use the CPU instead of the GPU, both methods take 10 seconds.
tractus,
It is really hard to say anything without your code.
Please send in your code or a test case.
Hi Himanshu,
I found something that made both forms of my method run at the same speed. I had 3 arrays that were not int: 2 were unsigned char and 1 was unsigned short. I changed all of them to int, and both methods now run in 16 sec in my test harness.
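A minimal host-side sketch of that kind of change, assuming the narrow arrays are widened to int on the host before being copied to the device. The function names are hypothetical and this is not the code from my harness. Note that with a 4-byte int this quadruples the size of the unsigned char data and doubles the size of the unsigned short data.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: copy n unsigned char values into a newly
 * allocated int array. Returns NULL if allocation fails; the caller
 * owns the result and must free() it. */
int *widen_u8_to_int(const unsigned char *src, size_t n) {
    int *dst = malloc(n * sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i];  /* value-preserving integer promotion */
    }
    return dst;
}

/* Same idea for unsigned short input. */
int *widen_u16_to_int(const unsigned short *src, size_t n) {
    int *dst = malloc(n * sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
    return dst;
}
```

The kernel arguments for those arrays then have to be declared as int pointers as well, so that the device reads whole ints instead of 8-bit or 16-bit elements.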
Hi tractus,
It's good to hear that you debugged it yourself.
All the best