spyzer_abhishek0
Journeyman III

Drastically changing GPU behaviour on minor change in kernel

Hi all,

Here is a kernel which I am running on an ATI Mobility Radeon HD 4500:

__kernel void ker(__global int *A, __global int *B, int width)
{
    int tid = get_global_id(0);
    int a;
    for (int i = tid; i < width && i < tid + 10000; i++) {
        /* The forum markup swallowed the [i] subscripts in the original post. */
        a = A[i] + tid + B[i];
    }
    B[tid] = 9;
}

The total number of global work items (i.e width) = 128x128x128.

Now this kernel takes about 0.93 seconds of wall time as reported by the time command, BUT as soon as I change the second-last line of code to
    B[tid] = a;
my desktop GUI hangs for a while, and when it resumes the wall time displayed is 14.28 seconds.

What exactly is happening in this code???

Thank You.

7 Replies
dravisher
Journeyman III

The code won't compile in the Stream KernelAnalyzer. Did you forget to index into the A and B pointers? Should it be a = A[ i ]+tid+B[ i ]; or something similar inside the loop, maybe?

Anyway, it is not strange that the execution time goes up when you change from B[tid] = 9 to B[tid] = a. In the first case the variable a is never used, so the for-loop is probably removed by the compiler (it has no observable effect anyway). In the second case the for-loop can't be removed, since a is used.
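
To make the point above concrete, here is a minimal host-side sketch in plain C that emulates a single work item of each variant. The function names work_item_const and work_item_store are hypothetical, the [i] indexing is the corrected form discussed in this thread, and a is initialised to 0 only so the emulation is well defined. It illustrates why a compiler is allowed to drop the loop in the first variant; it does not show what the ATI compiler actually emits.

```c
#include <assert.h>

/* Variant 1: the loop result is never used, so the only observable
   effect is B[tid] = 9. A compiler may delete the loop entirely. */
static void work_item_const(const int *A, int *B, int width, int tid)
{
    assert(tid >= 0 && tid < width);
    int a = 0;
    for (int i = tid; i < width && i < tid + 10000; i++) {
        a = A[i] + tid + B[i];  /* computed but discarded */
    }
    (void)a;
    B[tid] = 9;
}

/* Variant 2: the loop result is stored, so the loads feeding the
   stored value cannot simply be thrown away. */
static void work_item_store(const int *A, int *B, int width, int tid)
{
    assert(tid >= 0 && tid < width);
    int a = 0;
    for (int i = tid; i < width && i < tid + 10000; i++) {
        a = A[i] + tid + B[i];
    }
    B[tid] = a;
}
```

In the first function the output is 9 whatever A and B contain, which is exactly the situation in which dead-code elimination applies.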

 

Edit: Just noticed that your text changed to italics in the middle of the code. You should use the "Attach Code" button instead of pasting it, since tags in [] might be interpreted as something else by the forum system.


Yes sorry the code was,

a = A + tid + B;

Hmm, so it doesn't run the loop at all; that explains the speedy output. I thought it had something to do with the hardware. But in any case, why does running the loop take so much time? Is it because of the memory accesses to A and B inside the loop, or is it the loop itself that takes the time?


I add "" every time and it gets turned into italics....


ohhhooo i mean [ i ]


Yes, all the accesses to global memory are going to be the bottleneck. Also, should it be "a += ..."? With just "a = ...", all iterations except the last one seem to do nothing.
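
As a rough illustration of the "a = ..." versus "a += ..." remark, the plain C sketch below emulates one work item on the host. The helper names loop_assign, last_only and loop_accumulate are made up for this example, and a starts at 0 as an assumption so that the functions are well defined.

```c
#include <assert.h>

/* "a = ..." as posted: every iteration overwrites a. */
static int loop_assign(const int *A, const int *B, int width, int tid)
{
    assert(tid >= 0 && tid < width);
    int a = 0;
    for (int i = tid; i < width && i < tid + 10000; i++)
        a = A[i] + tid + B[i];
    return a;
}

/* The same value, computed from the final loop index alone. */
static int last_only(const int *A, const int *B, int width, int tid)
{
    assert(tid >= 0 && tid < width);
    int end = (width < tid + 10000) ? width : tid + 10000;
    int last = end - 1;
    return A[last] + tid + B[last];
}

/* "a += ...": every iteration contributes to the result. */
static int loop_accumulate(const int *A, const int *B, int width, int tid)
{
    assert(tid >= 0 && tid < width);
    int a = 0;
    for (int i = tid; i < width && i < tid + 10000; i++)
        a += A[i] + tid + B[i];
    return a;
}
```

loop_assign and last_only always agree, which is why only the last iteration matters with plain assignment; loop_accumulate depends on every element the loop touches.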


No, this was just an example kernel doing nothing in particular but exposing the bottleneck you were talking about. So a was kept meaninglessly in here.


Originally posted by: spyzer.abhishek0 No, this was just an example kernel doing nothing in particular but exposing the bottleneck you were talking about. So a was kept meaninglessly in here.

 

In your first example, the loop is optimized out completely, so you are just setting the destination buffer to 9.  When you change the code, you change the execution considerably, because you are now actually executing the loop. The condition i<tid+10000 caps it at 10,000 iterations per thread, and each iteration reads two values from memory, so each thread reads up to about 80 KB. Across 128x128x128 work items that is on the order of 2 x 10^10 iterations and 4 x 10^10 global-memory reads.  I assume that you really want the loop to do work, so I took it that you meant "a += ...".
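
The amount of work implied by the loop bounds can be estimated with a short host-side C sketch. It assumes the loop is executed as written, with two 4-byte int loads per iteration and no caching or compiler hoisting; total_iterations and bytes_read are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of loop trip counts over all work items for
   for (i = tid; i < width && i < tid + cap; i++). */
static uint64_t total_iterations(uint64_t width, uint64_t cap)
{
    assert(cap > 0);
    uint64_t total = 0;
    for (uint64_t tid = 0; tid < width; tid++) {
        uint64_t remaining = width - tid;  /* iterations left before i reaches width */
        total += (remaining < cap) ? remaining : cap;
    }
    return total;
}

/* Bytes loaded from A and B, assuming two int loads per iteration. */
static uint64_t bytes_read(uint64_t width, uint64_t cap)
{
    return total_iterations(width, cap) * 2u * (uint64_t)sizeof(int32_t);
}
```

For width = 128*128*128 = 2,097,152 and cap = 10000, total_iterations returns 20,921,525,000 and bytes_read returns 167,372,200,000, i.e. roughly 167 GB of loads if none of them were served from a cache.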

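While the GPU is busy, it can't be used to redraw your desktop, which is why the GUI hangs. (I assume you are using Linux, because if the kernel really took 14 seconds to execute, Windows would have triggered a TDR, i.e. a Timeout Detection and Recovery reset of the driver.)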

Jeff
