
Questions about performance

Question asked by qqchose on Mar 25, 2013
Latest reply on Mar 26, 2013 by realhet

Hi

 

I have some questions about optimization in OpenCL. I have a kernel; what it computes is not relevant to my questions, so take this small example:

 

kernel void myKernel(...)
{
    [CODE]

    output[i] = result;
}

 

 

My kernel needs 89.032 ms to complete, and I need to call it many times. I thought it would be faster to loop X times inside the kernel than to call the kernel X times, so I tried this:

 

kernel void myKernel(...)
{
    for(int i = 0; i < param.m_nbSample; ++i)
    {
        [CODE]

        output[i] += result;
    }
}

 

I set param.m_nbSample to 1 to make sure everything still works. It does, but the kernel now needs 224.633 ms to complete, about 2.5 times as long as before (152% slower). Not what I expected. I then changed m_nbSample to see how the time varies:

If m_nbSample equals 2, each sample (loop iteration) needs 224.719 ms

If m_nbSample equals 4, each sample (loop iteration) needs 244.885 ms

If m_nbSample equals 50, each sample (loop iteration) needs 220.442 ms
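To put the loop timings above in perspective, here is a small C sketch of the arithmetic. The function names (slowdown_factor, percent_slower, print_loop_timings) are made up for illustration; the numbers are the measurements quoted above, and nothing here touches OpenCL itself.

```c
#include <assert.h>
#include <stdio.h>

/* How many times longer one sample takes than the baseline kernel. */
double slowdown_factor(double per_sample_ms, double baseline_ms)
{
    assert(baseline_ms > 0.0);
    return per_sample_ms / baseline_ms;
}

/* Extra time relative to the baseline, as a percentage ("X% slower"). */
double percent_slower(double per_sample_ms, double baseline_ms)
{
    assert(baseline_ms > 0.0);
    return (per_sample_ms - baseline_ms) / baseline_ms * 100.0;
}

/* Print the per-sample loop timings next to the 89.032 ms baseline. */
void print_loop_timings(void)
{
    const double baseline_ms = 89.032;  /* kernel without the loop */
    const int nb_sample[] = {1, 2, 4, 50};
    const double per_sample_ms[] = {224.633, 224.719, 244.885, 220.442};
    const int count = (int)(sizeof nb_sample / sizeof nb_sample[0]);

    for (int k = 0; k < count; ++k) {
        printf("m_nbSample=%2d: %.2fx the baseline (%.0f%% slower)\n",
               nb_sample[k],
               slowdown_factor(per_sample_ms[k], baseline_ms),
               percent_slower(per_sample_ms[k], baseline_ms));
    }
}
```

For m_nbSample = 1 this gives a factor of about 2.52, that is, roughly 152% slower than the kernel without the loop. The per-sample cost stays in the same range whatever the loop count.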

I also tried hardcoding the loop count and using unroll:

 

kernel void myKernel(...)
{
    #pragma unroll 1
    for(int i = 0; i < 8; ++i)
    {
        [CODE]

        output[i] += result;
    }
}

 

Each sample (loop iteration) needs 243.236 ms.

Is it normal to see such a huge difference? Calling my kernel X times is a lot faster than looping X times inside it. What can explain this?

I tried something else.

I moved my entire kernel into a function, like this:

 

void myKernelFunction(...)
{
    [CODE]

    output[i] += result;
}

 

kernel void myKernel(...)
{
    myKernelFunction(...);
}

 

Now my kernel needs 93.7073 ms. I thought all functions were inlined into the kernel, so I don't understand why I already lose about 4.7 ms just by adding a function. I tested it and the results were correct. With that in place, I could test what happens if I call myKernelFunction X times. I tried 2 times:

 

kernel void myKernel(...)
{
    myKernelFunction(...);
    myKernelFunction(...);
}

 

The result: 972.904 ms per sample (so 1945.81 ms for the entire kernel). That is about 10 times slower, and worse than the earlier version with the "for" loop. What can cause this? What is the best approach if I want to run my kernel X times?

 

Thanks
