I just "improved" my OpenCL program by removing a tiny function, a leftover from its C++ CPU predecessor. The function was called in the innermost loop.
The kernel took 87.89 ms with the tiny function and 35.55 ms without it (performing the operations directly).
I was told that all OpenCL functions are inlined, which would explain why OpenCL does not allow recursion. Inlined functions should not cause much overhead.
What does OpenCL really do when a function is called in a kernel? Should functions be avoided as much as possible?
Any suggestions will be appreciated.
The strange thing happened on Thursday night. I replaced a tiny function with direct operations, and the running time was shortened dramatically.
I spent time on Friday trying to reproduce the result, but had no luck. Nor could I use the same technique to speed up other parts of my program.
So, please forget it. OpenCL is working as it is supposed to.
Thank you for replying, and have a good weekend!
I am sorry to say that the experiment is not repeatable, as I explained above. Please forget it.
Thank you for your kind reply and have a good weekend!
I noticed that some of my functions were slower than the same code written inline when I forgot to mark the input-only parameters as "const". Since I added that, the speed is the same.
Yes, my changes included declaring the pointer used for returning results as
uint4 * const res
For other pointers I added "restrict", like
uint * restrict base
But I did not test each change for performance, so I cannot tell if that made a difference.
I think that if you know how the parameters are used (and you should 😉 ), then giving these hints to the compiler will never hurt. At a minimum it makes life easier for the optimizer, and at best it allows optimizations that would not be done otherwise.
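To make the qualifiers discussed above concrete, here is a minimal sketch in plain C (OpenCL C is based on C99, so they read the same way there). The helper add_offset and its parameter names are hypothetical and not taken from anyone's kernel in this thread. One thing worth noting: "uint4 * const res" makes the pointer itself constant, whereas an input-only buffer is normally declared as a pointer to const data, as "base" is below. Whether any of this changes the timing depends on the compiler, so treat it as something to measure rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 *
 * const uint32_t *restrict base : input-only data; the function never writes
 *                                 through base, and base does not overlap res.
 * uint32_t *restrict const res  : output buffer; the pointer itself is never
 *                                 reassigned inside the function.
 * offset and n are passed by value and marked const for the same reason. */
static inline void add_offset(const uint32_t *restrict base,
                              uint32_t *restrict const res,
                              const uint32_t offset,
                              const size_t n)
{
    /* restrict is a promise made by the caller; this only catches the most
     * obvious violation (passing the same buffer for input and output). */
    assert((const uint32_t *)res != base);

    for (size_t i = 0; i < n; ++i) {
        res[i] = base[i] + offset;
    }
}
```

Calling add_offset(in, out, 10, 4) with two separate four-element arrays writes each input element plus 10 into out. Passing overlapping buffers would break the restrict promise and is undefined behaviour, so the qualifier should only be added when you really know how the pointers are used.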