I just "improved" my OpenCL program by removing a tiny function that was a leftover from its C++ CPU predecessor. The function was called in the innermost loop.
The kernel took 87.89 ms with the tiny function and 35.55 ms without it (performing the same operations directly in the loop).
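To make the change concrete, here is a minimal sketch in plain C of the two variants being compared. The actual kernel code is not shown here, so the names `scale_add`, `sum_with_helper`, and `sum_manual_inline` and the arithmetic itself are hypothetical stand-ins. In the real program the loop would sit inside an OpenCL kernel rather than an ordinary C function.

```c
#include <stddef.h>

/* Hypothetical stand-in for the "tiny function"; the real one is not shown. */
static inline float scale_add(float x, float a, float b)
{
    return a * x + b;
}

/* Variant 1: the inner loop calls the helper function. */
float sum_with_helper(const float *data, size_t n, float a, float b)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += scale_add(data[i], a, b);
    return acc;
}

/* Variant 2: the helper's body is written out directly in the loop. */
float sum_manual_inline(const float *data, size_t n, float a, float b)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a * data[i] + b;
    return acc;
}
```

Both variants compute the same result, so if the helper really is inlined, one would expect the generated code (and the timing) to be roughly the same.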
I was told that all OpenCL functions are inlined, which would explain why OpenCL does not allow recursion. Inlined functions should not cause much overhead.
What does OpenCL really do when a kernel calls a function? Should functions be avoided as much as possible?
Any suggestions will be appreciated.