I plan to write some code in OpenCL using the überKernel pattern.
This means that a given kernel would have the following structure:
__kernel void my_uber_kernel(void)
stage = stage + 1 ;
Each one of device_function_X() potentially contains a substantial amount of code.
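To make the structure concrete, here is a minimal sketch of the stage loop I have in mind, written as plain C so it can be compiled on the host. Everything beyond the names `my_uber_kernel`, `stage` and `device_function_X` is a hypothetical placeholder: the stage count, the function bodies, the `int` payload and the `my_uber_kernel_item` helper are assumptions for illustration only. In real OpenCL C the entry point would be declared `__kernel` and would read its data from `__global` buffers.

```c
#include <assert.h>

/* Hypothetical stage count, chosen only for this sketch. */
enum { NUM_STAGES = 3 };

/* Hypothetical stand-ins for device_function_X(); in the real code each
   one would contain a substantial amount of work. */
static int device_function_0(int x) { return x + 1; }
static int device_function_1(int x) { return x * 2; }
static int device_function_2(int x) { return x - 3; }

/* Work done by a single work-item: walk through every stage in order,
   dispatching on the current stage number. */
static int my_uber_kernel_item(int value)
{
    int stage = 0;
    while (stage < NUM_STAGES) {
        switch (stage) {
        case 0: value = device_function_0(value); break;
        case 1: value = device_function_1(value); break;
        case 2: value = device_function_2(value); break;
        default: assert(0 && "unknown stage"); break;
        }
        stage = stage + 1;
    }
    return value;
}
```

The point of the sketch is that all stages live in one function body, so the compiler sees (and may inline) the code of every device_function_X() at once, which is what prompts the questions below.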
I'm wondering whether there are known limits on the number of instructions supported (per thread?) before performance is impacted.
Does splitting the processing into small device function calls help with optimization?
Or do I have to split the processing into several kernel calls (so that the above-mentioned device_function_X functions become kernels)?