This is a known issue. Large OpenCL programs on the GPU can cause an exponential increase in compile time. The only known workaround is to use smaller kernels.
What is the recommended workaround? How do we minimize compile time?
We need to understand what guidelines we can follow when we have lengthy pieces of code that need to be executed.
Are there compile-time trade-offs between many short subroutines versus fewer, longer subroutines?
Should we store local variables in a structure and pass that structure to subroutines as a single argument, or pass the variables directly as multiple subroutine parameters? Which compiles faster?
On the GPU, function calls are not supported, so everything gets inlined, and that is what causes the problem. The issue isn't how the code is written, but the fact that once everything is inlined, the program itself can be extremely large. While our compiler pushes inlining back as far as possible, there are still cases that cause an exponential increase in compile time, which is what you are seeing. Usually the increase comes from the compiler using up all available memory and swapping to the hard drive.
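To make the inlining point concrete, here is a toy model in Python of how program size grows once every call is replaced by a copy of the callee. It is only a sketch under assumed numbers: the call graph, the statement counts, and the helper names `inlined_size` and `build_chain` are all hypothetical, and it says nothing about how the actual compiler works internally.

```python
# Toy model: size of a function after every call in it has been inlined.
# The call graph and statement counts are made up for illustration only.

def inlined_size(name, functions, cache=None):
    """Return the statement count of `name` once all callees are inlined."""
    if cache is None:
        cache = {}
    if name not in cache:
        own_size, call_sites = functions[name]
        # Each call site receives its own full copy of the callee's body.
        cache[name] = own_size + sum(
            inlined_size(callee, functions, cache) for callee in call_sites
        )
    return cache[name]


def build_chain(depth, calls_per_function, own_size=10):
    """Hypothetical program: f0 calls f1 several times, f1 calls f2, and so on."""
    functions = {}
    for level in range(depth):
        callees = [f"f{level + 1}"] * calls_per_function
        functions[f"f{level}"] = (own_size, callees)
    functions[f"f{depth}"] = (own_size, [])  # leaf function, makes no calls
    return functions


if __name__ == "__main__":
    for depth in (2, 4, 8):
        program = build_chain(depth, calls_per_function=3)
        written = sum(size for size, _ in program.values())
        print(depth, written, inlined_size("f0", program))
```

In this model, with three call sites per function, 30 written statements become 130 after inlining at depth 2, and 90 written statements become 98410 at depth 8. The source looks small, but the inlined kernel the compiler has to process does not, which is why splitting the work into smaller kernels helps.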