I work for one of the most important CGI FX film companies. We wanted to use OpenCL for some of our tools, but we decided to move to CUDA instead.
I just want to share some of our reasons with you, to help improve OpenCL in the future:
1. We need to use precompiled kernels, for these reasons:
A. We don't want to release our kernel source code.
B. We need to control our kernels across different driver versions in order to keep our register count, occupancy, performance and power consumption stable.
C. We write very optimized routines for very specific hardware ( Tesla S2050 ).
D. Our kernels are very complex and very numerous. Runtime compilation is a mess for us because our apps would take too much time compiling the kernels.
We need a simpler way to precompile kernels. Intel's IOC tool can precompile kernels with a button or from the command line, and NVCC works from the command line too. You provide an extension, but we think you should modify your SKA tool to work the same way as Intel's IOC.
2. Your SKA tool is very buggy: it always outputs N/A, so it's not really useful. Also, the interface could be simpler and more intuitive.
3. We need C++ support. We use complex shaders and routines which are a mess to implement in plain C. You've started to support some C++ features like templates, but I'm afraid we also need virtual functions and polymorphism, new/delete and some STL containers ( vector, list, map, etc... ).
4. NVIDIA provides ( in an official way ) very useful high-level CUDA libraries: Thrust, NPP, cuRAND, CUDPP, etc... They cover reductions, sorting, parallel scans, random number generation, FFT, BLAS, matrix ops, convolution and image processing, etc...
5. We need a visual AND command-line debugger where you can set breakpoints/asserts, inspect and modify variables, trace the stack, etc... CUDA provides Nsight, with very nice VS integration, and also cuda-gdb. Printf-based debugging restricted to CPU devices is not enough...
6. We need better multiGPU support with advanced sync/memory features ( like CUDA's GPUDirect, shared memory space, DMA transfers, etc... )
7. You provide a decent profiler for Visual Studio... but... where are the Linux and Mac OS ones? The CUDA profiler is based on Qt and it's portable.
8. Last but very important: NVIDIA uses a SIMT scalar architecture, which is good for complex GPGPU. You seem to prefer an old-style SIMD/VLIW one, which causes a lot of register flushes for heavily branched and complex code. I hope your GCN improves that in the future.
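To illustrate point 3, here is a minimal host-side C++ sketch of the style of code we would like to be able to write in kernels. All the names ( `Shader`, `Lambert`, `Emissive`, `shade_all` ) are hypothetical and made up for this post; it is not our production code. The only point is that virtual dispatch plus an STL container replaces the hand-written type tags and switch statements that plain C forces on us.

```cpp
#include <memory>
#include <vector>

// Hypothetical shader interface: each material overrides shade().
struct Shader {
    virtual ~Shader() = default;
    virtual float shade(float n_dot_l) const = 0;
};

// Diffuse material: albedo scaled by the cosine term, clamped at zero.
struct Lambert : Shader {
    float albedo;
    explicit Lambert(float a) : albedo(a) {}
    float shade(float n_dot_l) const override {
        return albedo * (n_dot_l > 0.0f ? n_dot_l : 0.0f);
    }
};

// Emissive material: constant output that ignores the lighting term.
struct Emissive : Shader {
    float power;
    explicit Emissive(float p) : power(p) {}
    float shade(float) const override { return power; }
};

// Sums the contribution of every shader in the list through virtual calls,
// without the caller knowing the concrete material types.
float shade_all(const std::vector<std::unique_ptr<Shader>>& shaders,
                float n_dot_l) {
    float total = 0.0f;
    for (const auto& s : shaders) {
        total += s->shade(n_dot_l);
    }
    return total;
}
```

For example, filling the vector with a `Lambert(0.5f)` and an `Emissive(2.0f)` and calling `shade_all(shaders, 0.5f)` gives 0.25 + 2.0 = 2.25. Adding a new material means adding one class, with no changes to `shade_all`.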
And also a recommendation... supporting standards is good in some cases ( OpenGL, OpenCL, etc... ), but when you work for a company with lots of resources ( or you develop for game consoles ) you want optimizations, not compatibility/productivity. What I'm trying to say is that you should keep a close-to-metal GPGPU API alive, because OpenCL is not designed for that.
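As a footnote to point 4, this is roughly what we mean by the basic primitives ( reduction, sort, scan ). The sketch below is written against the C++ standard library on the CPU so it runs anywhere, and the helper names ( `reduce_sum`, `sorted_copy`, `inclusive_scan_copy` ) are made up for this post. Thrust offers STL-like counterparts ( `thrust::reduce`, `thrust::sort`, `thrust::inclusive_scan` ) that do the same work on the GPU, and that is the kind of ready-made building block we miss on the OpenCL side.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical helper names; CPU stand-ins for the GPU library primitives.

// Reduction: sum of all elements (Thrust counterpart: thrust::reduce).
int reduce_sum(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// Sort: returns an ascending copy (Thrust counterpart: thrust::sort).
std::vector<int> sorted_copy(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}

// Inclusive prefix sum: out[i] = v[0] + ... + v[i]
// (Thrust counterpart: thrust::inclusive_scan).
std::vector<int> inclusive_scan_copy(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}
```

For the input `{3, 1, 4, 1, 5}`, the sum is 14, the sorted copy is `{1, 1, 3, 4, 5}` and the inclusive scan is `{3, 4, 8, 9, 14}`.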