Yes, I don't think AMD is as strong here, and NVIDIA has been adding more functionality to its hardware to make DNN operations faster. Sadly, I don't think we'll ever see such optimizations on AMD hardware. NVIDIA dominates the field with researchers, so AMD hardware sees very little use for those applications, or at least there are no serious public libraries or projects built on OpenCL. This is made worse when you compare what you can do with the CUDA language against what you get with OpenCL 1.x (which is what most stuff supports, although I'd bet it will be 2.0 for all GPUs within a year). I really wish there were more extensions so that some of the features from CUDA could be brought into OpenCL, preferably on both AMD and NVIDIA. It's a double-edged sword, but for the right algorithms it's worth it.
You are right that OpenCL will give you more hardware that you can run on. There's one hardware architecture you missed, too, that is a critical target for some people: FPGAs. Keep in mind that your kernel sources will sometimes have to change for them, but the language is still OpenCL 1.0, so there is often good reuse.
I absolutely agree on the benefits of OpenCL acceleration in these frameworks. I can't talk about where we're headed, but I can say this is a bleeding-edge area that we are really interested in. In the community...
There is one I found for OpenCL in Torch here
And for Caffe here:
It looks like the Caffe work is just a request to check in basic OpenCL support, but work is underway in the community.
Hope this helps.