I am currently comparing App SDK/OpenCL against OpenMP. The app I'm studying spends most of its time computing exps and pows. With icpc (the Intel C++ compiler) and OpenMP, all of these calls are translated into their vectorized SVML equivalents. But judging by the benchmark results (the OpenCL version takes roughly 2x as long), these functions are not translated into vectorized versions by the OpenCL compiler. Is there any way to achieve this? My experiments have been on Intel processors so far; would it make a difference if I tried on AMD CPUs?
Note that I'm using the App SDK's vector features for the code in general, and I've seen that, e.g., an addition example achieves a 2x speedup. However, that doesn't help much when maybe 40-50% of the time is spent computing exps and pows.
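For reference, a sketch of the kind of kernel I have in mind (names hypothetical). OpenCL C does overload `exp` and `pow` for vector types such as `float4`, so the data parallelism is at least expressed in the source; the open question is whether the CPU compiler maps these builtins onto vectorized math routines the way icpc maps them onto SVML:

```c
/* Hypothetical OpenCL kernel. native_exp trades precision for speed,
   and building with -cl-fast-relaxed-math may permit further fast-math
   substitutions; neither guarantees an SVML-style vectorized mapping. */
__kernel void exp_pow(__global const float4 *x,
                      __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = native_exp(x[i]) * pow(x[i], (float4)(2.0f));
}
```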
If this isn't possible, I think it's a huge limitation of the App SDK. One could argue for the portability benefit, but you often need to write different kernels for different devices anyway, and in that case I could just cook up some OpenMP code that does the same thing under some abstraction.
Yngve Sneen Lindal