akhal
Journeyman III

OpenCL Performance on CPUs

Hello

I have implemented a few ordinary loop-based applications in OpenMP, TBB and OpenCL. In all of them, OpenCL gives far better performance than the others, even when I run it only on the CPU with no specific optimizations in the kernels. OpenMP and TBB perform well too, but far worse than OpenCL. What could be the reason? Both of them are CPU-specialized frameworks, so I would expect OpenCL to perform at best equal to OpenMP/TBB, or worse, since it is more GPU-oriented.

My second concern is about OpenMP versus TBB. In my implementations, which I have not tuned carefully since I am not an expert, OpenMP always performs better than TBB. Is there a reason OpenMP normally outperforms TBB? I would think both of them, and even OpenCL, use the same kind of thread pooling at a low level.... Any expert opinions? Thanks

0 Likes
10 Replies
rick_weber
Adept II

OpenCL has optimizations turned on by default. To disable them, you have to pass -cl-no-optimizations to the compiler (or something to that effect). If you don't have optimizations turned on in your TBB and OpenMP tests, then you're comparing optimized OpenCL code to unoptimized OpenMP code. That would account for the discrepancy.

0 Likes

Hi Akhal,

I agree with Rick.Weber as far as OpenCL is concerned. The performance of TBB and/or OpenMP depends entirely on your implementation. You can't generalize that OpenMP always outperforms TBB. Load-balancing overhead in TBB may be one of the reasons for this. Selecting a proper chunk size also affects performance.

It would be easier for anyone to analyze this if you could post your code snippets for both cases.

0 Likes

Thanks Mr nareshsankapelly

Yea, you are right about chunk-size tuning in OpenMP and TBB, but my code doesn't specify a chunk size in either case. It uses "schedule(static)" in OpenMP, and in TBB it leaves the chunk-size choice to TBB itself by not specifying a size and using auto_partitioner. In this setup all my OpenMP implementations outperform TBB. Does that mean the runtime scheduler of OpenMP is better?

0 Likes

You have to pass the "-cl-no-optimizations" flag to the clBuildProgram function.

0 Likes

I searched the OpenCL specification for clBuildProgram. Its 4th argument takes the build options, including the "-cl-opt-disable" flag that turns off all optimizations. But when I use this flag, I get an "undefined -cl-opt-disable" error. Does this mean my OpenCL SDK doesn't support it yet? I am using the latest AMD SDK...

0 Likes

Originally posted by: akhal I searched the OpenCL specification for clBuildProgram. Its 4th argument takes the build options, including the "-cl-opt-disable" flag that turns off all optimizations. But when I use this flag, I get an "undefined -cl-opt-disable" error. Does this mean my OpenCL SDK doesn't support it yet? I am using the latest AMD SDK...

I am able to use this flag without any problem. Could you please give us the following information:

OS, CPU, GPU, SDK version and Driver version.

0 Likes

Originally posted by: akhal I searched the OpenCL specification for clBuildProgram. Its 4th argument takes the build options, including the "-cl-opt-disable" flag that turns off all optimizations. But when I use this flag, I get an "undefined -cl-opt-disable" error. Does this mean my OpenCL SDK doesn't support it yet? I am using the latest AMD SDK...

AFAIK, it should work with the latest SDK. I tried the same flag at my end, and it works fine.

0 Likes

Thanks, it works for me now. I had mistakenly written the flag directly as the argument, whereas it should be passed as a const C string. 🙂
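
For anyone who hits the same error, here is a minimal sketch of what "wrapped in a const C string" looks like in practice. The OpenCL specification names the option -cl-opt-disable, and clBuildProgram takes all build options as one space-separated string in its fourth argument. As far as I know, kernels given as source are compiled at run time by that call, so host compiler flags such as "icc -O0" should not affect them. The helper name join_build_options and the example define "-D N=1024" are my own assumptions for illustration, not part of OpenCL or of the code discussed in this thread.

```cpp
#include <string>
#include <vector>

// Joins individual build flags into the single space-separated string that
// clBuildProgram expects as its fourth (`options`) argument.
// NOTE: join_build_options is a hypothetical helper, not an OpenCL function.
std::string join_build_options(const std::vector<std::string>& flags) {
    std::string options;
    for (const std::string& flag : flags) {
        if (!options.empty()) {
            options += ' ';  // separate consecutive flags with one space
        }
        options += flag;
    }
    return options;
}

// Usage with an existing cl_program and cl_device_id (needs <CL/cl.h> and
// an OpenCL runtime, so it is shown as a comment only):
//
//   std::string options = join_build_options({"-cl-opt-disable"});
//   cl_int err = clBuildProgram(program, 1, &device, options.c_str(),
//                               nullptr, nullptr);
//   // err == CL_SUCCESS when the build succeeded.
```

Passing an empty string (or NULL) as the options argument leaves the default optimizations on, which is the case the first reply describes.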

0 Likes

Originally posted by: akhal Thanks Mr nareshsankapelly

Yea, you are right about chunk-size tuning in OpenMP and TBB, but my code doesn't specify a chunk size in either case. It uses "schedule(static)" in OpenMP, and in TBB it leaves the chunk-size choice to TBB itself by not specifying a size and using auto_partitioner. In this setup all my OpenMP implementations outperform TBB. Does that mean the runtime scheduler of OpenMP is better?

AFAIK, schedule(static) assigns chunks to threads statically, whereas TBB still does load balancing even with auto_partitioner.
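
To make the difference concrete, here is a small sketch of the two scheduling styles using OpenMP alone. The first loop uses schedule(static) as in the original question. The second uses schedule(dynamic, chunk), which I am using only as a rough OpenMP stand-in for run-time load balancing; it is not how TBB's auto_partitioner works internally. The function names, the summation workload and the chunk parameter are assumptions for illustration, not the code from this thread.

```cpp
#include <vector>

// Sums `data` with schedule(static): with no chunk size given, the iteration
// space is split up front into near-equal blocks, at most one per thread.
long long sum_static(const std::vector<int>& data) {
    long long total = 0;
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Same loop with schedule(dynamic, chunk): threads request the next block of
// `chunk` iterations at run time, which balances uneven work but adds
// scheduling overhead per block. `chunk` must be positive.
long long sum_dynamic(const std::vector<int>& data, int chunk) {
    long long total = 0;
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

For a loop like this, where every iteration costs the same, the static version has nothing to balance, so a scheduler that balances at run time can only add overhead. That may be part of why plain schedule(static) comes out ahead in simple loop benchmarks. Without OpenMP enabled at compile time the pragmas are ignored and both functions run serially.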

 

0 Likes

Thanks for the hints, but I actually compiled all my applications with the Intel compiler and passed the -O0 option to turn off its automatic optimizations, as in "icc -O0 -g .....". I thought that was enough to stop the compiler from optimizing on its own. If it is not enough, how do I use "-cl-no-optimizations" while compiling my code?

0 Likes