Cache blocking, maybe? Why not try taking the OpenCL kernel and putting it in a quadruply nested loop (a double loop over work groups, then a double loop over work items within a group), so that it executes roughly the way OpenCL would execute it. Then put OpenMP directives on top of that to run the work groups in parallel.
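A rough sketch of what I mean in C -- the sizes and the vector-add "kernel body" are just made-up illustrations, not anything from a real codebase:

```c
#include <stddef.h>

/* Illustrative NDRange sizes: 16x16 groups of 8x8 work items. */
#define GROUPS_X 16
#define GROUPS_Y 16
#define LOCAL_X  8
#define LOCAL_Y  8
#define WIDTH   (GROUPS_X * LOCAL_X)

/* Emulate an NDRange kernel launch: the outer double loop walks work
   groups (parallelised by OpenMP), the inner double loop walks work
   items within a group. */
void run_kernel(const float *a, const float *b, float *c) {
    #pragma omp parallel for collapse(2)
    for (int gy = 0; gy < GROUPS_Y; ++gy)
        for (int gx = 0; gx < GROUPS_X; ++gx)
            for (int ly = 0; ly < LOCAL_Y; ++ly)
                for (int lx = 0; lx < LOCAL_X; ++lx) {
                    /* Recover the global id, as get_global_id() would. */
                    int x = gx * LOCAL_X + lx;
                    int y = gy * LOCAL_Y + ly;
                    size_t i = (size_t)y * WIDTH + x;
                    c[i] = a[i] + b[i];   /* the "kernel body" */
                }
}
```

Compile with `-fopenmp` to get the parallel groups; without it the pragma is simply ignored and the code runs serially, which is handy for debugging.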
I would imagine you'd see the same improvement you see with OpenCL, give or take. What usually happens is that OpenCL, being quite low level, forces you to change your code structure in a way that tends to lead to better performance.
I have a few points of confusion, so let me ask some general questions:
Using local/private memory is said to give a performance boost on GPUs; does it also improve performance on CPUs? I ask because I've heard CPUs have no dedicated OpenCL local memory the way GPUs do.
And is grouping work items, another common OpenCL optimization, also GPU-specific, or does it improve performance on CPUs as well?
Thirdly, if I haven't used OpenCL vector types, will AMD's OpenCL do some auto-vectorization/SIMD for me, or is SIMD utilization only possible when I explicitly use OpenCL vectors? And is this CPU-specific, or does it help on GPUs too?
Thanks in advance
Cache locality can often be beneficial on CPUs. By writing your code to use local or private memory, you might be restructuring it in a cache-friendly way. It is no better than using a special region of global memory and performing the same operations there, but the need to copy in and out of that space can improve cache reuse.
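To make that concrete, here is a sketch of the CPU analogue of staging tiles in `__local` memory: a matrix transpose walked in cache-sized blocks. The tile size and function name are just illustrative choices:

```c
#include <stddef.h>

#define TILE 32  /* illustrative; tune to your cache */

/* Transpose an n x n matrix in TILE x TILE blocks, so both the reads
   from `in` and the writes to `out` stay within cache-sized regions --
   the same restructuring that OpenCL __local staging forces on you. */
void transpose_blocked(const float *in, float *out, size_t n) {
    for (size_t bi = 0; bi < n; bi += TILE)
        for (size_t bj = 0; bj < n; bj += TILE)
            for (size_t i = bi; i < bi + TILE && i < n; ++i)
                for (size_t j = bj; j < bj + TILE && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```

The work done is identical to a naive two-loop transpose; only the traversal order changes, which is exactly the "same operations, better cache reuse" point above.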
Work groups are pointless on the CPU, except insofar as you're better off doing a lot of work in one CPU thread than a tiny amount. You can replace the work group with a loop inside a single work item. Just don't have a single work item that does a tiny amount of work on its own: small amounts of work should not be mapped to threads, so either use a work group or put lots of work in each work item.
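Here is what I mean by "replace the work group with a loop in a single work item", sketched in plain C. The chunking scheme and names are illustrative, not OpenCL API:

```c
#include <stddef.h>

/* Work-item coarsening: instead of one tiny work item per element,
   each "work item" loops over a contiguous chunk, so the per-thread
   (or per-item) overhead is amortised over real work. */
void scale_chunked(const float *in, float *out, size_t n,
                   size_t item_id, size_t num_items) {
    size_t chunk = (n + num_items - 1) / num_items;  /* ceil(n / items) */
    size_t begin = item_id * chunk;
    size_t end   = begin + chunk < n ? begin + chunk : n;
    for (size_t i = begin; i < end; ++i)
        out[i] = in[i] * 2.0f;   /* stand-in for the real kernel body */
}
```

You would then launch `num_items` work items (one per core, roughly) instead of `n` of them.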
No, there is no auto-vectorisation: use vector types to get CPU vector ops. The GPU shader compilers do vectorise, but using OpenCL vectors can still give benefits there by increasing the amount of instruction-level parallelism in a work item, allowing the compiler to fill VLIW packets more effectively and to make wider memory transactions that use the memory system more efficiently.
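To illustrate what writing with vector types buys you, here is what an OpenCL `float4` add looks like when emulated in plain C (the struct is a stand-in, not the real OpenCL type): the four lanes are independent operations the compiler can map to one SSE instruction, and the four scalar loads become one wider transaction.

```c
#include <stddef.h>

/* Stand-in for OpenCL's float4. */
typedef struct { float s[4]; } float4;

/* One "work item" now processes four lanes at once; the inner loop
   over lanes has no cross-lane dependencies, which is exactly the
   shape a vectorising compiler (or an SSE intrinsic) wants. */
void add_float4(const float4 *a, const float4 *b, float4 *c, size_t n4) {
    for (size_t i = 0; i < n4; ++i)
        for (int lane = 0; lane < 4; ++lane)
            c[i].s[lane] = a[i].s[lane] + b[i].s[lane];
}
```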
In my case, OpenCL performance on the CPU does not even come near that of highly optimized SSE2 code running on a single CPU core.