Cache blocking, maybe? Why not try taking the OpenCL kernel and putting it in a quadruply nested loop (a double loop over work groups, then a double loop over work items within a group), so that it executes roughly the way OpenCL would execute it. Then put OpenMP directives on top of that to run the work groups in parallel.
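A rough sketch of what I mean in C -- the sizes and the vector-add "kernel body" are just made-up illustrations, not anything from a real codebase:

```c
#include <stddef.h>

/* Illustrative NDRange sizes: 16x16 groups of 8x8 work items. */
#define GROUPS_X 16
#define GROUPS_Y 16
#define LOCAL_X  8
#define LOCAL_Y  8
#define WIDTH   (GROUPS_X * LOCAL_X)

/* Emulate an NDRange kernel launch: the outer double loop walks work
   groups (parallelised by OpenMP), the inner double loop walks work
   items within a group. */
void run_kernel(const float *a, const float *b, float *c) {
    #pragma omp parallel for collapse(2)
    for (int gy = 0; gy < GROUPS_Y; ++gy)
        for (int gx = 0; gx < GROUPS_X; ++gx)
            for (int ly = 0; ly < LOCAL_Y; ++ly)
                for (int lx = 0; lx < LOCAL_X; ++lx) {
                    /* Recover the global id, as get_global_id() would. */
                    int x = gx * LOCAL_X + lx;
                    int y = gy * LOCAL_Y + ly;
                    size_t i = (size_t)y * WIDTH + x;
                    c[i] = a[i] + b[i];   /* the "kernel body" */
                }
}
```

Compile with `-fopenmp` to get the parallel groups; without it the pragma is simply ignored and the code runs serially, which is handy for debugging.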
I would imagine you'd see the same improvement you see with OpenCL, give or take. What usually happens is that OpenCL, being quite low level, forces you to change your code structure in a way that tends to lead to better performance.
I have a few points of confusion, so let me ask some general questions:
Using local/private memory is said to give a performance boost on GPUs; does it also improve performance on CPUs? I ask because I've heard CPUs have no dedicated OpenCL local memory the way GPUs do.
And is grouping work items, another common OpenCL optimization, also GPU-specific, or does it improve performance on CPUs as well?
Thirdly, if I haven't used OpenCL vector types, will AMD's OpenCL do some auto-vectorization/SIMD for me, or is SIMD utilization only possible when I explicitly use OpenCL vectors? And is this CPU-specific, or does it help on GPUs too?
Thanks in advance
Cache locality can often be beneficial on CPUs. By writing your code to use local or private memory, you might be restructuring it in a cache-friendly way. It is no better than using a special region of global memory and performing the same operations there, but the need to copy in and out of that space can improve cache reuse.
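To make that concrete, here is a sketch of the CPU analogue of staging tiles in `__local` memory: a matrix transpose walked in cache-sized blocks. The tile size and function name are just illustrative choices:

```c
#include <stddef.h>

#define TILE 32  /* illustrative; tune to your cache */

/* Transpose an n x n matrix in TILE x TILE blocks, so both the reads
   from `in` and the writes to `out` stay within cache-sized regions --
   the same restructuring that OpenCL __local staging forces on you. */
void transpose_blocked(const float *in, float *out, size_t n) {
    for (size_t bi = 0; bi < n; bi += TILE)
        for (size_t bj = 0; bj < n; bj += TILE)
            for (size_t i = bi; i < bi + TILE && i < n; ++i)
                for (size_t j = bj; j < bj + TILE && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```

The work done is identical to a naive two-loop transpose; only the traversal order changes, which is exactly the "same operations, better cache reuse" point above.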
Work groups are pointless on the CPU, except insofar as you're better off doing a lot of work in one CPU thread than a tiny amount. You can replace the work group with a loop inside a single work item. Just don't have a single work item that does a tiny amount of work on its own: small amounts of work should not be mapped to threads, so either use a work group or put lots of work in each work item.
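Here is what I mean by "replace the work group with a loop in a single work item", sketched in plain C. The chunking scheme and names are illustrative, not OpenCL API:

```c
#include <stddef.h>

/* Work-item coarsening: instead of one tiny work item per element,
   each "work item" loops over a contiguous chunk, so the per-thread
   (or per-item) overhead is amortised over real work. */
void scale_chunked(const float *in, float *out, size_t n,
                   size_t item_id, size_t num_items) {
    size_t chunk = (n + num_items - 1) / num_items;  /* ceil(n / items) */
    size_t begin = item_id * chunk;
    size_t end   = begin + chunk < n ? begin + chunk : n;
    for (size_t i = begin; i < end; ++i)
        out[i] = in[i] * 2.0f;   /* stand-in for the real kernel body */
}
```

You would then launch `num_items` work items (one per core, roughly) instead of `n` of them.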
No, there is no auto-vectorisation: use vector types to get CPU vector ops. The GPU shader compilers do vectorise, but using OpenCL vectors can still give benefits there by increasing the amount of instruction-level parallelism in a work item, allowing the compiler to fill VLIW packets more effectively and to make wider memory transactions that use the memory system more efficiently.
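To illustrate what writing with vector types buys you, here is what an OpenCL `float4` add looks like when emulated in plain C (the struct is a stand-in, not the real OpenCL type): the four lanes are independent operations the compiler can map to one SSE instruction, and the four scalar loads become one wider transaction.

```c
#include <stddef.h>

/* Stand-in for OpenCL's float4. */
typedef struct { float s[4]; } float4;

/* One "work item" now processes four lanes at once; the inner loop
   over lanes has no cross-lane dependencies, which is exactly the
   shape a vectorising compiler (or an SSE intrinsic) wants. */
void add_float4(const float4 *a, const float4 *b, float4 *c, size_t n4) {
    for (size_t i = 0; i < n4; ++i)
        for (int lane = 0; lane < 4; ++lane)
            c[i].s[lane] = a[i].s[lane] + b[i].s[lane];
}
```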
In my case, OpenCL performance on the CPU does not even come near that of highly optimized SSE2 code running on a single CPU core.