I compared an OpenMP version of MatrixMultiplication with the OpenCL and naïve implementations provided in the SDK. Here are the benchmark results:
1) CPU original: 1211s
2) OpenCL: 253s
3) CPU cache friendly: 210s
4) CPU cache friendly + multithreading using OpenMP: 139s
The OpenCL implementation (2) is 4.8X faster than the reference implementation (1). Implementation (3) breaks the matrix multiplication down into 8x8 sub-blocks, as in (2), to improve memory locality and reduce cache misses; it is single-threaded and already beats OpenCL by 1.2X. With OpenMP multithreading added (4), on a dual-core machine, OpenCL is 1.8X slower. I was multiplying two 2048x2048 matrices.
Any idea why OpenCL is slower in this example?
I'm also wondering how OpenCL work-items are scheduled on the CPU. Is it guaranteed that a core will complete one work-group before moving on to work-items in another work-group? If not, that might explain some of the degradation.