The textbook definition of matrix-matrix multiplication that you
are using is poorly suited to both CPU and GPU calculations
unless the matrices are small, because it reuses cached data badly.
Do a web search for the terms: tiled matrix multiplication.
Implementing a tiled matrix-matrix multiplication on GPUs for
matrices of sizes other than powers of 2 is complicated,
but the gains in performance are impressive.
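To make the idea of tiling concrete, here is a rough CPU-side sketch in plain Python. It is not an OpenCL kernel and not tuned code, just an illustration of the loop structure. The names `matmul_naive`, `matmul_tiled` and the `tile` parameter are made up for this example. On a GPU, each tile would typically be staged in local memory by one work-group. The `min(...)` bounds are what handle sizes that are not a multiple of the tile size.

```python
def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=16):
    """Same product, computed tile by tile (tile size is an assumption)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles of C and over tiles along the shared dimension.
    for i0 in range(0, n, tile):
        i_end = min(i0 + tile, n)          # clamp the last, partial tile
        for j0 in range(0, m, tile):
            j_end = min(j0 + tile, m)
            for p0 in range(0, k, tile):
                p_end = min(p0 + tile, k)
                # Inner loops stay inside one tile, so the same small blocks
                # of A and B are reused many times before moving on.
                for i in range(i0, i_end):
                    row_a, row_c = a[i], c[i]
                    for p in range(p0, p_end):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, j_end):
                            row_c[j] += a_ip * row_b[j]
    return c
```

Calling `matmul_tiled(a, b, tile=4)` on a 5x7 times 7x3 product gives the same result as `matmul_naive(a, b)`, even though neither dimension is a multiple of the tile size.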
Hello, thanks for answering.
My question is about the difference in runtime depending on the NDRange.
Why are my timings so different? I get 85 seconds using "NDRange localThreads (256, 1)" and 3 seconds using "NDRange localThreads (16, 16)".