Hello everyone,

I have an AMD Radeon R7 M260 GPU with 6 OpenCL compute units, 3 work-item dimensions, and a maximum of 256 work-items per dimension.

I created an OpenCL application for matrix multiplication, and I am testing the performance of multiplications of higher-order matrices (1000x1000, 2000x2000, 3000x3000, etc.). Some of the results raised a question that confused me.

For example, when I submit a multiplication A (2000x2000) x B (2000x2000) = C (2000x2000) and use cl::NDRange localThreads(256, 1), execution finishes in ~85 seconds; compared with the traditional sequential version running on the CPU, the gain was very low. But when I changed it to cl::NDRange localThreads(16, 16), the execution time went from ~85 seconds to ~3 seconds, which is the kind of speedup I was expecting from parallel computation.

My question is: what actually happens at run time that changes the execution time so much between the two configurations? Could someone explain this to me?

Here is my kernel:

__kernel void matvec_mult(__global float* matrixA,
                          __global float* matrixB,
                          __global float* matrizResult,
                          int size)
{
    int i = (int)get_global_id(0);
    int j = (int)get_global_id(1);
    float sum = 0.0f;                    /* accumulate in a register instead of
                                            read-modify-writing global memory,
                                            which also assumed the output buffer
                                            was zero-initialized */
    for (int k = 0; k < size; k++) {
        sum += matrixA[i*size + k] * matrixB[k*size + j];
    }
    matrizResult[i*size + j] = sum;      /* single global write per work-item */
}
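For reference, this is a plain-C sketch of the algorithm the kernel parallelizes: each OpenCL work-item (i, j) computes one dot product of row i of A with column j of B. It assumes row-major square matrices, matching the kernel's indexing.

```c
#include <stddef.h>

/* Naive O(n^3) multiply: C = A * B, row-major n x n matrices.
 * One (i, j) iteration of the outer loops corresponds to one
 * OpenCL work-item in the kernel above. */
void matmul_naive(const float *A, const float *B, float *C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```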

Thanks

The textbook definition of matrix-matrix multiplication that you are using is most unsuitable for both CPU and GPU calculations unless the matrices are small.

Do a web search with the terms: tiled matrix multiplication.

Implementing a tiled matrix-matrix multiplication on GPUs for matrices of sizes other than powers of 2 is complicated, but the gains in performance are impressive.
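To illustrate the idea (on the CPU, not the GPU implementation itself), here is a sketch of tiled/blocked multiplication: the matrices are processed in TILE x TILE blocks so that each block stays cache-resident and is reused many times before eviction. On a GPU the same idea is implemented with __local memory and a barrier between tile loads. The TILE value of 16 is just an illustrative choice matching the 16x16 work-group in the question, and the min() bounds handle sizes that are not multiples of the tile, which is the complication mentioned above.

```c
#include <string.h>
#include <stddef.h>

#define TILE 16  /* illustrative block size; on a GPU this would match
                    the work-group / local-memory tile dimensions */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiled (blocked) multiply: C = A * B, row-major n x n matrices.
 * n need not be a multiple of TILE thanks to the min() bounds. */
void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    memset(C, 0, (size_t)n * (size_t)n * sizeof(float));
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* multiply one TILE x TILE block; the operands of this
                   block remain in cache across the inner loops */
                for (int i = ii; i < min_int(ii + TILE, n); i++)
                    for (int k = kk; k < min_int(kk + TILE, n); k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < min_int(jj + TILE, n); j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The loop order (i, k, j innermost) also makes the innermost accesses to B and C contiguous in memory, which is the same locality concern that makes a 16x16 work-group outperform a 256x1 one on the GPU.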
