Hello everyone,
I have an AMD Radeon R7 M260 GPU with 6 OpenCL compute units, 3 work-item dimensions, and a maximum of 256 work-items per dimension.
I created an OpenCL application for matrix multiplication and am testing its performance on matrices of higher orders
(1000x1000, 2000x2000, 3000x3000, etc.). Looking at the results, a question came up that confused me.
For example, when I submit a multiplication of the matrices A (2000x2000) x B (2000x2000) = C (2000x2000) and use cl::NDRange localThreads(256, 1), execution finishes in ~85 seconds. Compared with the execution time of the same example using the traditional sequential method on the CPU, the gain was very low. I then tried cl::NDRange localThreads(16, 16), and the execution time went from ~85 seconds to ~3 seconds, which is what I was expecting from parallel computation.
My question is: what actually happens at run time that changes the execution time so much? Could someone explain this to me?
Here is my kernel:
__kernel void matvec_mult(__global float* matrixA, __global float* matrixB,
                          __global float* matrizResult, int size) {
    int i = get_global_id(0);   // row of the result
    int j = get_global_id(1);   // column of the result
    // Accumulate in a private variable instead of read-modify-writing
    // global memory on every iteration (which also assumed the result
    // buffer was zero-initialized).
    float sum = 0.0f;
    for (int k = 0; k < size; k++) {
        sum += matrixA[i*size + k] * matrixB[k*size + j];
    }
    matrizResult[i*size + j] = sum;
}
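For reference, here is a rough, simplified sketch of my host side (the names and setup are a minimal placeholder version, not the exact application code):

#include <CL/cl.hpp>   // OpenCL 1.x C++ wrapper
#include <string>
#include <vector>

int main() {
    const int size = 2000;
    std::vector<float> A(size * size, 1.0f), B(size * size, 1.0f), C(size * size);

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Context context(devices);
    cl::CommandQueue queue(context, devices[0]);

    // Same kernel as above, embedded as a string so the sketch compiles on its own.
    std::string src = R"CL(
        __kernel void matvec_mult(__global float* matrixA, __global float* matrixB,
                                  __global float* matrizResult, int size) {
            int i = get_global_id(0), j = get_global_id(1);
            float sum = 0.0f;
            for (int k = 0; k < size; k++)
                sum += matrixA[i*size + k] * matrixB[k*size + j];
            matrizResult[i*size + j] = sum;
        }
    )CL";
    cl::Program program(context, src);
    program.build(devices);
    cl::Kernel kernel(program, "matvec_mult");

    cl::Buffer bufA(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                    sizeof(float) * A.size(), A.data());
    cl::Buffer bufB(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                    sizeof(float) * B.size(), B.data());
    cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, sizeof(float) * C.size());
    kernel.setArg(0, bufA);
    kernel.setArg(1, bufB);
    kernel.setArg(2, bufC);
    kernel.setArg(3, size);

    cl::NDRange globalThreads(size, size);
    // Note: each global dimension must be a multiple of the local dimension,
    // so (256, 1) needs the global size padded up to a multiple of 256.
    cl::NDRange localThreads(16, 16);   // ~3 s; with (256, 1) it was ~85 s
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalThreads, localThreads);
    queue.finish();

    queue.enqueueReadBuffer(bufC, CL_TRUE, 0, sizeof(float) * C.size(), C.data());
    return 0;
}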
Thanks
The textbook definition of matrix-matrix multiplication that you are using is most unsuitable for both CPU and GPU calculations unless the matrices are small.

Do a web search with the terms: tiled matrix multiplication

Implementing a tiled matrix-matrix multiplication on GPUs for matrices of sizes other than powers of 2 is complicated, but the gains in performance are impressive.
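For illustration, here is a minimal sketch of a 16x16 tiled kernel that stages blocks of A and B in local memory. It assumes size is a multiple of TILE; handling other sizes requires the boundary checks that make the general case complicated:

#define TILE 16

__kernel void matmul_tiled(__global const float* A,
                           __global const float* B,
                           __global float* C,
                           int size) {
    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE];

    int row = get_global_id(0);   // row of C this work-item computes
    int col = get_global_id(1);   // column of C this work-item computes
    int lr  = get_local_id(0);
    int lc  = get_local_id(1);

    float sum = 0.0f;
    for (int t = 0; t < size / TILE; t++) {
        // Cooperatively load one TILE x TILE block of A and of B.
        tileA[lr][lc] = A[row * size + (t * TILE + lc)];
        tileB[lr][lc] = B[(t * TILE + lr) * size + col];
        barrier(CLK_LOCAL_MEM_FENCE);   // wait until both tiles are loaded

        for (int k = 0; k < TILE; k++)
            sum += tileA[lr][k] * tileB[k][lc];
        barrier(CLK_LOCAL_MEM_FENCE);   // wait before the tiles are overwritten
    }
    C[row * size + col] = sum;
}

Launch it with cl::NDRange localThreads(TILE, TILE); each element of A and B is then read from global memory only size/TILE times instead of size times.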
--
Hello, thanks for answering.
My question is about the difference in runtime depending on the NDRange.
Why did I get such different timing results: 85 seconds using NDRange localThreads(256, 1) versus 3 seconds using NDRange localThreads(16, 16)?