Hello everyone,
I have an AMD Radeon R7 M260 GPU with 6 OpenCL compute units, 3 work-item dimensions, and a maximum of 256 work-items per dimension.
I wrote an OpenCL application for matrix multiplication and have been testing its performance on higher-order matrices (1000x1000, 2000x2000, 3000x3000, etc.). Some of the results confused me.
For example, when I run a multiplication A (2000x2000) x B (2000x2000) = C (2000x2000) with cl::NDRange localThreads(256, 1), execution takes ~85 seconds, which is barely faster than the traditional sequential version running on the CPU. But when I change it to cl::NDRange localThreads(16, 16), the execution time drops from ~85 seconds to ~3 seconds, which is the kind of speedup I was expecting from parallel computation.
My question is: what actually happens at run time that makes the shape of the local work-group change the execution time so much? Could someone explain this to me?
Here is my kernel:
__kernel void matvec_mult(__global const float* matrixA,
                          __global const float* matrixB,
                          __global float* matrizResult,
                          int size) {
    int i = get_global_id(0);
    int j = get_global_id(1);

    // Accumulate in a private variable instead of read-modify-writing
    // global memory on every iteration; this also avoids relying on
    // the result buffer being zero-initialized.
    float sum = 0.0f;
    for (int k = 0; k < size; k++) {
        sum += matrixA[i * size + k] * matrixB[k * size + j];
    }
    matrizResult[i * size + j] = sum;
}
Thanks