2 Replies Latest reply on Jul 14, 2016 11:31 AM by alexfd7

    Question about the time of execution


      Hello everyone,

      I have an AMD Radeon R7 M260 GPU, which reports 6 OpenCL cores, 3 work-item dimensions, and 256 work-items for each dimension.


      I created an OpenCL application for matrix multiplication and am testing its performance on larger matrices (1000x1000, 2000x2000, 3000x3000, etc.). Some of the results raised a question that confused me.


      For example, when I submit the multiplication A (2000x2000) x B (2000x2000) = C (2000x2000) with cl::NDRange localThreads(256, 1), execution finishes in ~85 seconds. Compared with the same example run with the traditional sequential method on the CPU, the gain was very low. I then tried cl::NDRange localThreads(16, 16), and the execution time went from ~85 seconds to ~3 seconds, which is what I was really expecting from parallel computation.


      My question is: what actually happens at run time to change the execution time so much? Could someone explain it to me?


      Here is my kernel:

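      __kernel void matvec_mult( __global float* matrixA, __global float* matrixB, __global float* matrizResult, int size) {
         int i = (int)get_global_id(0);  // row of the result element
         int j = (int)get_global_id(1);  // column of the result element
         // Accumulates into global memory, so matrizResult must be zeroed by the host first.
         for(int k = 0; k<size; k++){
           matrizResult[i*size+j] +=  matrixA[i*size +k] * matrixB[k*size+j];
         }
      }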



        • Re: Question about the time of execution

          The textbook definition of matrix-matrix multiplication that you are using is poorly suited to both CPU and GPU calculations unless the matrices are small.
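
One way to see this for the two work-group shapes in the question is to count how much distinct data a single work-group reads. The helper below is a rough sketch, not a performance model: `workgroup_footprint` is a hypothetical name, and it assumes the indexing of the posted kernel, where `get_global_id(0)` selects a row of A and `get_global_id(1)` a column of B. It ignores memory coalescing and cache behavior, which also influence the measured times.

```c
#include <stddef.h>

/* Distinct elements of A and B that one work-group of shape
   (local0, local1) reads over the whole k loop of the posted kernel.
   Assumption: dimension 0 walks rows of A, dimension 1 walks columns of B. */
size_t workgroup_footprint(size_t local0, size_t local1, size_t size)
{
    size_t rows_of_a = local0;  /* one full row of A per distinct i */
    size_t cols_of_b = local1;  /* one full column of B per distinct j */
    return (rows_of_a + cols_of_b) * size;
}
```

For size = 2000, both shapes hold 256 work-items, but (256, 1) touches 514,000 elements per work-group while (16, 16) touches 64,000, roughly 8 times fewer. Less distinct data per work-group leaves more room for reuse, which is one plausible contributor to the gap, though it does not by itself explain the full difference between ~85 and ~3 seconds.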


          Do a web search with the terms: tiled matrix multiplication


          Implementing a tiled matrix-matrix multiplication on GPUs for matrices whose sizes are not powers of 2 is complicated, but the gains in performance are impressive.
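
The tiling idea can be sketched on the CPU in plain C before porting it to a kernel. This is a minimal illustration under stated assumptions, not the GPU implementation the reply refers to: `matmul_tiled` and its `tile` parameter are hypothetical names, the matrices are square and row-major as in the question, and edge tiles are simply clipped so the size need not be a multiple of the tile edge. A GPU version would typically stage each tile in `__local` memory, which is not shown here.

```c
#include <stddef.h>

/* C = A * B for row-major n x n matrices, processed in tile x tile blocks.
   Each block of A and B is reused across a whole block of C before moving on. */
void matmul_tiled(const float *a, const float *b, float *c,
                  size_t n, size_t tile)
{
    /* Start from a zeroed result so partial products can be accumulated. */
    for (size_t x = 0; x < n * n; x++)
        c[x] = 0.0f;

    for (size_t i0 = 0; i0 < n; i0 += tile) {
        size_t i1 = (i0 + tile < n) ? i0 + tile : n;  /* clip edge tile */
        for (size_t k0 = 0; k0 < n; k0 += tile) {
            size_t k1 = (k0 + tile < n) ? k0 + tile : n;
            for (size_t j0 = 0; j0 < n; j0 += tile) {
                size_t j1 = (j0 + tile < n) ? j0 + tile : n;
                /* Multiply one block of A by one block of B. */
                for (size_t i = i0; i < i1; i++) {
                    for (size_t k = k0; k < k1; k++) {
                        float aik = a[i * n + k];
                        for (size_t j = j0; j < j1; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Calling `matmul_tiled(a, b, c, 2000, 16)` would mirror the 16x16 shape from the question. The clipping with `i1`, `k1`, and `j1` is where sizes that do not divide evenly by the tile edge are handled.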