4 Replies Latest reply on Sep 25, 2013 10:54 PM by himanshu.gautam

    Matrix multiplication ok this way?


      I wrote a small matrix multiplication kernel for self-educational purposes:



      /**
       * @brief Matrix multiplication kernel.
       * @param A The matrix as concatenated row vectors.
       * @param x Multiplicand vector (size: number of columns in A).
       * @param y Result vector A*x (size: number of rows in A).
       * @param nRows Number of rows in the matrix.
       * @param nCols Number of columns in the matrix.
       * @param xbuffer Local memory buffer for the multiplicand vector.
       * @param resultBuffer Local memory buffer for the per-column products of one row.
       */
      __kernel void matrixMultiplication(__global double *A, __global double *x,
                                         __global double *y, unsigned int nRows,
                                         unsigned int nCols, __local double *xbuffer,
                                         __local double *resultBuffer)
      {
          int column = get_local_id(0);   // equals the column number
          int row = get_group_id(0);      // the workgroup number - should be the row number
          int globalID = get_global_id(0);

          // Cache the multiplicand vector in local memory.
          xbuffer[column] = x[column];
          barrier(CLK_LOCAL_MEM_FENCE);

          // Each work-item computes one product of its row.
          resultBuffer[column] = A[globalID] * xbuffer[column];
          barrier(CLK_LOCAL_MEM_FENCE);

          // The first work-item of each group sums up the row's products.
          if (column == 0)
          {
              double sum = 0;
              for (int k = 0; k < nCols; k++) sum += resultBuffer[k];
              y[row] = sum;
          }
      }





      I call the kernel this way:

              cl::NDRange globalSize(A.getNumberOfRows()*A.getNumberOfColumns());
              cl::NDRange localSize(A.getNumberOfColumns());
              rcpp::KernelFunctor vektorSquareFunc(matrixMultiplicationKernel,queue,globalSize,localSize);


      It basically works.
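      For verifying the kernel's output on the host, a plain C++ reference implementation of y = A*x can help (a sketch only; it assumes the same row-major layout as the kernel, and matVec is an illustrative name, not part of the original code):

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <vector>

      // Reference y = A*x with A stored as concatenated row vectors,
      // matching the kernel's memory layout.
      std::vector<double> matVec(const std::vector<double>& A,
                                 const std::vector<double>& x,
                                 std::size_t nRows, std::size_t nCols) {
          std::vector<double> y(nRows, 0.0);
          for (std::size_t row = 0; row < nRows; ++row)
              for (std::size_t col = 0; col < nCols; ++col)
                  y[row] += A[row * nCols + col] * x[col];
          return y;
      }

      int main() {
          // 2x3 example: A = [[1,2,3],[4,5,6]], x = [1,1,1]
          std::vector<double> A{1, 2, 3, 4, 5, 6};
          std::vector<double> x{1, 1, 1};
          std::vector<double> y = matVec(A, x, 2, 3);
          assert(y[0] == 6.0 && y[1] == 15.0);
          return 0;
      }
      ```

      Comparing the device result against such a reference after each change makes it easier to spot synchronization bugs than eyeballing the numbers.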


      My questions are:

           1. Is it okay to set the matrix column size as the local/workgroup size argument?

           2. Is the usage of the memory fences and the local memory okay?

           3. Is there a better way to handle the summing-up of the row-vector/vector product?

        • Re: Matrix multiplication ok this way?

          The code is fine.


          1. It's okay to set the column size as the local/workgroup size.

          2. The handling of local memory and memory fences is okay.

          3. Currently you are treating each row as one workgroup. You can improve performance by processing multiple rows per workgroup instead.
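          To sketch what "multiple rows per workgroup" means for the indexing, here is a small host-side C++ check of the index arithmetic (rowsPerGroup and rowCol are illustrative names, not from the original kernel; the actual kernel change would use get_group_id/get_local_id the same way):

          ```cpp
          #include <cassert>
          #include <cstddef>
          #include <utility>

          // Recover (row, column) of a work-item when rowsPerGroup rows are
          // packed into one workgroup of rowsPerGroup * nCols work-items.
          std::pair<std::size_t, std::size_t> rowCol(std::size_t globalID,
                                                     std::size_t nCols,
                                                     std::size_t rowsPerGroup) {
              const std::size_t localSize = rowsPerGroup * nCols;  // work-items per group
              const std::size_t groupID   = globalID / localSize;  // get_group_id(0)
              const std::size_t localID   = globalID % localSize;  // get_local_id(0)
              return { groupID * rowsPerGroup + localID / nCols,   // row
                       localID % nCols };                          // column
          }

          int main() {
              const std::size_t nRows = 8, nCols = 4, rowsPerGroup = 2;
              // Every work-item must map back to its element in the row-major matrix A.
              for (std::size_t globalID = 0; globalID < nRows * nCols; ++globalID) {
                  const auto rc = rowCol(globalID, nCols, rowsPerGroup);
                  assert(rc.first * nCols + rc.second == globalID);
              }
              return 0;
          }
          ```

          The per-row reduction then runs once per row inside the group, but the larger workgroup gives the device more work-items to keep its SIMD lanes busy.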

            • Re: Matrix multiplication ok this way?

              Thank you very much for your answer!

              Currently I'm in doubt whether it was a good idea to cache the product vector in a local buffer. As far as I understand, the idea of workgroups is to split the global-size problem into small sub-problems that the device can handle at once. If this is right, wouldn't OpenCL automatically process the row vectors in parallel if there are still unused processing units? For example, if my row vector size and therefore my workgroup size is 3 and my device has 8 compute units, wouldn't OpenCL start two workgroups in parallel? Here I have to admit that my understanding of OpenCL needs progress....