drknoerig77
Journeyman III

Matrix multiplication ok this way?

I wrote a small matrix multiplication kernel for self-educational purposes:

/**
 * @brief Matrix multiplication kernel.
 * @param A The matrix as concatenated row vectors.
 * @param x Multiplicand vector (size: number of columns in A).
 * @param y Result vector A*x (size: number of rows in A).
 * @param nRows Number of rows in the matrix.
 * @param nCols Number of columns in the matrix.
 * @param xbuffer Local memory buffer for the multiplicand vector.
 * @param resultBuffer Local memory buffer for the per-element products.
 */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable // required for double in OpenCL C 1.x

__kernel void matrixMultiplication(__global double *A,
                                   __global double *x,
                                   __global double *y,
                                   unsigned int nRows,
                                   unsigned int nCols,
                                   __local double *xbuffer,
                                   __local double *resultBuffer)
{
    int column = get_local_id(0);  // equals the column number
    int row = get_group_id(0);     // workgroup number - should be the row number
    int globalID = get_global_id(0);

    // Stage the multiplicand vector in local memory.
    xbuffer[column] = x[column];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Element-wise product of this row with the vector.
    resultBuffer[column] = xbuffer[column] * A[globalID];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Serial reduction performed by the first work-item of the group.
    if (column == 0)
    {
        double sum = 0;
        for (int k = 0; k < nCols; k++) sum += resultBuffer[k];
        y[row] = sum;
    }
}

I call the kernel this way:

        cl::NDRange globalSize(A.getNumberOfRows()*A.getNumberOfColumns());
        cl::NDRange localSize(A.getNumberOfColumns());
        rcpp::KernelFunctor vektorSquareFunc(matrixMultiplicationKernel,queue,globalSize,localSize);
        vektorSquareFunc(cl::Buffer(A),
                         cl::Buffer(x),
                         cl::Buffer(y),
                         A.getNumberOfRows(),
                         A.getNumberOfColumns(),
                         rcpp::KernelFunctor::createLocalMemoryBuffer<double>(A.getNumberOfColumns()),
                         rcpp::KernelFunctor::createLocalMemoryBuffer<double>(A.getNumberOfColumns()));

It basically works.

My questions are:

     1. Is it okay to use the matrix column count as the local/workgroup size?

     2. Is my use of the memory fences and local memory correct?

     3. Is there a better way to handle the sum-up (reduction) of the row-vector/vector products?

himanshu_gautam
Grandmaster

The code is fine.

1. It's okay to use the column count as the local/workgroup size.

2. The handling of local memory and the memory fences is okay.

3. Currently you treat each row as one workgroup. You can improve performance by processing multiple rows per workgroup.


Thank you very much for your answer!

Currently I'm in doubt whether it was a good idea to cache the product vector in a local buffer. As far as I understand, the idea of workgroups is to split the global-size problem into small sub-problems that the device can handle at once. If that is right, wouldn't OpenCL automatically process the row vectors in parallel as long as there are unused processing units? For example, if my row-vector size, and therefore my workgroup size, is 3 and my device has 8 compute units, wouldn't OpenCL start two workgroups in parallel? Here I have to admit that my understanding of OpenCL needs progress...


Yes, of course: all workgroups are executed in parallel across the CUs (compute units), and all work-items within a workgroup also run in parallel. But the point you should understand is that once a workgroup has been assigned to a CU, it must complete on that CU; it cannot migrate to another one in between.
