I wrote a small matrix multiplication kernel for self-educational purposes:

/**
 * @brief Matrix multiplication kernel.
 * @param A The matrix as concatenated row vectors.
 * @param x Multiplicand vector (size: number of columns in A)
 * @param y Result vector A*x (size: number of rows in A)
 * @param nRows Number of rows in the matrix.
 * @param nCols Number of columns in the matrix.
 * @param xbuffer Local memory buffer for the multiplicand vector.
 * @param resultBuffer Local memory buffer for the element-wise products of one row.
 */
__kernel void matrixMultiplication(__global double *A,
                                   __global double *x,
                                   __global double *y,
                                   unsigned int nRows,
                                   unsigned int nCols,
                                   __local double *xbuffer,
                                   __local double *resultBuffer)
{
    int column = get_local_id(0);   // equals the column number
    int row = get_group_id(0);      // fetch the workgroup number - should be the row number
    int globalID = get_global_id(0);

    // Copy the vector into local memory.
    xbuffer[column] = x[column];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item multiplies one matrix element with its vector element.
    resultBuffer[column] = xbuffer[column] * A[globalID];
    barrier(CLK_LOCAL_MEM_FENCE);

    // The first work-item of the group sums up the products of the row.
    if (column == 0)
    {
        double sum = 0;
        for (unsigned int k = 0; k < nCols; k++) sum += resultBuffer[k];
        y[row] = sum;
    }
}

I call the kernel this way:

cl::NDRange globalSize(A.getNumberOfRows()*A.getNumberOfColumns());
cl::NDRange localSize(A.getNumberOfColumns());

rcpp::KernelFunctor vektorSquareFunc(matrixMultiplicationKernel,queue,globalSize,localSize);

vektorSquareFunc(cl::Buffer(A),
                 cl::Buffer(x),
                 cl::Buffer(y),
                 A.getNumberOfRows(),
                 A.getNumberOfColumns(),
                 rcpp::KernelFunctor::createLocalMemoryBuffer<double>(A.getNumberOfColumns()),
                 rcpp::KernelFunctor::createLocalMemoryBuffer<double>(A.getNumberOfColumns()));

It basically works.

My questions are:

1. Is it OK to use the matrix column count as the local/work-group size?

2. Is the usage of the memory fences and the local memory OK?

3. How can I handle the summation of the row-vector/vector product in a better way?

The code is fine.

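1. It's okay to use the column count as the local/work-group size, as long as it does not exceed the device's maximum work-group size (CL_DEVICE_MAX_WORK_GROUP_SIZE).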

2. The handling of local memory and the memory fences is okay.

3. Currently you treat each row as one work-group. Instead, you can improve performance by letting one work-group process multiple rows.
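To make point 3 more concrete, here is a minimal CPU-side sketch in plain C++ of the index arithmetic a kernel could use when one work-group covers several rows. It is not an OpenCL kernel and not necessarily what was meant above; it only emulates the work-group and work-item loops so the mapping can be checked on the host. The function name `matVecGrouped` and the parameter `rowsPerGroup` are hypothetical. As a further assumption of mine, the serial sum done by column 0 is replaced with a tree reduction per row, which is one common way to handle the summation from question 3.

```cpp
#include <cstddef>
#include <vector>

// CPU emulation of a launch in which one work-group covers rowsPerGroup rows.
// A is row-major with nRows * nCols entries; x has nCols entries.
// rowsPerGroup is a hypothetical tuning parameter, not part of the original code.
std::vector<double> matVecGrouped(const std::vector<double>& A,
                                  const std::vector<double>& x,
                                  std::size_t nRows, std::size_t nCols,
                                  std::size_t rowsPerGroup)
{
    std::vector<double> y(nRows, 0.0);
    if (nCols == 0 || rowsPerGroup == 0) return y;

    const std::size_t localSize = rowsPerGroup * nCols;
    const std::size_t numGroups = (nRows + rowsPerGroup - 1) / rowsPerGroup;

    for (std::size_t group = 0; group < numGroups; ++group) {      // get_group_id(0)
        std::vector<double> resultBuffer(localSize, 0.0);           // stands in for __local memory

        // Phase 1: every work-item multiplies one matrix element.
        for (std::size_t localId = 0; localId < localSize; ++localId) {  // get_local_id(0)
            const std::size_t localRow = localId / nCols;
            const std::size_t column = localId % nCols;
            const std::size_t row = group * rowsPerGroup + localRow;
            if (row < nRows)  // the last group may be only partially filled
                resultBuffer[localId] = A[row * nCols + column] * x[column];
        }
        // A kernel would need barrier(CLK_LOCAL_MEM_FENCE) here.

        // Phase 2: tree reduction within each row; also works when nCols
        // is not a power of two.
        std::size_t stride = 1;
        while (stride < nCols) stride *= 2;
        for (stride /= 2; stride > 0; stride /= 2) {
            for (std::size_t localId = 0; localId < localSize; ++localId) {
                const std::size_t column = localId % nCols;
                if (column < stride && column + stride < nCols)
                    resultBuffer[localId] += resultBuffer[localId + stride];
            }
            // A kernel would need another barrier after each step.
        }

        // Phase 3: the first work-item of each row writes the result.
        for (std::size_t localRow = 0; localRow < rowsPerGroup; ++localRow) {
            const std::size_t row = group * rowsPerGroup + localRow;
            if (row < nRows) y[row] = resultBuffer[localRow * nCols];
        }
    }
    return y;
}
```

In a real kernel the local size would be `rowsPerGroup * nCols`, so it still has to fit within the device's work-group and local memory limits. The global size would then be rounded up to a multiple of the local size, which is why the `row < nRows` guard is needed.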