Hi,
I found several examples of kernel code for sparse matrix-vector multiplication (with the matrix compressed in CSR format), but I don't get the expected result, and I don't understand why.
Here is my kernel code:
__kernel void spmv(__global float *values, __global int *col_idx, __global int *row_ptr, __global float *vec_in, __global float *vec_out, const int num_rows, __local float *red_mem)
{
    const int items_per_row = 32;
    const int idx = get_global_id(0) / items_per_row;
    if (idx >= num_rows) return;
    float sum = 0;
    int row = idx;
    int s = row_ptr[row];
    int e = row_ptr[row+1];
    const int local_id = get_local_id(0);
    const int local_row_id = local_id / items_per_row;
    const int local_row_offset = local_id % items_per_row;
    for (int i = s + local_row_offset; i < e; i += items_per_row)
    {
        sum += values[i] * vec_in[col_idx[i]];
    }
    red_mem[local_id] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    //reduction step
    if (local_row_offset < 16) red_mem[local_id] += red_mem[local_id + 16];
    if (local_row_offset < 8) red_mem[local_id] += red_mem[local_id + 8];
    if (local_row_offset < 4) red_mem[local_id] += red_mem[local_id + 4];
    if (local_row_offset < 2) red_mem[local_id] += red_mem[local_id + 2];
    if (local_row_offset < 1) red_mem[local_id] += red_mem[local_id + 1];
    if (local_row_offset == 0)
    {
        vec_out[row] += red_mem[local_id];
    }
}
and here is how I launch the kernel (blockSize is initialized to 32):
cl::Event ndrEvent;
int sRes = wRes * hRes;
if ((sRes % blockSize) != 0)
    sRes = sRes + blockSize - (sRes % blockSize);
cl::NDRange globalSize(sRes, 1); //sRes is the size of the output vector
cl::NDRange localSize(blockSize, 1);
//I set the kernel's arguments
err = queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    globalSize,
    localSize,
    NULL,
    &ndrEvent);
I'm kind of desperate, so if someone could help me it would be great!!
Thanks in advance
Your code has multiple parallelism mistakes: you need to use barriers in the parallel reduction steps.
See here:
http://devgurus.amd.com/message/1279900#1279900
Also, AFAIK, every work-item should go through the same barriers - you can't early-exit before one. Using a CPU target should get the compiler to complain, or at least let you check that your usage is OK (IIRC).
(BTW, compiling for the CPU will probably make it work, since it doesn't parallelise work-groups, but that doesn't mean the code is correct.)
OK, I made changes according to what I understood from your reply, but it still doesn't give me the expected result (using a CPU target gives me the same wrong result, just slower).
Here is my corrected code:
__kernel void spmv_openCL_simple(__global float *values, __global int *col_idx, __global int *row_ptr, __global float *vec_in, __global float *vec_out, const int num_rows, __local float *red_mem)
{
    int items_per_row = 32;
    const int idx = get_global_id(0) / items_per_row;
    if (idx < num_rows)
    {
        float sum = 0;
        int row = idx;
        int s = row_ptr[row];
        int e = row_ptr[row+1];
        const int local_id = get_local_id(0);
        const int local_row_id = local_id / items_per_row;
        const int local_row_offset = local_id % items_per_row;
        for (int i = s + local_row_offset; i < e; i += items_per_row)
        {
            sum += values[i] * vec_in[col_idx[i]];
        }
        //accumulation step
        red_mem[local_id] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);
        //reduction step
        while (items_per_row > 1)
        {
            items_per_row >>= 1;
            if (local_row_offset < items_per_row) red_mem[local_id] += red_mem[local_id + items_per_row];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (local_row_offset == 0)
        {
            vec_out[row] += red_mem[local_row_id * items_per_row];
        }
    }
}
Could the problem come from a bad initialisation of the global work size and the local work size?
Thanks again
Hmmm, well, you must have some algorithmic problem then. It could be to do with the global/local work size, but I don't know enough about the algorithm to determine that by cursory inspection.
I can only suggest progressively simplifying the algorithm until you get it working, and then adding back the parallelism afterwards. This is what I do ...
e.g. replacing the parallel reduction step with something like
if (local_row_offset == 0) {
    sum = 0;
    for (int i = s; i < e; i++)
        sum += values[i] * vec_in[col_idx[i]];
}
or using printf to print out intermediate results (limit the size of the problem so the output is decipherable).