Hi,
I found several examples of kernel code for sparse matrix-vector multiplication (with the matrix compressed in CSR format), but I don't get the expected result, and I don't understand why.
Here is my kernel code:
__kernel void spmv(__global float *values, __global int *col_idx, __global int *row_ptr, __global float *vec_in, __global float *vec_out, const int num_rows, __local float *red_mem)
{
    const int items_per_row = 32;
    const int idx = get_global_id(0) / items_per_row;
    if (idx >= num_rows) return;
    float sum = 0;
    int row = idx;
    int s = row_ptr[row];
    int e = row_ptr[row+1];
    const int local_id = get_local_id(0);
    const int local_row_id = local_id / items_per_row;
    const int local_row_offset = local_id % items_per_row;
    for (int i = s + local_row_offset; i < e; i += items_per_row)
    {
        sum += values[i] * vec_in[col_idx[i]];
    }
    red_mem[local_id] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    //reduction step
    if (local_row_offset < 16) red_mem[local_id] += red_mem[local_id + 16];
    if (local_row_offset < 8) red_mem[local_id] += red_mem[local_id + 8];
    if (local_row_offset < 4) red_mem[local_id] += red_mem[local_id + 4];
    if (local_row_offset < 2) red_mem[local_id] += red_mem[local_id + 2];
    if (local_row_offset < 1) red_mem[local_id] += red_mem[local_id + 1];
    if (local_row_offset == 0)
    {
        vec_out[row] += red_mem[local_id];
    }
}
and here is how I launch the kernel (blockSize is initialized to 32):
cl::Event ndrEvent;
int sRes = wRes * hRes;
if ((sRes % blockSize) != 0)
    sRes = sRes + blockSize - (sRes % blockSize);
cl::NDRange globalSize(sRes, 1); //sRes is the size of the output vector
cl::NDRange localSize(blockSize, 1);
//I set the kernel's arguments
err = queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    globalSize,
    localSize,
    NULL,
    &ndrEvent);
I'm kind of desperate, so if someone could help me it would be great!!
Thanks in advance
Your code has multiple parallelism mistakes: you need to use barriers in the parallel reduction steps.
See here:
http://devgurus.amd.com/message/1279900#1279900
Also, AFAIK, every work-item should go through the same barriers - you can't early-exit before one. Using a CPU target should get the compiler to complain, or at least let you check that your usage is OK (IIRC).
(BTW, compiling for the CPU will probably make it work, since it doesn't parallelise work-groups, but that doesn't mean the code is correct.)
OK, I made changes according to what I understood from your reply, but it still doesn't give me the expected result (using a CPU target gives me the same wrong result, just slower).
Here is my corrected code:
__kernel void spmv_openCL_simple(__global float *values, __global int *col_idx, __global int *row_ptr, __global float *vec_in, __global float *vec_out, const int num_rows, __local float *red_mem)
{
    int items_per_row = 32;
    const int idx = get_global_id(0) / items_per_row;
    if (idx < num_rows)
    {
        float sum = 0;
        int row = idx;
        int s = row_ptr[row];
        int e = row_ptr[row+1];
        const int local_id = get_local_id(0);
        const int local_row_id = local_id / items_per_row;
        const int local_row_offset = local_id % items_per_row;
        for (int i = s + local_row_offset; i < e; i += items_per_row)
        {
            sum += values[i] * vec_in[col_idx[i]];
        }
        //accumulation step
        red_mem[local_id] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);
        //reduction step
        while (items_per_row > 1)
        {
            items_per_row >>= 1;
            if (local_row_offset < items_per_row) red_mem[local_id] += red_mem[local_id + items_per_row];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (local_row_offset == 0)
        {
            vec_out[row] += red_mem[local_row_id * items_per_row];
        }
    }
}
Could the problem come from a bad initialisation of the global work size and the local work size?
Thanks again
Hmmm, well, you must have some algorithmic problem then. It could be to do with the global/local work size, but I don't know enough about the algorithm to determine that by cursory inspection.
I can only suggest progressively simplifying the algorithm until you get it working, and then adding back the parallelism afterwards. This is what I do ...
e.g. replacing the parallel reduction step with something like
if (local_row_offset == 0) {
    sum = 0;
    for (int i = s; i < e; i++)
        sum += values[i] * vec_in[col_idx[i]];
}
or using printf to print out intermediate results (limit the size of the problem so the output is decipherable).