laura88
Journeyman III

multithreading spmv (matrix in CSR)

Hi,

I have found several kernel code samples for sparse matrix-vector multiplication (with the matrix compressed in CSR format), but I don't get the expected result and I don't understand why.

Here is my kernel code:

__kernel void spmv(__global float *values, __global int *col_idx, __global int *row_ptr, __global float *vec_in, __global float *vec_out, const int num_rows, __local float *red_mem)
{
    const int items_per_row = 32;
    const int idx = get_global_id(0) / items_per_row;
    if (idx >= num_rows) return;

    float sum = 0;
    int row = idx;
    int s = row_ptr[row];
    int e = row_ptr[row+1];
    const int local_id = get_local_id(0);
    const int local_row_id = local_id / items_per_row;
    const int local_row_offset = local_id % items_per_row;
    for (int i = s + local_row_offset; i < e; i += items_per_row)
    {
        sum += values[i] * vec_in[col_idx[i]];
    }
    red_mem[local_id] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    // reduction step
    if (local_row_offset < 16) red_mem[local_id] += red_mem[local_id + 16];
    if (local_row_offset < 8)  red_mem[local_id] += red_mem[local_id + 8];
    if (local_row_offset < 4)  red_mem[local_id] += red_mem[local_id + 4];
    if (local_row_offset < 2)  red_mem[local_id] += red_mem[local_id + 2];
    if (local_row_offset < 1)  red_mem[local_id] += red_mem[local_id + 1];
    if (local_row_offset == 0)
    {
        vec_out[row] += red_mem[local_id];
    }
}

Here is how I launch the kernel (blockSize is initialized to 32):

cl::Event ndrEvent;
int sRes = wRes * hRes;
if ((sRes % blockSize) != 0)
    sRes = sRes + blockSize - (sRes % blockSize);
cl::NDRange globalSize(sRes, 1);  // sRes is the size of the output vector
cl::NDRange localSize(blockSize, 1);
// I set the kernel's arguments
err = queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    globalSize,
    localSize,
    NULL,
    &ndrEvent);

I'm kind of desperate, so if someone could help me, it would be great!

Thanks in advance

notzed
Challenger

Your code has multiple parallelism mistakes: you need to use barriers between the parallel reduction steps.

See here:

http://devgurus.amd.com/message/1279900#1279900

Also, AFAIK, every thread should go through the same barrier - you can't early exit. Using a CPU target should get the compiler to complain here, or at least let you check your usage is OK (IIRC).

(BTW compiling with the cpu will probably make it work, since it doesn't parallelise workgroups, but that doesn't mean the code is correct).


OK, I made changes according to what I understood from your reply, but it still doesn't give me the expected result (using a CPU target gives me the same wrong result, just slower).

Here is my corrected code:

__kernel void spmv_openCL_simple(__global float *values, __global int *col_idx, __global int *row_ptr, __global float *vec_in, __global float *vec_out, const int num_rows, __local float *red_mem)
{
    int items_per_row = 32;
    const int idx = get_global_id(0) / items_per_row;

    if (idx < num_rows)
    {
        float sum = 0;
        int row = idx;
        int s = row_ptr[row];
        int e = row_ptr[row+1];
        const int local_id = get_local_id(0);
        const int local_row_id = local_id / items_per_row;
        const int local_row_offset = local_id % items_per_row;
        for (int i = s + local_row_offset; i < e; i += items_per_row)
        {
            sum += values[i] * vec_in[col_idx[i]];
        }

        // accumulation step
        red_mem[local_id] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);

        // reduction step
        while (items_per_row > 1)
        {
            items_per_row >>= 1;
            if (local_row_offset < items_per_row) red_mem[local_id] += red_mem[local_id + items_per_row];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (local_row_offset == 0)
        {
            vec_out[row] += red_mem[local_row_id * items_per_row];
        }
    }
}

Could the problem come from a bad initialisation of the global work size and the local work size?

Thanks again


Hmmm, well you must have some algorithmic problem then. It could be to do with the global/local work sizes, but I don't know enough about the algorithm to determine that by cursory inspection.

I can only suggest progressively simplifying the algorithm until you get it working, and then adding the parallelism back afterwards. This is what I do ...

e.g. replacing the parallel reduction step with something like

if (local_row_offset == 0) {
    sum = 0;
    for (int i = 0; i < 32; i++)
        sum += /* add element */;
}

or using printf to print out intermediate results (limit the size of the problem so this is decipherable).
