3 Replies Latest reply on Mar 28, 2012 4:46 PM by notzed

    multithreading spmv (matrix in CSR)

    laura88

      Hi,

       

      I found different sources of kernel code to do the sparse matrix vector multiplication (when the matrix is compressed with the CSR format), but I don't get the expected result, and I don't understand why.

       

      Here is my kernel code

      __kernel void spmv(__global float *values, __global int *col_idx, __global int *row_ptr, __global float* vec_in, __global float* vec_out, const int num_rows, __local float *red_mem)

      {

                const int items_per_row = 32;

                const int idx = get_global_id(0)/items_per_row;

                if (idx >= num_rows) return;

       

                float sum = 0;

                int row = idx;

                int s = row_ptr[row];

                int e = row_ptr[row+1];

                const int local_id = get_local_id(0);

                const int local_row_id = local_id/items_per_row;

                const int local_row_offset = local_id%items_per_row;

                for (int i = s + local_row_offset; i<e; i+=items_per_row)

                {

      sum += values[i]*vec_in[col_idx[i]];

                }

       

       

        red_mem[local_id] = sum;

                 barrier(CLK_LOCAL_MEM_FENCE);

       

       

                //reduction step

                if (local_row_offset < 16) red_mem[local_id] += red_mem[local_id + 16];

                if (local_row_offset < 8) red_mem[local_id] += red_mem[local_id + 8];

                if (local_row_offset < 4) red_mem[local_id] += red_mem[local_id +4];

                if (local_row_offset < 2) red_mem[local_id] += red_mem[local_id + 2];

                if (local_row_offset < 1) red_mem[local_id] += red_mem[local_id + 1];

       

                if(local_row_offset==0)

                {

                          vec_out[row] += red_mem[local_id];

                }

       

      }

       

      and here is how I launch the kernel (blockSize is initialized to 32)

       

      cl::Event ndrEvent;

                int sRes = wRes *hRes;

                if ((sRes%blockSize)!=0)

                          sRes = sRes +blockSize-(sRes%blockSize);

       

      cl::NDRange globalSize(sRes,1);  //sRes is the size of the output vector

      cl::NDRange localSize(blockSize,1);

       

      //I set the kernel's arguments

       

      err = queue.enqueueNDRangeKernel (

                          kernel,

                          cl::NullRange,

                          globalSize,

                          localSize,

                          NULL,

        &ndrEvent);

       

      I'm kind of desperate, so if someone could help me, it would be great !!

       

      Thanks in advance

        • Re: multithreading spmv (matrix in CSR)
          notzed

          Your code has multiple parallelism mistakes, you need to use barriers in the parallel reduction steps.

           

          See here:

          http://devgurus.amd.com/message/1279900#1279900

           

          Also, AFAIK, every thread should go through the same barrier - you can't early exit.  Using a CPU target should get the compiler to spit here, or at least check your usage is ok (iirc).

           

          (BTW compiling with the cpu will probably make it work, since it doesn't parallelise workgroups, but that doesn't mean the code is correct).

            • Re: multithreading spmv (matrix in CSR)
              laura88

              Ok, I make changes according to what I understood you said, but it still doesn't give me the expected result (using a CPU target gives me the same wrong result, just slower).

               

              Here is my corrected code:

              __kernel void spmv_openCL_simple(__global float *values, __global int *col_idx, __global int *row_ptr, __global float* vec_in, __global float* vec_out, const int num_rows, __local float *red_mem)

              {

                int items_per_row = 32;

                const int idx = get_global_id(0)/items_per_row;

               

                        if (idx <num_rows)

                        {

                                       float sum = 0;

                                       int row = idx;

                                       int s = row_ptr[row];

                                       int e = row_ptr[row+1];

                     const int local_id = get_local_id(0);

                     const int local_row_id = local_id/items_per_row;

                     const int local_row_offset = local_id%items_per_row;

                                       for (int i = s + local_row_offset; i<e; i+=items_per_row)

                                       {

               

                                                      sum += values[i]*vec_in[col_idx[i]];

                                       }

               

                                       //accumulation step

                                       red_mem[local_id] = sum;

                                        barrier(CLK_LOCAL_MEM_FENCE);

               

                                       //reduction step

                                          while (items_per_row >1)

                                        {

                                                       items_per_row>>=1;

                                                       if(local_row_offset < items_per_row) red_mem[local_id] += red_mem[local_id + items_per_row];

                                       barrier(CLK_LOCAL_MEM_FENCE);

                                        }

               

                                       if(local_row_offset==0)

                                       {

                                                 vec_out[row] += red_mem[local_row_id*items_per_row]; // red_mem[local_row_id*items_per_row];

                                       }

                             }

              }

               

              Could the problem come from a bad initialisation of the global work size et the local work size ?

               

              Thanks again

                • Re: multithreading spmv (matrix in CSR)
                  notzed

                  hmmm, well you must have some algorithmic problem then.  It could be to do with the global/lws, but i don't know enough about the algorithm to determine that by cursory inspection.

                   

                  I can only suggest progressively simplifying the algorithm until you get it working and then adding back the paralellism afterwards.  This is what I do ...

                   

                  e.g. replacing the parallel reduction step with something like

                  if (local_row_offset = 0) {

                    sum=0;

                     for (int i=0;i<32;i++)

                       sum += add element;

                  }

                   

                  or using printf to print out intermediate results (limit the size of the problem so this is decipherable).