Please, will some kind soul help me?

I am a newbie to GPU programming and am trying to compare CPU and GPU performance for my own research. I have some pre-existing C++ code that multiplies a matrix by a vector, and I am trying to convert it to execute on the GPU.

I have looked at the sparse matrix-vector multiplication examples, but I am afraid I am still a bit confused. When I try to pass the array values from my C++ code to the kernel code via streams for a gather operation (as shown in the sample), all indices returned are zero.

My original C++ code for matrix-vector multiplication is attached and has the following caveats:

1. The matrix is stored in CSR (compressed sparse row) format

2. 'ahat' holds all non-zero values from the original 'a' matrix, a 2D matrix physically stored as a 1D vector

3. The 'p' vector is the x vector in y = Ax

4. The 'u' vector is the y vector in y = Ax

5. 'csrRows' holds the row pointers and 'csrCols' the column indices of the CSR structure.
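
To make the layout in the list above concrete, here is a minimal sketch that builds the three CSR arrays from a small dense matrix stored row-major in a 1D vector. The array names ahat, csrRows and csrCols follow the post; the CsrMatrix struct and the denseToCsr function are hypothetical helpers introduced only for illustration and are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container; the field names mirror the arrays named in the post.
struct CsrMatrix {
    std::vector<float> ahat;     // non-zero values, stored row by row
    std::vector<int>   csrCols;  // column index of each entry in ahat
    std::vector<int>   csrRows;  // entries of row i are ahat[csrRows[i]] .. ahat[csrRows[i+1]-1]
};

// Build the CSR arrays from a dense n x n matrix stored row-major in a 1D vector.
// (Assumption: the matrix is square, as suggested by the single size 'nn' in the post.)
CsrMatrix denseToCsr(const std::vector<float>& a, int n) {
    assert(n >= 0 && a.size() == static_cast<std::size_t>(n) * n);
    CsrMatrix m;
    m.csrRows.push_back(0);  // row 0 always starts at offset 0
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            const float v = a[static_cast<std::size_t>(i) * n + j];
            if (v != 0.0f) {
                m.ahat.push_back(v);
                m.csrCols.push_back(j);
            }
        }
        // One past the last entry of row i, which is also the start of row i + 1.
        m.csrRows.push_back(static_cast<int>(m.ahat.size()));
    }
    return m;
}
```

For the 3x3 matrix with rows (4 0 1), (0 3 0), (2 0 5), this gives ahat = {4, 1, 3, 2, 5}, csrCols = {0, 2, 1, 0, 2} and csrRows = {0, 2, 3, 5}. Note that csrRows has one more entry than there are rows, which is why the multiply loop can read csrRows[i + 1] for the last row.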

It goes without saying that anyone who helps me is GREAT! Thanks in advance.

// Computes u = A*p for a CSR matrix with nn rows and returns the elapsed time.
double Parameters::matVecMult(int nn, float *&p, float *&u){
    try{
        if(nn <= 0){
            throw FERTMException("Exception matVecMult(): Invalid number of Nodes!\n");
        }
        double time = 0.0;
        Start(1);
        for(int i = 0; i < nn; i++){
            float t = 0.0f;
            int lb = csrRows[i];      // first non-zero of row i
            int ub = csrRows[i + 1];  // one past the last non-zero of row i
            for(int j = lb; j < ub; j++){
                int index = csrCols[j];  // column of the j-th non-zero
                t += ahat[j]*p[index];
            }//INNER FOR-LOOP
            u[i] = t;
        }//OUTER FOR-LOOP
        Stop(1);
        time = GetElapsedTime(1);  // read the same timer that was started and stopped above
        return time;
    }catch(...){
        throw FERTMException("Exception matVecMult(): Something went WRONG!\n");
    }
}//matVecMult()
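
Each iteration of the outer loop above writes only u[i] and reads nothing written by other iterations, so the rows are independent. A common way to move this to a GPU is to let one thread (or one kernel instance) handle one row. The sketch below is plain CPU-side C++, not Brook+ or any other kernel language; it only factors the per-row body out so the decomposition is visible. The function names csrRowDot and csrMatVec are hypothetical, and std::vector is used instead of the raw pointers in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work for ONE row i. In a GPU port, this body is what each thread would run.
float csrRowDot(int i,
                const std::vector<int>& csrRows,
                const std::vector<int>& csrCols,
                const std::vector<float>& ahat,
                const std::vector<float>& p) {
    float t = 0.0f;
    for (int j = csrRows[i]; j < csrRows[i + 1]; ++j) {
        t += ahat[j] * p[csrCols[j]];  // gather: p is read at a data-dependent index
    }
    return t;
}

// CPU driver: the loop over rows stands in for the parallel launch over rows.
std::vector<float> csrMatVec(const std::vector<int>& csrRows,
                             const std::vector<int>& csrCols,
                             const std::vector<float>& ahat,
                             const std::vector<float>& p) {
    assert(!csrRows.empty());                // needs at least the leading 0
    assert(csrCols.size() == ahat.size());   // one column index per non-zero
    const int nn = static_cast<int>(csrRows.size()) - 1;
    std::vector<float> u(static_cast<std::size_t>(nn));
    for (int i = 0; i < nn; ++i) {
        u[i] = csrRowDot(i, csrRows, csrCols, ahat, p);
    }
    return u;
}
```

With csrRows = {0, 2, 3, 5}, csrCols = {0, 2, 1, 0, 2}, ahat = {4, 1, 3, 2, 5} and p = {1, 2, 3}, csrMatVec returns {7, 6, 17}. Checking a GPU version against a small hand-computed case like this is one way to spot a gather that returns all-zero indices.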

Hi, dinaharchery,

We have been working on CSR matrix-vector multiply (single-precision float) recently. The speed-up is roughly 23x compared with a single-core implementation on a Phenom 9550 CPU.

However, the speedup is only seen when the matrix is large enough, say N_nz = 10 million and N_row = 1 million.

There is no speedup when N_row is less than 10000.