
dinaharchery
Journeyman III

Matrix-Vector Multiplication on GPU

Translating Mat-Vec Mult to GPU

Please, will some kind soul help me?

I am a newbie to GPU programming and am trying to compare CPU and GPU performance for my own research. I have some pre-existing C++ code that multiplies a matrix and a vector, and I am trying to convert it to execute on the GPU.

I have looked at the sparse matrix-vector multiplication examples, but I am afraid I am still a bit confused. When I try to pass the array values from my C++ code to the kernel code via streams for a gather operation (as shown in the sample), all the indices returned are zero.

My original C++ code for matrix-vector multiplication is attached, with the following notes:

1. The matrix is stored in CSR format.

2. 'ahat' holds all the non-zero values of the original 'a' matrix (a 2D matrix stored physically as a 1D vector).

3. The 'p' vector is the x vector in y = Ax.

4. The 'u' vector is the y vector in y = Ax.

5. 'csrRows' is the row-pointer array and 'csrCols' holds the column indices of the CSR structure.
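To make this concrete, here is a tiny made-up example of how I understand the CSR layout (a 4x4 matrix, values invented for illustration):

// Example 4x4 matrix (nn = 4), zeros not stored:
//   [ 10  0  0  2 ]
//   [  3  9  0  0 ]
//   [  0  7  8  0 ]
//   [  0  0  0  5 ]
float ahat[]  = { 10, 2, 3, 9, 7, 8, 5 }; // non-zero values, row by row
int csrCols[] = {  0, 3, 0, 1, 1, 2, 3 }; // column index of each value
int csrRows[] = {  0, 2, 4, 6, 7 };       // row i spans [csrRows[i], csrRows[i+1])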

It goes without saying that anyone who helps me is GREAT! Thanks in advance

double Parameters::matVecMult(int nn, float *&p, float *&u){
    try{
        if(nn <= 0){
            throw FERTMException("Exception matVecMult(): Invalid number of Nodes!\n");
        }
        double time = 0.0;
        Start(1);
        for(int i = 0; i < nn; i++){
            float t = 0.0f;
            int lb = csrRows[i];        // first non-zero of row i
            int ub = csrRows[i + 1];    // one past the last non-zero of row i
            for(int j = lb; j < ub; j++){
                int index = csrCols[j]; // column of this non-zero
                t += ahat[j] * p[index];
            }//INNER FOR-LOOP
            u[i] = t;
        }//OUTER FOR-LOOP
        Stop(1);
        time = GetElapsedTime(1);
        return time;
    }catch(...){
        throw FERTMException("Exception matVecMult(): Something went WRONG!\n");
    }
}//matVecMult()

the729
Journeyman III

Hi, dinaharchery,

We have been working on CSR matrix-vector multiply (single-precision float) recently. The speed-up is roughly 23x compared with a single-core implementation on a Phenom 9550 CPU.

However, the speedup only appears when the matrix is large enough, say N_nz = 10 million and N_row = 1 million.

There is no speedup when N_row is less than 10,000.
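In case it helps with the translation: each row of the output is independent, so on the GPU one kernel invocation (one element of the output stream) computes one row. Here is the per-row unit of work sketched in plain C++, using your array names; this is only an illustration of the decomposition, not our actual kernel:

float csrRowDot(int i, const float *ahat, const int *csrRows,
                const int *csrCols, const float *p)
{
    // The body a GPU kernel would execute, with i supplied by the
    // thread/domain index instead of an outer loop.
    float t = 0.0f;
    for (int j = csrRows[i]; j < csrRows[i + 1]; ++j)
        t += ahat[j] * p[csrCols[j]]; // the gather: scattered reads from p
    return t;                         // the caller stores this into u[i]
}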


dinaharchery
Journeyman III

Thank you for the reply.

The performance you mention seems consistent with the papers I have read. Do you think that is because of data reuse, or communication latency over the PCI bus?

Anyway, I would still like to implement the C++ code on the GPU. Can you help?

 

Thanks again


the729
Journeyman III

We do not count any data-transfer latency between system memory and graphics memory.

The bottleneck is gathering the data from x using the column indices, since the memory accesses are not contiguous and the cache hit rate is very low.

If you plan to implement only the SpMV operation on the GPU and transfer the data between the system and the card every time, I personally do not think it is a good idea. When accelerating an algorithm on the GPU, we should try to keep all the data in graphics memory as long as possible, and do our best to reduce transfers.

In other words, implementing the whole algorithm on the GPU is much more beneficial than implementing part of it. We do not count the transfer latency when measuring performance because we assume that SpMV is part of a GPU algorithm rather than a CPU one.


dinaharchery
Journeyman III

Thanks for the info.

I can see how gathering x using the column indices is an issue, since the accesses are not contiguous; I am running into this problem myself. The matrix-vector algorithm is used within a conjugate gradient solver, and I decided to implement that part on the GPU since it seems to be the performance bottleneck.

Maybe I should try to do the entire solver on the GPU, although that would be no easy task.
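For reference, the loop I would have to move over looks roughly like this (a stripped-down CG skeleton in plain C++, assuming a zero initial guess and a symmetric positive-definite matrix; the spmv callback stands in for my matVecMult):

#include <cmath>
#include <vector>

// Dot product of two length-nn vectors.
static float dot(int nn, const float *a, const float *b){
    float s = 0.0f;
    for(int i = 0; i < nn; i++) s += a[i] * b[i];
    return s;
}

// Solve A*x = b for symmetric positive-definite A, starting from x = 0.
// spmv(nn, in, out) computes out = A*in, i.e. the matVecMult role.
void conjugateGradient(int nn, const float *b, float *x,
                       void (*spmv)(int, const float*, float*),
                       int maxIter, float tol){
    std::vector<float> r(b, b + nn); // residual: r = b - A*x = b when x = 0
    std::vector<float> p(r);         // search direction
    std::vector<float> Ap(nn);
    for(int i = 0; i < nn; i++) x[i] = 0.0f;
    float rr = dot(nn, r.data(), r.data());
    for(int k = 0; k < maxIter && std::sqrt(rr) > tol; k++){
        spmv(nn, p.data(), Ap.data()); // the SpMV hot spot
        float alpha = rr / dot(nn, p.data(), Ap.data());
        for(int i = 0; i < nn; i++){
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        float rrNew = dot(nn, r.data(), r.data());
        for(int i = 0; i < nn; i++) p[i] = r[i] + (rrNew / rr) * p[i];
        rr = rrNew;
    }
}

Every iteration reuses the same few vectors (r, p, Ap, x), so in principle they could all stay resident in graphics memory for the whole solve, which I think is your point.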

 
