We have been working on CSR matrix-vector multiply (single-precision float) recently. The speed-up is roughly 23x compared with a single-core implementation on a Phenom 9550 CPU.
However, the speedup only appears when the matrix is large enough, say N_nz = 10 million and N_row = 1 million.
There is no speedup when N_row is less than 10,000.
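For reference, the single-core baseline we compare against is just the usual CSR loop. A minimal sketch (the variable names here are illustrative, not the actual benchmark code):

```cpp
// Reference single-core CSR SpMV: y = A*x, single precision.
// row_ptr has n_row + 1 entries; col_idx and val have n_nz entries.
void spmv_csr_cpu(int n_row,
                  const int*   row_ptr,
                  const int*   col_idx,
                  const float* val,
                  const float* x,
                  float*       y)
{
    for (int i = 0; i < n_row; ++i) {
        float sum = 0.0f;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   // gather from x via column index
        y[i] = sum;
    }
}
```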
Thank you for the reply.
The performance you mentioned seems consistent with papers I have read. Do you think that is because of data reuse, or because of communication latency over the PCI bus?
Anyway, I would still like to port the C++ code to the GPU. Can you help?
We do not count any data-transfer latency between system memory and graphics memory.
The bottleneck is gathering data from x using the column indices, since the memory accesses are not contiguous and the cache hit rate is very low.
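To make the access pattern concrete, here is a minimal one-thread-per-row (scalar) CSR kernel; it is only a sketch, not our actual kernel, but the x[col_idx[j]] load is exactly the irregular gather in question:

```cpp
// Minimal scalar CSR SpMV kernel: one thread per row.
// The load x[col_idx[j]] is the irregular gather; neighbouring threads
// read unrelated addresses, so those reads are neither coalesced nor cache-friendly.
__global__ void spmv_csr_scalar(int n_row,
                                const int*   row_ptr,
                                const int*   col_idx,
                                const float* val,
                                const float* x,
                                float*       y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_row) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   // non-contiguous gather from x
        y[row] = sum;
    }
}
```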
If you plan to implement only the SpMV operation on the GPU and transfer data between system memory and the card every time, I personally do not think this is a good idea. When accelerating an algorithm on the GPU, we should try to keep all data in graphics memory as long as possible and do our best to reduce transfers.
In other words, implementing the whole algorithm on the GPU is much more beneficial than implementing part of it. We do not count the transfer latency when measuring performance because we assume that SpMV is only one stage of an algorithm that runs entirely on the GPU, not a CPU algorithm that calls out to the GPU.
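As a rough illustration of what I mean (the buffer names and iteration count are made up, and the kernel is the scalar sketch above), the goal is to pay the PCIe cost once for the matrix and vectors rather than once per multiply:

```cpp
// d_* buffers were allocated with cudaMalloc beforehand; square matrix assumed.
// Upload once, run many SpMVs on the device, copy the result back only at the end.
cudaMemcpy(d_row_ptr, h_row_ptr, (n_row + 1) * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_col_idx, h_col_idx, n_nz * sizeof(int),        cudaMemcpyHostToDevice);
cudaMemcpy(d_val,     h_val,     n_nz * sizeof(float),      cudaMemcpyHostToDevice);
cudaMemcpy(d_x,       h_x,       n_row * sizeof(float),     cudaMemcpyHostToDevice);

int threads = 256;
int blocks  = (n_row + threads - 1) / threads;
for (int it = 0; it < n_iters; ++it) {
    spmv_csr_scalar<<<blocks, threads>>>(n_row, d_row_ptr, d_col_idx, d_val, d_x, d_y);
    // ... the rest of the algorithm's kernels operate on d_x / d_y in place ...
}

cudaMemcpy(h_y, d_y, n_row * sizeof(float), cudaMemcpyDeviceToHost);
```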
Thanks for the info.
I can see how gathering x using the column indices is an issue since the accesses are not contiguous - I am running into this problem. The matrix-vector product is used inside a conjugate gradient solver, and I decided to implement that part on the GPU since it seems to be the performance bottleneck.
Maybe I should try to do the entire solver on the GPU - although this would be no easy task.
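Moving all of CG to the device is mostly a matter of keeping every vector (x, r, p, Ap) in graphics memory and doing the dot products and AXPYs there too. A rough, untested skeleton of what I have in mind, reusing the spmv_csr_scalar kernel sketched earlier and cuBLAS for the BLAS-1 parts:

```cpp
#include <cmath>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Defined as in the earlier sketch; any CSR SpMV kernel will do here.
__global__ void spmv_csr_scalar(int n_row, const int* row_ptr, const int* col_idx,
                                const float* val, const float* x, float* y);

// Unpreconditioned CG for a symmetric positive-definite CSR matrix.
// All pointers are device pointers; d_x holds the initial guess and the result.
void cg_gpu(int n, const int* d_row_ptr, const int* d_col_idx, const float* d_val,
            const float* d_b, float* d_x,
            float* d_r, float* d_p, float* d_Ap,   // work vectors of length n
            int max_iter, float tol)
{
    cublasHandle_t h;
    cublasCreate(&h);

    int threads = 256, blocks = (n + threads - 1) / threads;
    const float one = 1.0f;

    // r = b - A*x,  p = r
    spmv_csr_scalar<<<blocks, threads>>>(n, d_row_ptr, d_col_idx, d_val, d_x, d_Ap);
    cublasScopy(h, n, d_b, 1, d_r, 1);
    float neg_one = -1.0f;
    cublasSaxpy(h, n, &neg_one, d_Ap, 1, d_r, 1);
    cublasScopy(h, n, d_r, 1, d_p, 1);

    float rs_old;
    cublasSdot(h, n, d_r, 1, d_r, 1, &rs_old);

    for (int k = 0; k < max_iter && std::sqrt(rs_old) > tol; ++k) {
        // Ap = A*p  (the only sparse operation per iteration)
        spmv_csr_scalar<<<blocks, threads>>>(n, d_row_ptr, d_col_idx, d_val, d_p, d_Ap);

        float pAp;
        cublasSdot(h, n, d_p, 1, d_Ap, 1, &pAp);
        float alpha = rs_old / pAp;

        cublasSaxpy(h, n, &alpha, d_p, 1, d_x, 1);       // x += alpha * p
        float neg_alpha = -alpha;
        cublasSaxpy(h, n, &neg_alpha, d_Ap, 1, d_r, 1);  // r -= alpha * Ap

        float rs_new;
        cublasSdot(h, n, d_r, 1, d_r, 1, &rs_new);

        float beta = rs_new / rs_old;
        cublasSscal(h, n, &beta, d_p, 1);                // p = r + beta * p
        cublasSaxpy(h, n, &one, d_r, 1, d_p, 1);
        rs_old = rs_new;
    }

    cublasDestroy(h);
}
```

That way the only host-device traffic per iteration is the handful of scalars returned by the dot products, so the transfer cost is paid once for the matrix and right-hand side, as suggested above.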