I have a Conjugate Gradient solver written for the CPU and am trying to move its most computationally heavy part, the matrix-vector multiplication, onto the GPU. I have done this thanks to gaurav.garg; however, the performance is just BAD. I looked at KernelStreamAnalyzer and it said the bottleneck is the texture fetches. I assume this is somehow associated with the Stream read/writes?
The data is unstructured, so I don't know ahead of time the size of the matrix and vector being solved. Does anyone know of an effective way to reduce the number of Stream read/write calls? I like the idea of using the GPU as a coprocessor, but the performance is horrid.
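To make the transfer pattern concrete, here is a minimal CPU-only C++ sketch of the loop structure (this is not the attached code). It assumes a dense, symmetric positive-definite matrix stored row-major and that only the mat-vec runs on the GPU, as described above. `TransferCount`, `matvec_offloaded`, and `cg_solve` are hypothetical names, and the counters merely stand in for the Stream read/write calls rather than calling any real GPU API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical counters standing in for Stream read/write calls.
struct TransferCount {
    int to_gpu = 0;    // host -> device copies of a vector
    int from_gpu = 0;  // device -> host copies of a vector
};

// Stand-in for the offloaded kernel: y = A * x, A dense n x n, row-major.
// Assumption: A is uploaded once outside the loop, so only x and y move.
std::vector<double> matvec_offloaded(const std::vector<double>& A,
                                     const std::vector<double>& x,
                                     TransferCount& tc) {
    const std::size_t n = x.size();
    assert(A.size() == n * n);
    tc.to_gpu += 1;  // x would be copied to the device here
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    tc.from_gpu += 1;  // y would be copied back to the host here
    return y;
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Plain (unpreconditioned) CG; everything except the mat-vec stays on the CPU.
std::vector<double> cg_solve(const std::vector<double>& A,
                             const std::vector<double>& b,
                             int max_iter, double tol,
                             TransferCount& tc, int& iters) {
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0), r = b, p = b;  // x0 = 0, so r0 = p0 = b
    double rr = dot(r, r);
    iters = 0;
    while (iters < max_iter && std::sqrt(rr) > tol) {
        std::vector<double> Ap = matvec_offloaded(A, p, tc);
        const double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++iters;
    }
    return x;
}
```

In this sketch both counters always equal the iteration count: one vector goes up and one comes back on every CG iteration. That per-iteration traffic is what I would like to reduce.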
Applicable code is attached - sorry for the amount.
Any ideas would be wonderful. Thank you in advance.