First, thanks to everyone on this forum. You have all been a great help to someone like me who is just learning.
I have a question which, I suppose, is kind of a theoretical one regarding data locality and access within kernels.
I have run two different matrix-vector multiplication codes using the GPU as the device back-end.
The first code uses the Compressed Row Storage (CRS) format and "reshuffles" the data by padding the arrays with zeros. These arrays are then passed to streams and to the 'multiply' and 'reduce' kernels to get the solution. I compared the execution times with a standard CPU matrix-vector multiplication, and the GPU was slower than the CPU regardless of the size of the matrix/vector.
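To make the first approach concrete, here is a rough CPU-side sketch in Python of what I mean by padding the CRS data and then running a multiply step followed by a reduce step. This is not my Brook+ code. The function names (`to_padded_crs`, `multiply`, `reduce_rows`, `spmv`) are made up for illustration, and pointing padded slots at column 0 is just one possible choice.

```python
def to_padded_crs(dense):
    """Build zero-padded, CRS-like arrays from a dense matrix (list of rows).

    Every row is padded to the length of the longest row so that all rows
    have the same number of stored entries.
    """
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    width = max((len(r) for r in rows), default=0)
    values, cols = [], []
    for r in rows:
        pad = width - len(r)
        values.append([v for _, v in r] + [0.0] * pad)
        # Assumption: padded slots point at column 0. Their value is 0,
        # so they add nothing to the row sum.
        cols.append([j for j, _ in r] + [0] * pad)
    return values, cols


def multiply(values, cols, x):
    """Gather x[col] for every stored entry and multiply element-wise."""
    return [[v * x[j] for v, j in zip(vr, cr)] for vr, cr in zip(values, cols)]


def reduce_rows(products):
    """Sum each row of partial products to get one output element per row."""
    return [sum(row) for row in products]


def spmv(dense, x):
    """Matrix-vector product via the padded arrays: multiply, then reduce."""
    values, cols = to_padded_crs(dense)
    return reduce_rows(multiply(values, cols, x))


A = [[4, 0, 0],
     [0, 0, 2],
     [1, 3, 0]]
x = [1, 2, 3]
y = spmv(A, x)
```

Note that the multiply step reads `x` through the column indices (a gather), and that the padded entries are still multiplied and summed even though they contribute nothing.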
The other code does NOT use the CRS format; the matrix is just passed straight to the GPU kernel (i.e., the "simple_mat_mult" example code). Comparing the GPU time to the CPU time, this version behaves as expected: the GPU is slower for small matrices/vectors but becomes faster than the CPU as the size increases.
I have not found much information about this behaviour, but I think it could be due to a couple of things. The first is the Brook+ code itself: perhaps the gather and reduce operations have a lot of hidden overhead. The other is that the data passed to the kernels in CRS format may be too non-contiguous, so data locality becomes the issue and the CPU's cache gives it an advantage?
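To illustrate what I mean by "non-contiguous", here is a small Python sketch that compares the index pattern of the vector reads for a dense row with that of a sparse row. The helper `mean_gather_stride` and the example column indices are hypothetical. It only looks at the index pattern, so it says nothing about actual cache behaviour or timings on either device.

```python
def mean_gather_stride(indices):
    """Average absolute jump between consecutive read indices into x."""
    if len(indices) < 2:
        return 0.0
    jumps = [abs(b - a) for a, b in zip(indices, indices[1:])]
    return sum(jumps) / len(jumps)


# A dense row reads x[0], x[1], ..., x[7] in order.
dense_row_reads = list(range(8))

# Hypothetical column indices stored for one sparse row in CRS format.
sparse_row_reads = [0, 8, 40, 48]

dense_stride = mean_gather_stride(dense_row_reads)    # 1.0
sparse_stride = mean_gather_stride(sparse_row_reads)  # 16.0
```

The dense row walks through the vector one element at a time, while the sparse row jumps around according to the column indices, which is the access pattern I suspect might be hurting the GPU version.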
Any ideas/opinions?
Once again, thank you.