Archives Discussions

dinaharchery · ‎10-05-2009

First, thanks to everyone on this forum. You guys have been a great help to someone like myself just learning.

I have a question which, I suppose, is kind of a theoretical one regarding data locality and access within kernels.

I have run two different Matrix-Vector Multiplication codes using the GPU as the device back-end.

The first code uses Compressed Row Storage (CRS) Format and "reshuffles" the data by padding the arrays with zeros. The padded arrays are then passed as Streams to the 'multiply' and 'reduce' kernels to get the solution. I compared the execution times with the standard CPU matrix-vector multiplication, and the GPU was slower than the CPU regardless of the size of the matrix/vector.
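For readers who have not seen the layout, here is a minimal sketch in plain Python (not Brook+) of CRS storage, the gather-based product, and a zero-padded 'multiply' then 'reduce' variant. The function names and the choice to pad every row to the widest row are assumptions for illustration; the actual padding scheme in the code described above may differ.

```python
def crs_from_dense(dense):
    """Build CRS arrays (values, column indices, row pointers) from a dense matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in values/col_idx
    return values, col_idx, row_ptr


def crs_matvec(values, col_idx, row_ptr, x):
    """Reference y = A*x on CRS data; x[col_idx[k]] is the gather."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def pad_rows(values, col_idx, row_ptr):
    """Pad every row with zeros up to the widest row (assumed padding scheme)."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    padded_vals, padded_cols = [], []
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        fill = width - (end - start)
        padded_vals += values[start:end] + [0] * fill
        padded_cols += col_idx[start:end] + [0] * fill  # column 0 is a dummy index
    return padded_vals, padded_cols, width


def padded_matvec(padded_vals, padded_cols, width, n_rows, x):
    """'multiply' step (elementwise, with gather) followed by a per-row 'reduce'."""
    products = [v * x[c] for v, c in zip(padded_vals, padded_cols)]
    return [sum(products[r * width:(r + 1) * width]) for r in range(n_rows)]


dense = [[4, 0, 0, 1],
         [0, 3, 0, 0],
         [0, 0, 0, 0],
         [2, 0, 5, 6]]
x = [1, 2, 3, 4]
values, col_idx, row_ptr = crs_from_dense(dense)
padded_vals, padded_cols, width = pad_rows(values, col_idx, row_ptr)
y_crs = crs_matvec(values, col_idx, row_ptr, x)
y_padded = padded_matvec(padded_vals, padded_cols, width, len(dense), x)
```

In this small example the matrix has 6 non-zeros, but the padded arrays hold 12 entries, so half of the fetched values contribute nothing to the result.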

The other code does NOT use Compressed Row Storage (CRS) Format; the data is just passed straight to the GPU Kernel (i.e., the "simple_matmult" example code). This time the comparison behaves as expected: the GPU is slower for smaller matrices/vectors but beats the CPU as the size increases.

I have not found much information about this behaviour, but I think it could be due to a couple of things. The first is the Brook+ code itself: perhaps the gather and reduce operations have a lot of hidden overhead. The other is that the data passed to the Kernels via the Compressed Row Storage Format may be too non-contiguous. Is this a data locality issue, where the cache on the CPU has an advantage?

Any ideas/opinions?

Once again, thank you.

riza_guntur · ‎10-06-2009

I have one thought in mind: Unnecessary data fetch

Have you checked it in SKA? I'm pretty sure it has a bad ALU:Fetch ratio, even worse than simple_matmult.

dinaharchery · ‎10-07-2009

I took your advice and used SKA, and you are correct: it appears to be the unnecessary data fetch. For example, on one set of data I got an ALU:Fetch ratio of 36.71 for "simple_matmult" and 0.33 for the kernel using the gather operation.

Do you know how data from sparse sets (such as in sparse matrix-vector multiplication) is typically handled on the GPU?

I am using the ATI Mobility HD Radeon 4530/4570, and I don't believe it has any form of caching that would alleviate this problem. Is that right?

Thanks.

riza_guntur · ‎10-07-2009

Haven't you tried the samples? The GPU is slow for this kind of operation, and even the sample performs poorly. Low data reuse is bad for the GPU; in other words, there is a memory bottleneck.

ATI GPUs are good at operations with a high ratio of data reuse; otherwise we may see only a small performance increase.
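To make the data-reuse point concrete, here is a rough back-of-the-envelope sketch in Python. It counts arithmetic operations per matrix/vector element touched, assuming each element is fetched exactly once (ideal caching). This is an illustrative measure only, not the ALU:Fetch ratio that SKA reports, and the function names are made up for this example.

```python
def matvec_reuse(n):
    """Flops per element touched for a dense n x n matrix-vector product."""
    flops = 2 * n * n            # one multiply and one add per matrix entry
    elements = n * n + n + n     # A, x and y
    return flops / elements


def matmul_reuse(n):
    """Flops per element touched for a dense n x n matrix-matrix product."""
    flops = 2 * n ** 3           # one multiply and one add per (i, j, k) triple
    elements = 3 * n * n         # A, B and C
    return flops / elements


for n in (64, 256, 1024):
    print(n, round(matvec_reuse(n), 2), round(matmul_reuse(n), 2))
```

Under these assumptions the matrix-vector product never reaches 2 operations per element, no matter how large n gets, while the matrix-matrix product grows as 2n/3. A sparse matrix-vector product has even less to work with, since it also has to fetch a column index for every non-zero.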

Theoretical Question - contiguous data problem?