
Compute GFlops for MatrixVector GPU
gaurav.garg Dec 22, 2009 4:39 AM (in response to dinaharchery): streamRead and kernel invocations are asynchronous; you can wait for them using the Stream::finish() method. You should do something like this:
// If any stream passed to the kernel was filled by streamRead,
// wait for the reads to finish first
S_m1.finish();
S_m2.finish();
Setup(0);
// Start GPU timer
Start(0);
// Kernel call: matrix-vector multiplication
simpleMatmult(m, S_m1, S_m2, S_realresult);
// Wait for the kernel to finish by calling finish() on any output stream
S_realresult.finish();
// Stop GPU timer
Stop(0);
gpuTime = GetElapsedTime(0);
double gflop = (double)(2.0*n*m*m)/(double)(1024 * 1024 * 1024);
printf("Total GFlops = %f\n", gflop/gpuTime);

Compute GFlops for MatrixVector GPU
dinaharchery Dec 22, 2009 11:15 AM (in response to gaurav.garg): Thank you for the reply.
Other than the synchronization issue, have I correctly computed the GFlops for my machine?
I know that Brook+ has a limit of 4096 elements for 1D streams and 4096x4096 for 2D streams, and I am testing MatrixVector multiplication on the ATI (with a square matrix and a column vector).
Before the dimension reaches 4096 (i.e., the column vector is 4096 and the matrix is 4096x4096), the performance is worse than the CPU's. However, at 4096 (and beyond) the GPU outperforms the CPU, significantly I might add. This is in stark contrast to MatrixMatrix (square) multiplication on the GPU, where the GPU begins to outperform the CPU somewhere between matrix dimensions of 128x128 and 256x256. I believe this is because GPUs are optimized for 2D matrices, and also perhaps because the number of computations at these dimensions increases sufficiently?
I would like to narrow down the cause of this but have been unable to locate exact information on it. Do you know of any?
Once again, thank you.

Compute GFlops for MatrixVector GPU
rahulgarg Dec 22, 2009 6:35 PM (in response to dinaharchery): If you are doing a matrix-vector multiplication, then isn't the number of flops 2*n*m (assuming an n*m matrix and an m*1 vector) rather than 2*n*m*m?

Compute GFlops for MatrixVector GPU
dinaharchery Dec 23, 2009 9:45 AM (in response to rahulgarg): You are correct. Thank you.

Compute GFlops for MatrixVector GPU
dinaharchery Dec 23, 2009 10:31 AM (in response to rahulgarg): Can someone explain why, given the above GPU hardware, I get such a dramatic shift in performance between matrix dimensions of 8192x8192 and 8193x8193?
For 8192x8192:
GPU Time = 0.172041, CPU Time = 0.354086, Speedup = 2.05815, GFLOPs = 0.72657
For 8193x8193:
GPU Time = 2.32571e-005, CPU Time = 0.299523, Speedup = 12878.7, GFLOPs = 5376
The GPU's performance continues to increase until the matrix reaches 15930x15930, where the system crashes. I thought that perhaps after 8192 the GPU has enough input data to keep the ALUs busy?
Thanks.

Compute GFlops for MatrixVector GPU
empty_knapsack Dec 23, 2009 10:45 AM (in response to dinaharchery): More likely, 8192x8192 is the maximum for the GPU; beyond that limit the GPU routines don't start at all, but the errors are simply not handled properly.

Compute GFlops for MatrixVector GPU
dinaharchery Dec 23, 2009 1:34 PM (in response to empty_knapsack): I believe you are correct. I ran simple_mat_mult over a defined number of iterations (e.g., 1000), adjusted the timing accordingly, and got a more realistic GFLOPS figure as well as overall timing.
Is this due to a caching effect on the GPU? I know the GPU does not have an elaborate memory hierarchy like the CPU, but it does have on-chip memory, registers, etc.
Maybe the first time a kernel is executed the data is placed in texture memory and reused afterwards, so averaging several executions may give a more accurate performance figure?
