7 Replies Latest reply on Dec 23, 2009 1:34 PM by dinaharchery

    Compute GFlops for Matrix-Vector GPU

    dinaharchery
      Compare CPU-GPU GFLops

      Hello All,

      I am studying the performance of the GPU against the CPU for Matrix-Matrix/Vector multiplication (no compression format) and am getting some very LARGE GFlops numbers for the GPU. I must be computing the GFlops incorrectly, because I don't believe I should be getting upwards of 2470 GFlops for a simple Matrix-Vector multiplication.

      I am using a GPU with the following hardware design:

      Graphics Card Manufacturer Powered by ATI 
      Graphics Chipset  ATI MOBILITY RADEON HD 4530 / 4570 
      Device ID   9553 
      Vendor    1002  
      Subsystem ID   02BE 
      Subsystem Vendor ID  1028  
      Graphics Bus Capability  PCI Express 2.0 
      Maximum Bus Setting  PCI Express 2.0 x16  
      BIOS Version   011.021.000.007 
      BIOS Part Number  BR32787-001 
      BIOS Date   2009/04/17 
      Memory Size   2045 MB 
      Memory Type   HyperMemory  
      Core Clock in MHz  680 MHz 
      Memory Clock in MHz  800 MHz
      Number of Cores:  80 Unified

       

      The code I am using to compute the GFlops is attached; can anyone tell me what I am doing wrong?

      Setup(0);
      // Start GPU Timer:
      Start(0);
      // Kernel Call - Matrix-Vector Multiplication:
      simpleMatmult(m, S_m1, S_m2, S_realresult);
      // Stop GPU Timer:
      Stop(0);
      gpuTime = GetElapsedTime(0);

      double gflop = (double)(2.0*n*m*m)/(double)(1024 * 1024 * 1024);

      printf("Total GFlops = %f\n", gflop/gpuTime);


        • Compute GFlops for Matrix-Vector GPU
          gaurav.garg

          streamRead and kernel execution are asynchronous, and you can wait for them to complete using the Stream::finish() method. You should do something like this:

          // If any stream passed to the kernel was filled with streamRead,
          // wait for that streamRead to finish first:
          S_m1.finish();
          S_m2.finish();

          Setup(0);
          // Start GPU Timer:
          Start(0);
          // Kernel Call - Matrix-Vector Multiplication:
          simpleMatmult(m, S_m1, S_m2, S_realresult);
          // Wait for the kernel to finish by calling finish() on any output stream:
          S_realresult.finish();
          // Stop GPU Timer:
          Stop(0);
          gpuTime = GetElapsedTime(0);

          double gflop = (double)(2.0*n*m*m)/(double)(1024 * 1024 * 1024);

          printf("Total GFlops = %f\n", gflop/gpuTime);

            • Compute GFlops for Matrix-Vector GPU
              dinaharchery

              Thank you for the reply.

              Other than the synchronization issue, have I computed the GFlops correctly for my machine?

              I know that Brook+ has a limit of 4096 elements for 1D streams and 4096x4096 for 2D streams, and I am testing Matrix-Vector Multiplication on the ATI card (with a square matrix and a column vector).

              Before the dimension reaches 4096 (i.e., the column vector is 4096 and the matrix is 4096x4096), the GPU performs worse than the CPU. However, at 4096 (and beyond) the GPU outperforms the CPU - significantly, I might add. This is in stark contrast to square Matrix-Matrix multiplication, where the GPU begins to outperform the CPU somewhere between matrix dimensions of 128x128 and 256x256. I believe this is because GPUs are optimized for 2D data - and perhaps also because the amount of computation only becomes large enough at these dimensions to hide the overhead?
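              One way to narrow down the crossover point is to measure a plain CPU baseline over a range of dimensions and compare it size-by-size against the GPU timings. A rough single-threaded, unoptimized sketch (my own code, not from this thread; real comparisons should use an optimized BLAS):

              ```c
              #include <stdio.h>
              #include <stdlib.h>
              #include <time.h>

              /* Naive CPU matrix-vector multiply: returns achieved GFlops
               * for an m x m matrix of ones times a vector of ones. */
              double cpu_matvec_gflops(int m)
              {
                  double *A = malloc((size_t)m * m * sizeof(double));
                  double *x = malloc((size_t)m * sizeof(double));
                  double *y = malloc((size_t)m * sizeof(double));
                  for (size_t i = 0; i < (size_t)m * m; i++) A[i] = 1.0;
                  for (int i = 0; i < m; i++) x[i] = 1.0;

                  clock_t t0 = clock();
                  for (int i = 0; i < m; i++) {
                      double sum = 0.0;
                      for (int j = 0; j < m; j++)
                          sum += A[(size_t)i * m + j] * x[j];
                      y[i] = sum;
                  }
                  double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
                  if (secs <= 0.0) secs = 1e-9; /* guard against timer resolution */

                  /* Sanity check: every output element should equal m. */
                  double check = y[0];
                  free(A); free(x); free(y);
                  if (check != (double)m) return -1.0;
                  return 2.0 * (double)m * (double)m / 1.0e9 / secs;
              }

              int main(void)
              {
                  for (int m = 1024; m <= 4096; m *= 2)
                      printf("m=%4d: %.3f GFlops (CPU)\n", m, cpu_matvec_gflops(m));
                  return 0;
              }
              ```

              Plotting these next to the GPU numbers per size should show whether the late crossover comes from fixed per-call overhead (flat GPU curve at small sizes) or from the CPU simply being fast until the matrix stops fitting in cache.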

              I would like to narrow down the cause for this and have been unable to locate exact information on this. Do you know of any?

              Once again, thank you.