7 Replies Latest reply on Dec 23, 2009 1:34 PM by dinaharchery

    Compute GFlops for Matrix-Vector GPU

    dinaharchery
      Compare CPU-GPU GFLops

      Hello All,

      I am studying the performance of the GPU against the CPU for Matrix-Matrix/Vector multiplication (no compression format) and am getting some very LARGE GFlops numbers for the GPU. I must be computing the GFlops incorrectly, because I don't believe I should be getting upwards of 2470 GFlops for a simple Matrix-Vector multiplication.

      I am using a GPU with the following hardware design:

      Graphics Card Manufacturer Powered by ATI 
      Graphics Chipset  ATI MOBILITY RADEON HD 4530 / 4570 
      Device ID   9553 
      Vendor    1002  
      Subsystem ID   02BE 
      Subsystem Vendor ID  1028  
      Graphics Bus Capability  PCI Express 2.0 
      Maximum Bus Setting  PCI Express 2.0 x16  
      BIOS Version   011.021.000.007 
      BIOS Part Number  BR32787-001 
      BIOS Date   2009/04/17 
      Memory Size   2045 MB 
      Memory Type   HyperMemory  
      Core Clock in MHz  680 MHz 
      Memory Clock in MHz  800 MHz
      Number of Cores:  80 Unified

       

      The code I am using to compute the GFlops is attached; can anyone tell me what I am doing wrong?

      Setup(0);
      // Start GPU Timer:
      Start(0);
      // Kernel Call - Matrix-Vector Multiplication:
      simpleMatmult(m, S_m1, S_m2, S_realresult);
      // Stop GPU Timer:
      Stop(0);
      gpuTime = GetElapsedTime(0);

      double gflop = (double)(2.0*n*m*m)/(double)(1024 * 1024 * 1024);

      printf("Total GFlops = %f\n", gflop/gpuTime);


        • Compute GFlops for Matrix-Vector GPU
          gaurav.garg

          streamRead and kernel execution are asynchronous, and you can wait for them to complete using the Stream::finish() method. You should do something like this:

          // If any stream passed to the kernel was filled with streamRead,
          // wait for that streamRead to finish first:
          S_m1.finish();
          S_m2.finish();

          Setup(0);
          // Start GPU Timer:
          Start(0);
          // Kernel Call - Matrix-Vector Multiplication:
          simpleMatmult(m, S_m1, S_m2, S_realresult);
          // Wait for the kernel to finish by calling finish() on any output stream:
          S_realresult.finish();
          // Stop GPU Timer:
          Stop(0);
          gpuTime = GetElapsedTime(0);

          double gflop = (double)(2.0*n*m*m)/(double)(1024 * 1024 * 1024);

          printf("Total GFlops = %f\n", gflop/gpuTime);

            • Compute GFlops for Matrix-Vector GPU
              dinaharchery

              Thank you for the reply.

              Other than the synchronization issue, have I computed the GFlops correctly for my machine?

              I know that Brook+ has a limit of 4096 elements for 1D streams and 4096x4096 for 2D streams, and I am testing Matrix-Vector Multiplication on the ATI card (with a square matrix and a column vector).

              Before the dimension reaches 4096 (i.e., the column vector is 4096 and the matrix is 4096x4096), the GPU performs worse than the CPU. However, at 4096 (and beyond) the GPU outperforms the CPU - significantly, I might add. This is in stark contrast to square Matrix-Matrix multiplication, where the GPU begins to outperform the CPU somewhere between matrix dimensions of 128x128 and 256x256. I believe this is because GPUs are optimized for 2D data - and perhaps also because the amount of computation only becomes large enough at these dimensions to hide the overhead?
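              One way to narrow down the crossover point is to measure a plain CPU baseline over a range of dimensions and compare it size-by-size against the GPU timings. A rough single-threaded, unoptimized sketch (my own code, not from this thread; real comparisons should use an optimized BLAS):

              ```c
              #include <stdio.h>
              #include <stdlib.h>
              #include <time.h>

              /* Naive CPU matrix-vector multiply: returns achieved GFlops
               * for an m x m matrix of ones times a vector of ones. */
              double cpu_matvec_gflops(int m)
              {
                  double *A = malloc((size_t)m * m * sizeof(double));
                  double *x = malloc((size_t)m * sizeof(double));
                  double *y = malloc((size_t)m * sizeof(double));
                  for (size_t i = 0; i < (size_t)m * m; i++) A[i] = 1.0;
                  for (int i = 0; i < m; i++) x[i] = 1.0;

                  clock_t t0 = clock();
                  for (int i = 0; i < m; i++) {
                      double sum = 0.0;
                      for (int j = 0; j < m; j++)
                          sum += A[(size_t)i * m + j] * x[j];
                      y[i] = sum;
                  }
                  double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
                  if (secs <= 0.0) secs = 1e-9; /* guard against timer resolution */

                  /* Sanity check: every output element should equal m. */
                  double check = y[0];
                  free(A); free(x); free(y);
                  if (check != (double)m) return -1.0;
                  return 2.0 * (double)m * (double)m / 1.0e9 / secs;
              }

              int main(void)
              {
                  for (int m = 1024; m <= 4096; m *= 2)
                      printf("m=%4d: %.3f GFlops (CPU)\n", m, cpu_matvec_gflops(m));
                  return 0;
              }
              ```

              Plotting these next to the GPU numbers per size should show whether the late crossover comes from fixed per-call overhead (flat GPU curve at small sizes) or from the CPU simply being fast until the matrix stops fitting in cache.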

              I would like to narrow down the cause for this and have been unable to locate exact information on this. Do you know of any?

              Once again, thank you.