
    GFLOPS calculation and relationship with input size


      I have two questions.

      1. I am wondering why the SDK sample includes streamRead and streamWrite time when calculating GFLOPS. I admit it is one measure of GPU performance, but it has nothing to do with GPU computing power. I believe most people don't include data transfer time when reporting GFLOPS (including many research papers using CUDA).

      2. Taking the simple_matmult example in the Brook+ SDK as an example, its GFLOPS increases with input size up to a point and then stays almost flat (saturation). I am wondering why that is. Below is my data with a 3870 X2; as you can see, GFLOPS saturates at around 2048*2048. Any thoughts are more than welcome. I don't think it is due to a lack of threads.

      (matrix size)   (GFLOPS)

      128*128    0.13293
      256*256    0.8292
      384*384    1.90313
      512*512    2.73721
      640*640    3.24699
      768*768    4.20032
      896*896    5.58104
      1024*1024    6.62299
      1152*1152    7.3679
      1280*1280    7.85343
      1408*1408    8.17544
      1536*1536    7.50756
      1664*1664    8.01407
      1792*1792    8.36521
      1920*1920    8.77919
      2048*2048    8.94158
      2176*2176    8.93896
      2304*2304    8.80759
      2432*2432    9.00353
      2560*2560    9.18522
      2688*2688    9.11353
      2816*2816    9.23887
      2944*2944    9.18581
      3072*3072    9.09957
      3200*3200    9.20422
      3328*3328    9.20185
      3456*3456    9.30503
      3584*3584    9.29831
      3712*3712    9.28772
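
      For reference, the usual convention for an N*N matrix multiply is 2*N^3 floating-point operations (one multiply and one add per inner-product step); whether the SDK sample uses exactly that count is an assumption here. A minimal sketch in C:

          #include <stdio.h>

          /* GFLOPS for an NxN matrix multiply, assuming the conventional
           * 2*N^3 flop count (N^3 multiplies plus N^3 adds). */
          double matmul_gflops(double n, double seconds)
          {
              return 2.0 * n * n * n / seconds / 1.0e9;
          }

          int main(void)
          {
              /* Hypothetical time: 2048*2048 in 1.921 s works out to
               * ~8.94 GFLOPS, matching the saturated figure above. */
              printf("%.2f GFLOPS\n", matmul_gflops(2048.0, 1.921));
              return 0;
          }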

        • GFLOPS calculation and relationship with input size
          1. Just my thoughts: most papers don't explain how they arrived at their GFLOPS numbers, which is, IMO, a HUGE problem in the academic community. Many people take traditional algorithms and tweak and tweak them for maximum performance, but the program they end up with often does not give "real world" results. This is not an issue for simple algorithms, but it can be for more complex ones. IMO, it is not worth counting the transfer time IF the transfers happen outside of the computation.

          2. The only thing I can think of is that the card is being saturated, that is, you are getting 100% occupancy: all of the stream processors are busy doing calculations.
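
          One way to see the scale of that point (assuming, as in stream programming generally, one kernel invocation per output element): the amount of parallel work grows as N^2, so small matrices may simply not offer enough work items to fill the card. A minimal sketch in C:

              #include <stdio.h>

              /* One kernel invocation per output element: an NxN result
               * stream yields N*N parallel work items. Small inputs may not
               * expose enough parallelism to fill the GPU. */
              int main(void)
              {
                  unsigned sizes[] = { 128, 512, 2048 };
                  for (int i = 0; i < 3; ++i)
                      printf("%u*%u -> %u work items\n",
                             sizes[i], sizes[i], sizes[i] * sizes[i]);
                  return 0;
              }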
            • GFLOPS calculation and relationship with input size

              If I measure the time for only the kernel call (I mean after streamRead() and before streamWrite()), I easily get more than 3 TFLOPS on an HD 3870 for large input matrices in the simple_matmult SDK example, which can't be right. Is this because of the asynchronous operation between CPU and GPU? If so, how can I measure pure kernel time (and GFLOPS), excluding transfer time?

                • GFLOPS calculation and relationship with input size

                  streamRead and kernels are asynchronous calls -- they return immediately without waiting for the data transfer or the kernel execution to complete. streamWrite waits for the data transfer and the previous streamReads and kernels to complete.

                  So you would do a streamWrite on a small stream, say 1 element wide, and put the timer after that. The sequence is:

                  1. start the timer,
                  2. call the kernel many times, say in a loop of 100 or 1000 iterations,
                  3. streamWrite the 1-element stream (this forces all the queued work to finish),
                  4. stop the timer.
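
                  A sketch of that sequence in Brook+-style code (not a complete program, just showing where the calls go). The stream declarations and the simple_matmult signature are assumed from the SDK sample; timer_now(), the CPU-side arrays a and b, and the size N are hypothetical stand-ins:

                      float A<N, N>;        // input streams (brcc .br stream syntax)
                      float B<N, N>;
                      float C<N, N>;        // result stream
                      float sink<1, 1>;     // 1-element stream used only to force a sync
                      float sinkCPU[1];

                      streamRead(A, a);     // asynchronous: returns immediately
                      streamRead(B, b);

                      double t0 = timer_now();                  // 1. start the timer
                      for (int i = 0; i < 100; ++i)             // 2. call the kernel many times
                          simple_matmult((float)N, A, B, C);
                      streamWrite(sink, sinkCPU);               // 3. the 1-element streamWrite waits
                                                                //    for all previous kernels to finish
                      double elapsed = timer_now() - t0;        // 4. stop the timer

                      // average GFLOPS per call, assuming the 2*N^3 convention
                      double n = (double)N;
                      double gflops = 100.0 * 2.0 * n * n * n / elapsed / 1.0e9;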

                  The samples in the SDK should have examples that use timers properly.