7 Replies Latest reply on Jun 20, 2011 5:03 AM by rick.weber

    How to measure GFLOPS?

    notyou

       

      I didn't get a response in one of the other sub-forums, so I figured I should post here, since my work also relates to OpenCL.

       

      I've been looking through research papers to make a comparison between a number of architectures including CPUs, the Cell BE and GPUs and I see GFLOPS being used as a unit of measurement but it is never stated exactly how they get their measurements.

      Is it as simple as looping the program to execute for roughly a second (or more for greater accuracy I assume) and counting the number of operations (+, -, /, *, etc.) in the kernel?

      If so, why do I see numbers such as the average GFLOPS and peak GFLOPS? Continuing, how are these numbers determined?

      And, assuming it's not answered by the time you read this, how would I go about measuring GFLOPS for my own CPU and GPU for comparison (using a specific algorithm such as Mersenne Twister, for example)?

      Thanks.

      -Matt



        • How to measure GFLOPS?
          maximmoroz

          Peak GFLOPS is usually understood as the maximum number of floating point operations the hardware is able to perform assuming that all the data for these operations are directly available.

          For example, for the AMD R6950 reference card: 800 MHz * 22 compute units * 16 stream cores in each compute unit * 4 processing elements in each stream core * 2 single-precision flops per cycle (MULADD) = 2252.8 GFLOPS.
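The arithmetic above can be written out as a small helper. This is only a sketch: the function name `peak_gflops` and its parameter names are made up for illustration, and the hardware figures are the ones quoted in the post.

```python
def peak_gflops(clock_mhz, compute_units, stream_cores_per_unit,
                elements_per_core, flops_per_element_per_cycle):
    """Theoretical peak in GFLOPS: clock rate times flops issued per cycle."""
    flops_per_cycle = (compute_units * stream_cores_per_unit
                       * elements_per_core * flops_per_element_per_cycle)
    # MHz -> Hz, then flops/s -> GFLOPS
    return clock_mhz * 1e6 * flops_per_cycle / 1e9


# Figures from the post: 800 MHz, 22 units, 16 stream cores, 4 elements, 2 flops (MULADD)
print(f"{peak_gflops(800, 22, 16, 4, 2):.1f} GFLOPS")  # 2252.8 GFLOPS
```

The same helper gives the peak for an overclocked card by changing only `clock_mhz`.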

          Obviously, in the real world the data needs to be copied closer to the ALUs (host -> global buffer -> local buffer -> register; this consumes processor ticks, and the latency also has to be hidden), and part of the ALU horsepower has to be spent on index calculations, loops, conditions, etc.

          Thus, while the peak GFLOPS might be huge, one should consider the real performance of the real kernel, taking into account that this specific kernel might accidentally (or intentionally) be optimized for the specific hardware.

          For example, I have a kernel which is able to perform multi-convolutions (each output image is a product of convolutions applied to a number of input images). First, I calculate all the addition and multiplication operations required to do the convolution. I count not the operations I encounter in the kernel but the operations in the algorithm. Then I determine how much time it takes to run the kernel. Then I divide the first number by the second. I got about 400-450 GFLOPS, which is 16%-18% of peak GFLOPS for an R6950 running at 900 MHz.
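As a rough sketch of that division, assuming the operation count comes from the algorithm and the time from a kernel timer: the helper names `achieved_gflops` and `percent_of_peak` are hypothetical, and the 4.0e9 flops / 10 ms pair is an invented example, not a measurement from the post.

```python
def achieved_gflops(algorithm_flops, elapsed_seconds):
    """Operations required by the algorithm divided by the measured run time."""
    return algorithm_flops / elapsed_seconds / 1e9


def percent_of_peak(measured_gflops, peak_gflops):
    """Share of the theoretical peak that the kernel actually reached."""
    return 100.0 * measured_gflops / peak_gflops


# Peak for the card in the post at 900 MHz: 22 * 16 * 4 * 2 flops per cycle
PEAK_900MHZ = 900e6 * 22 * 16 * 4 * 2 / 1e9  # 2534.4 GFLOPS

# Invented example: an algorithm needing 4.0e9 flops that ran in 10 ms
print(f"{achieved_gflops(4.0e9, 0.010):.0f} GFLOPS achieved")

# The 400-450 GFLOPS range from the post lands at roughly 16% to 18% of peak
for measured in (400.0, 450.0):
    print(f"{measured:.0f} GFLOPS = {percent_of_peak(measured, PEAK_900MHZ):.1f}% of peak")
```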

            • How to measure GFLOPS?
              notyou

               

              Originally posted by: maximmoroz 

              For example, for the AMD R6950 reference card: 800 MHz * 22 compute units * 16 stream cores in each compute unit * 4 processing elements in each stream core * 2 single-precision flops per cycle (MULADD) = 2252.8 GFLOPS.

              This all makes sense.

               

              Originally posted by: maximmoroz 

              Obviously, in the real world the data needs to be copied closer to the ALUs (host -> global buffer -> local buffer -> register; this consumes processor ticks, and the latency also has to be hidden), and part of the ALU horsepower has to be spent on index calculations, loops, conditions, etc.

              Thus, while the peak GFLOPS might be huge, one should consider the real performance of the real kernel, taking into account that this specific kernel might accidentally (or intentionally) be optimized for the specific hardware.

              Again, this all makes sense, and the comparisons I'll be making will be based on the fact that the algorithm will be tuned for each architecture.

               

              Originally posted by: maximmoroz 

              For example, I have a kernel which is able to perform multi-convolutions (each output image is a product of convolutions applied to a number of input images). First, I calculate all the addition and multiplication operations required to do the convolution. I count not the operations I encounter in the kernel but the operations in the algorithm. Then I determine how much time it takes to run the kernel. Then I divide the first number by the second. I got about 400-450 GFLOPS, which is 16%-18% of peak GFLOPS for an R6950 running at 900 MHz.



              What do you mean when you say you don't count the number of operations in the kernel but instead the number in the algorithm? Do you mean counting both branches of branching code, or are you talking about something else?

                • How to measure GFLOPS?
                  maximmoroz

                   

                  Originally posted by: notyou

                  What do you mean when you say you don't count the number of operations in the kernel but instead the number in the algorithm? Do you mean counting both branches of branching code, or are you talking about something else?



                  I mean that I count the number of operations without looking at the kernel's source code. I might even do it before I start coding. I look at the algorithm and determine how many times I would need to press the "operation" (+, -, *, /) buttons on a simple calculator if I were to work through the whole algorithm by hand.
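To illustrate counting "calculator button presses" from the algorithm alone, here is a sketch for one direct 2D convolution with a single input image and a single filter. This is an assumed, simplified stand-in, not the multi-convolution kernel discussed in the thread; `conv2d_flops` and its parameters are hypothetical names.

```python
def conv2d_flops(out_width, out_height, kernel_width, kernel_height):
    """Flops for a direct 2D convolution, counted from the algorithm itself.

    Each output pixel needs one multiply per filter tap, plus the additions
    that sum those products (one fewer than the number of taps).
    """
    taps = kernel_width * kernel_height
    multiplies = taps
    additions = taps - 1
    return out_width * out_height * (multiplies + additions)


# A 3x3 filter producing one output pixel: 9 multiplies + 8 additions
print(conv2d_flops(1, 1, 3, 3))  # 17
```

Index arithmetic, loop counters and address calculations in the kernel do not appear in this count, which is why it can be worked out before any code exists.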

                    • How to measure GFLOPS?
                      notyou

                       

                      Originally posted by: maximmoroz

                      I mean that I count the number of operations without looking at the kernel's source code. I might even do it before I start coding. I look at the algorithm and determine how many times I would need to press the "operation" (+, -, *, /) buttons on a simple calculator if I were to work through the whole algorithm by hand.



                      That makes sense and is what I expected. One thing I'm still left wondering though, is that, in some applications, I see GFLOPS being measured and outputted during program execution. How is this being calculated (since, as you say, peak and average GFLOPS are basically determined by the algorithm and hardware)? Are they just calculating execution time and mapping that to a particular value of GFLOPS (as a ratio to average/peak GFLOPS)? Or is there some other magic going on?

                      Sorry for being such a pain. And thanks for your help.

                      -Matt

                        • How to measure GFLOPS?
                          rick.weber

                          There's no magic. In DGEMM for example, the number of FLOPs needed to compute the answer is 2 * m * n * k, where m, n, and k are the matrix dimensions. So, the execution rate is just (2 * m * n * k) / time.
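A minimal sketch of that formula in use, assuming a naive pure-Python matrix multiply as a stand-in for a real DGEMM (so the absolute number it reports will be tiny). The names `matmul` and `gemm_gflops` are hypothetical.

```python
import time


def matmul(a, b, m, n, k):
    """Naive C = A * B, with A of size m x k and B of size k x n (lists of rows)."""
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # one multiply and one add per step
            c[i][j] = acc
    return c


def gemm_gflops(m, n, k, elapsed_seconds):
    """Execution rate from the algorithmic count 2 * m * n * k."""
    return 2.0 * m * n * k / elapsed_seconds / 1e9


m, n, k = 64, 48, 32
a = [[1.0] * k for _ in range(m)]
b = [[2.0] * n for _ in range(k)]

start = time.perf_counter()
c = matmul(a, b, m, n, k)
elapsed = time.perf_counter() - start

print(f"{gemm_gflops(m, n, k, elapsed):.4f} GFLOPS")
```

The flop count depends only on m, n and k, so the program can print a GFLOPS figure at run time without inspecting what the kernel actually executed.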

                          • How to measure GFLOPS?
                            maximmoroz

                             

                            Originally posted by: notyou  One thing I'm still left wondering though, is that, in some applications, I see GFLOPS being measured and outputted during program execution. How is this being calculated (since, as you say, peak and average GFLOPS are basically determined by the algorithm and hardware)? Are they just calculating execution time and mapping that to a particular value of GFLOPS (as a ratio to average/peak GFLOPS)? Or is there some other magic going on?

                            I guess they do it in the most natural way. The program knows how many operations (from the algorithm) it performed during a specific task. It knows the time it took to run this task. Then the program does a single flop: it divides the first number by the second one :)

                              • How to measure GFLOPS?
                                rick.weber

                                For some tasks, you can't really measure the performance without changing it significantly. Some applications have data-dependent performance in that based on what the values in the input are, they have to do more or less computation. Measuring throughput in this case is quite a bit nastier as you need to keep a counter of how many blahs you evaluate in the kernel and divide that by time. You then have to rerun on the same dataset without the counting code to make sure it doesn't add significant overhead (e.g. because it's in an inner loop or something).
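One way to sketch this two-run approach, using a Newton square-root iteration as an assumed example of data-dependent work (it is not from the thread, and all function names are hypothetical): count the steps in an instrumented run, then time an uninstrumented run on the same data and compare the two.

```python
import time


def newton_sqrt(x, tol=1e-12):
    """Uninstrumented version; the number of iterations depends on x."""
    guess = max(x, 1.0)
    while abs(guess * guess - x) > tol * x:
        guess = 0.5 * (guess + x / guess)
    return guess


def newton_sqrt_counted(x, tol=1e-12):
    """Same computation with a step counter added inside the inner loop."""
    guess = max(x, 1.0)
    steps = 0
    while abs(guess * guess - x) > tol * x:
        guess = 0.5 * (guess + x / guess)
        steps += 1
    return guess, steps


def timed(fn, data):
    """Run fn over data and return the results with the elapsed wall time."""
    start = time.perf_counter()
    results = [fn(x) for x in data]
    return results, time.perf_counter() - start


data = [2.0, 10.0, 12345.678, 1.0e6] * 2000

counted, t_counted = timed(newton_sqrt_counted, data)
total_steps = sum(steps for _, steps in counted)
_, t_plain = timed(newton_sqrt, data)  # rerun on the same data without the counter

print(f"{total_steps} steps, {total_steps / t_plain:.3e} steps/s (uninstrumented time)")
print(f"counting overhead: {t_counted / t_plain:.2f}x")
```

If the overhead ratio is close to 1, the count from the instrumented run can be divided by either time; if not, only the uninstrumented time gives a fair rate.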