
How to measure GFLOPS?
maximmoroz Jun 14, 2011 5:23 AM (in response to notyou)
Peak GFLOPS is usually understood as the maximum number of floating-point operations per second the hardware is able to perform, assuming that all the data for these operations is directly available.
For example, for a reference AMD R6950 card: 800 MHz * 22 compute units * 16 stream cores per compute unit * 4 processing elements per stream core * 2 single-precision FLOPs (multiply-add) = 2252.8 GFLOPS.
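In code form, the peak figure above is just arithmetic (a minimal sketch; the variable names are mine):

```python
# Peak single-precision GFLOPS from the reference R6950 specs quoted above.
clock_ghz = 0.8            # 800 MHz core clock
compute_units = 22
stream_cores = 16          # per compute unit
processing_elements = 4    # per stream core (VLIW lanes)
flops_per_element = 2      # a multiply-add counts as 2 FLOPs

peak_gflops = (clock_ghz * compute_units * stream_cores
               * processing_elements * flops_per_element)
print(peak_gflops)  # 2252.8
```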
Obviously, in the real world the data needs to be copied closer to the ALUs (host -> global buffer -> local buffer -> register), which consumes processor cycles (and the latency should also be hidden), and part of the ALU horsepower has to be spent on index calculations, loops, conditions, etc.
Thus, while the peak GFLOPS figure might be huge, one should consider the real performance of a real kernel, keeping in mind that a specific kernel might accidentally (or intentionally) be optimized for specific hardware.
For example, I have a kernel which performs multi-convolutions (each output image is the result of convolutions applied to a number of input images). First, I count all the addition and multiplication operations required to do the convolution. I count not the operations I encounter in the kernel but the operations in the algorithm. Then I measure how long the kernel takes to run. Then I divide the first number by the second. I get about 400-450 GFLOPS, which is 16%-18% of peak GFLOPS for an R6950 running at 900 MHz.
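A sketch of that counting approach, assuming each output image is a sum of per-input-image convolutions (the function name, sizes, and the timing value are all made up for illustration):

```python
# Count the algorithm's FLOPs for a multi-convolution, independent of
# how any particular kernel implements it, then divide by measured time.
def conv_flops(out_h, out_w, k_h, k_w, in_images, out_images):
    # Each output pixel needs k_h*k_w multiplies and k_h*k_w - 1 adds
    # per input image, plus (in_images - 1) adds to combine the results.
    per_pixel = in_images * (2 * k_h * k_w - 1) + (in_images - 1)
    return out_images * out_h * out_w * per_pixel

flops = conv_flops(out_h=256, out_w=256, k_h=5, k_w=5,
                   in_images=8, out_images=32)
elapsed_s = 0.01  # placeholder: measured kernel execution time in seconds
print(flops / elapsed_s / 1e9, "GFLOPS")
```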

How to measure GFLOPS?
notyou Jun 16, 2011 6:19 PM (in response to maximmoroz)
Originally posted by: maximmoroz
For example, for a reference AMD R6950 card: 800 MHz * 22 compute units * 16 stream cores per compute unit * 4 processing elements per stream core * 2 single-precision FLOPs (multiply-add) = 2252.8 GFLOPS.
This all makes sense.
Originally posted by: maximmoroz
Obviously, in the real world the data needs to be copied closer to the ALUs (host -> global buffer -> local buffer -> register), which consumes processor cycles (and the latency should also be hidden), and part of the ALU horsepower has to be spent on index calculations, loops, conditions, etc.
Thus, while the peak GFLOPS figure might be huge, one should consider the real performance of a real kernel, keeping in mind that a specific kernel might accidentally (or intentionally) be optimized for specific hardware.
Again, this all makes sense, and the comparisons I'll be making will be based on the fact that the algorithm will be tuned for each architecture.
Originally posted by: maximmoroz
For example, I have a kernel which performs multi-convolutions (each output image is the result of convolutions applied to a number of input images). First, I count all the addition and multiplication operations required to do the convolution. I count not the operations I encounter in the kernel but the operations in the algorithm. Then I measure how long the kernel takes to run. Then I divide the first number by the second. I get about 400-450 GFLOPS, which is 16%-18% of peak GFLOPS for an R6950 running at 900 MHz.
What do you mean when you say you don't count the number of operations in the kernel but instead the number in the algorithm? Do you mean counting both branches of branching code, or are you talking about something else?

How to measure GFLOPS?
maximmoroz Jun 17, 2011 1:38 AM (in response to notyou)
Originally posted by: notyou
What do you mean when you say you don't count the number of operations in the kernel but instead the number in the algorithm? Do you mean counting both branches of branching code, or are you talking about something else?
I mean that I count the number of operations without looking at the kernel's source code. I might even do it before I start coding. I look at the algorithm and determine how many times I would need to press the operation buttons (+, -, *, /) on a simple calculator if I were to work through the whole algorithm by hand.

How to measure GFLOPS?
notyou Jun 20, 2011 1:05 AM (in response to maximmoroz)
Originally posted by: maximmoroz
I mean that I count the number of operations without looking at the kernel's source code. I might even do it before I start coding. I look at the algorithm and determine how many times I would need to press the operation buttons (+, -, *, /) on a simple calculator if I were to work through the whole algorithm by hand.
That makes sense and is what I expected. One thing I'm still left wondering about, though: in some applications I see GFLOPS being measured and output during program execution. How is this calculated (since, as you say, peak and achieved GFLOPS are basically determined by the algorithm and hardware)? Are they just measuring execution time and mapping that to a particular GFLOPS value (as a ratio of average/peak GFLOPS)? Or is there some other magic going on?
Sorry for being such a pain. And thanks for your help.
Matt

How to measure GFLOPS?
rick.weber Jun 20, 2011 2:08 AM (in response to notyou)
There's no magic. In DGEMM, for example, the number of FLOPs needed to compute the answer is 2 * m * n * k, where m, n, and k are the matrix dimensions. So the execution rate is just (2 * m * n * k) / time.
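A toy version of that measurement, using a plain triple-loop multiply as a stand-in for a tuned DGEMM (sizes kept tiny so it runs quickly; a real measurement would time an optimized BLAS or GPU kernel instead):

```python
import time
import random

m = n = k = 64
a = [[random.random() for _ in range(k)] for _ in range(m)]
b = [[random.random() for _ in range(n)] for _ in range(k)]
c = [[0.0] * n for _ in range(m)]

start = time.perf_counter()
for i in range(m):
    for j in range(n):
        s = 0.0
        for p in range(k):
            s += a[i][p] * b[p][j]  # one multiply + one add per term
        c[i][j] = s
elapsed = time.perf_counter() - start

flops = 2.0 * m * n * k  # FLOPs required by the algorithm, not the code
print(flops / elapsed / 1e9, "GFLOPS")
```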

How to measure GFLOPS?
maximmoroz Jun 20, 2011 3:47 AM (in response to notyou)
Originally posted by: notyou
One thing I'm still left wondering about, though: in some applications I see GFLOPS being measured and output during program execution. How is this calculated (since, as you say, peak and achieved GFLOPS are basically determined by the algorithm and hardware)? Are they just measuring execution time and mapping that to a particular GFLOPS value (as a ratio of average/peak GFLOPS)? Or is there some other magic going on?
I guess they do it in the most natural way. The program knows how many operations (from the algorithm) it performed during a specific task, and it knows the time it took to run that task. Then the program performs a single FLOP itself: it divides the first number by the second one :)

How to measure GFLOPS?
rick.weber Jun 20, 2011 5:03 AM (in response to maximmoroz)
For some tasks, you can't really measure the performance without changing it significantly. Some applications have data-dependent performance: based on what the input values are, they have to do more or less computation. Measuring throughput in this case is quite a bit nastier, as you need to keep a counter of how many operations you evaluate in the kernel and divide that by time. You then have to re-run on the same dataset without the counting code to make sure it doesn't add significant overhead (e.g. because it's in an inner loop or something).
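A minimal sketch of that counter idea, using an iterative scalar routine (Newton's method for square roots, purely illustrative) whose operation count depends on the input value:

```python
import time

def sqrt_newton(x, tol=1e-12):
    """Return (sqrt estimate, FLOPs performed); iteration count varies with x."""
    flops = 0
    g = x
    while abs(g * g - x) > tol * x:  # relative convergence test
        g = 0.5 * (g + x / g)        # 1 add, 1 mul, 1 div per step
        flops += 3
    return g, flops

total_flops = 0
start = time.perf_counter()
for v in range(1, 10001):
    _, f = sqrt_newton(float(v))
    total_flops += f                 # counter accumulated at run time
elapsed = time.perf_counter() - start
print(total_flops / elapsed / 1e9, "GFLOPS")
```

The counter updates themselves add overhead, which is exactly why a second, counter-free run on the same data is needed for an honest timing.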




