Since I didn't get a response in one of the other sub-forums, I figured I should post this here, since my work also relates to OpenCL.

I've been looking through research papers to compare a number of architectures, including CPUs, the Cell BE, and GPUs. I keep seeing GFLOPS used as a unit of measurement, but it is never stated exactly how those figures are obtained.

Is it as simple as looping the program so it executes for roughly a second (or longer for greater accuracy, I assume) and counting the number of operations (+, -, /, *, etc.) in the kernel?

If so, why do I also see figures such as average GFLOPS and peak GFLOPS, and how are those numbers determined?

And, assuming this hasn't been answered by the time you read it: how would I go about measuring GFLOPS for my own CPU and GPU for comparison, using a specific algorithm such as the Mersenne Twister?

Thanks.

-Matt

Peak GFLOPS is usually understood as the maximum number of floating-point operations per second the hardware can perform, assuming all the data for those operations is immediately available.

For example, for the reference AMD Radeon HD 6950: 800 MHz * 22 compute units * 16 stream cores per compute unit * 4 processing elements per stream core * 2 single-precision flops per cycle (multiply-add) = 2252.8 GFLOPS.
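The arithmetic above can be written out as a tiny sketch (the hardware figures are the ones from the example; only the multiplication is shown):

```python
# Peak single-precision GFLOPS for a reference AMD Radeon HD 6950.
clock_hz = 800e6            # core clock: 800 MHz
compute_units = 22          # compute units on the die
stream_cores = 16           # stream cores per compute unit
processing_elements = 4     # VLIW processing elements per stream core
flops_per_cycle = 2         # a multiply-add counts as 2 flops

peak_flops = (clock_hz * compute_units * stream_cores
              * processing_elements * flops_per_cycle)
print(peak_flops / 1e9)  # -> 2252.8
```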

Obviously, in the real world the data has to be moved closer to the ALUs (host -> global buffer -> local buffer -> registers), which consumes processor cycles (and whose latency must also be hidden), and part of the ALU horsepower has to be spent on index calculations, loops, conditionals, etc.

Thus, while the peak GFLOPS figure may be huge, one should consider the real performance of a real kernel, keeping in mind that a given kernel might accidentally (or intentionally) be optimized for specific hardware.

For example, I have a kernel that performs multi-convolutions (each output image is a product of convolutions applied to a number of input images). First, I calculate all the addition and multiplication operations required to do the convolution; I count not the operations that appear in the kernel but the operations in the algorithm. Then I measure how long the kernel takes to run, and divide the first number by the second. I get about 400-450 GFLOPS, which is 16-18% of peak GFLOPS for an R6950 running at 900 MHz.
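A minimal sketch of that accounting, assuming a single direct 2D convolution (this NumPy stand-in and its sizes are illustrative, not the multi-convolution kernel described above; for a GPU kernel you would instead time the OpenCL kernel itself, e.g. between clFinish() calls):

```python
import time
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Effective GFLOPS = (flops the algorithm requires) / (measured runtime).
H, W, K = 1024, 1024, 9                       # image and filter sizes (arbitrary)
image = np.random.rand(H, W).astype(np.float32)
filt = np.random.rand(K, K).astype(np.float32)

# Count the operations the *algorithm* requires, not what the code emits:
# each "valid" output pixel takes K*K multiplies and K*K - 1 additions.
out_h, out_w = H - K + 1, W - K + 1
flops = out_h * out_w * (2 * K * K - 1)

start = time.perf_counter()
windows = sliding_window_view(image, (K, K))   # (out_h, out_w, K, K) view
out = np.einsum('ijkl,kl->ij', windows, filt)  # sum of elementwise products
elapsed = time.perf_counter() - start          # (filter flip omitted; it
                                               #  doesn't change the count)
print(f"{flops / elapsed / 1e9:.2f} effective GFLOPS")
```

The key point is the division at the end: the numerator comes from the mathematical definition of the algorithm, so a clever implementation that skips or fuses operations still gets credited with the same work.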