In the APP guide on page 1-12, the following paragraph makes no sense:
"When using these interfaces, it is important to consider the amount of copying involved. There is a two-copy processes: between host and PCIe, and between PCIe and GPU compute device. This is why there is a large performance difference between the system GFLOPS and the kernel GFLOPS."
What does data transfer latency have to do with floating point performance?
Oh, so this basically means if I do
copy
kernel
copy
the time spent doing the whole thing is longer than doing just the kernel. I was thrown off by the use of the specific term "FLOPs" when memory transfers have nothing to do with floating point operations.
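To put rough numbers on the copy / kernel / copy sequence, here is a minimal sketch. Every value in it (the operation count and the three timings) is hypothetical and picked only to show the effect; the helper name `gflops` is mine, not something from the APP guide.

```python
# All numbers below are hypothetical, chosen only to illustrate the ratio.
FLOP_COUNT = 2e9          # floating point operations performed by the kernel
T_COPY_IN = 0.020         # seconds: host -> device transfer
T_KERNEL = 0.010          # seconds: kernel execution only
T_COPY_OUT = 0.020        # seconds: device -> host transfer


def gflops(flop_count, seconds):
    """Return the achieved rate in GFLOPS for work done over a time span."""
    return flop_count / seconds / 1e9


# Same amount of math in both cases; only the time window differs.
kernel_gflops = gflops(FLOP_COUNT, T_KERNEL)
system_gflops = gflops(FLOP_COUNT, T_COPY_IN + T_KERNEL + T_COPY_OUT)

print(f"kernel GFLOPS: {kernel_gflops:.1f}")
print(f"system GFLOPS: {system_gflops:.1f}")
```

With these made-up timings the kernel alone rates about 200 GFLOPS while the whole sequence rates about 40, even though the copies do no floating point work at all. They only add time to the denominator.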
Yes, it is confusing, but it is correct. Actual FLOPs (as opposed to theoretical FLOPs) is a measurement of math work completed per unit of time (expressed per second, but not necessarily measured over exactly one second). Unlike theoretical FLOPs, it is a snapshot, and depending on when you take that snapshot it can range from the theoretical maximum down to 0 FLOPs (e.g. when the measured period covers only a single memory read with no calculation). System FLOPs should be determined over a period of time that encompasses at least one full cycle of all the operations involved (copy, calculate, etc.), with the approximation getting better as T approaches infinity.
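Written out as a formula, this might look like the sketch below. The notation is my own assumption, not the guide's: W(t) is the cumulative number of floating point operations completed by time t, W_c is the work done in one copy / kernel / copy cycle, and the cycles are assumed to repeat back to back with identical timings.

```latex
% Measured rate over an arbitrary window [t_0, t_1]
\[
  \mathrm{FLOPS}(t_0, t_1) = \frac{W(t_1) - W(t_0)}{t_1 - t_0}
\]
% One full cycle: copy in, kernel, copy out
\[
  T_c = t_{\mathrm{in}} + t_{\mathrm{kernel}} + t_{\mathrm{out}}, \qquad
  \lim_{T \to \infty} \mathrm{FLOPS}(0, T) = \frac{W_c}{T_c}
  \le \frac{W_c}{t_{\mathrm{kernel}}}
\]
```

A window that falls entirely inside a copy gives 0, a window entirely inside the kernel gives the kernel rate, and a long enough window settles on the system rate W_c / T_c.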
What a load of pointless waffle!