In the APP guide on page 1-12, the following paragraph makes no sense:
"When using these interfaces, it is important to consider the amount of copying involved. There is a two-copy processes: between host and PCIe, and between PCIe and GPU compute device. This is why there is a large performance difference between the system GFLOPS and the kernel GFLOPS."
What does data transfer latency have to do with floating-point performance?
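The only reading I can come up with is that "kernel GFLOPS" divides the operation count by the kernel execution time alone, while "system GFLOPS" divides the same count by the total time including both copies. Below is a minimal Python sketch of that arithmetic. Every number in it (matrix size, kernel time, the two bandwidths) is a made-up assumption for illustration, not a figure from the guide, and the function names are my own.

```python
# Sketch: kernel GFLOPS vs. system GFLOPS under a two-copy transfer model.
# All timings and bandwidths are hypothetical, chosen only to show the effect.

def gflops(flop_count, seconds):
    """Floating-point operations per second, in units of 1e9."""
    return flop_count / seconds / 1e9

def copy_seconds(num_bytes, bytes_per_second):
    """Time to move num_bytes at a sustained bandwidth."""
    return num_bytes / bytes_per_second

N = 1024                          # hypothetical N x N single-precision matrix multiply
flop_count = 2 * N**3             # one multiply and one add per inner-loop step
bytes_moved = 3 * N * N * 4       # two input matrices plus one output, 4 bytes each

kernel_s = 0.005                  # assumed kernel execution time
host_to_pcie_s = copy_seconds(bytes_moved, 10e9)  # assumed host <-> PCIe buffer copy
pcie_to_gpu_s = copy_seconds(bytes_moved, 5e9)    # assumed PCIe <-> GPU memory copy

kernel_gflops = gflops(flop_count, kernel_s)
system_gflops = gflops(flop_count, kernel_s + host_to_pcie_s + pcie_to_gpu_s)

print(f"kernel GFLOPS: {kernel_gflops:.1f}")
print(f"system GFLOPS: {system_gflops:.1f}")
```

With these assumed numbers the copies add a large fraction of the kernel time, so the system figure comes out well below the kernel figure even though the GPU does exactly the same arithmetic. If that is what the guide means, the gap is about which time interval goes in the denominator, not about the floating-point hardware itself.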