It's much faster than an off-device copy, since it doesn't, you know, have to go off-device.
I.e., it should run at a good fraction of the global memory bandwidth rather than being limited by PCIe bandwidth.
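To put rough numbers on that, here is a back-of-the-envelope sketch. The buffer size and both bandwidth figures are made-up assumptions for illustration, not measurements of any particular card, and `copy_time_ms` is just a helper name I picked. It also assumes an on-device copy touches global memory twice (one read, one write) and ignores launch overhead and latency.

```python
def copy_time_ms(num_bytes, bandwidth_gb_s, traffic_factor=1.0):
    """Estimated transfer time in milliseconds, ignoring fixed overheads."""
    return num_bytes * traffic_factor / (bandwidth_gb_s * 1e9) * 1e3


NUM_BYTES = 256 * 1024 * 1024  # 256 MiB buffer (illustrative size)
PCIE_GB_S = 12.0               # ASSUMED achievable PCIe bandwidth, GB/s
GLOBAL_GB_S = 200.0            # ASSUMED global memory bandwidth, GB/s

# On-device copy: every byte is read once and written once in global memory,
# so the memory traffic is twice the buffer size.
on_device_ms = copy_time_ms(NUM_BYTES, GLOBAL_GB_S, traffic_factor=2.0)

# Off-device copy: every byte crosses the PCIe link once.
off_device_ms = copy_time_ms(NUM_BYTES, PCIE_GB_S)

speedup = off_device_ms / on_device_ms

print(f"on-device:  {on_device_ms:.2f} ms")
print(f"off-device: {off_device_ms:.2f} ms")
print(f"estimated speedup: {speedup:.1f}x")
```

Plug in the numbers for your own hardware; the point is only that the ratio follows the two bandwidths, so profiling should show the on-device copy as the much cheaper one.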
I know it should be much faster than an off-device copy; I only wanted to be sure. I've profiled an application that uses this API, and the copy is indeed much lighter.
Thanks for the reply =)