How can I accelerate data transfer between the CPU and GPU?

     I've tried passing 4M floats from the CPU to the GPU over a PCIe 2.0 port.

     The transfer takes 36 ms with CAL, but only 11 ms with CUDA. Is there any way to improve the CAL performance?
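
     To put those timings in perspective, here is a rough conversion to effective bandwidth. It is only a sketch: it assumes "4M" means 4 * 1024 * 1024 elements and that each float is 4 bytes, neither of which I measured precisely. The function name is just for illustration.

```python
# Rough effective-bandwidth estimate from the two timings.
# Assumptions: "4M" = 4 * 1024 * 1024 elements, 4 bytes per float.

def effective_bandwidth_mb_s(num_floats, seconds, bytes_per_float=4):
    """Return the transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return num_floats * bytes_per_float / seconds / 1e6

NUM_FLOATS = 4 * 1024 * 1024

cal_mb_s = effective_bandwidth_mb_s(NUM_FLOATS, 36e-3)   # CAL: 36 ms
cuda_mb_s = effective_bandwidth_mb_s(NUM_FLOATS, 11e-3)  # CUDA: 11 ms

print(f"CAL:  {cal_mb_s:.0f} MB/s")
print(f"CUDA: {cuda_mb_s:.0f} MB/s")
```

     Under those assumptions that is roughly 0.47 GB/s for CAL versus about 1.5 GB/s for CUDA, so the gap is around 3x.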

     I remember there is something called memory pinning in Brook+. How does it work under the hood? Could I use memory pinning with CAL too?

    Thanks for your ideas.