I tried transferring 4M floats from the CPU to the GPU over a PCIe 2.0 port.
It takes 36 ms with CAL but only 11 ms with CUDA. Is there any way to improve the CAL transfer performance?
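To put those timings in bandwidth terms, here is a rough sketch of the arithmetic. It assumes "4M floats" means 4 * 2**20 single-precision values of 4 bytes each (16 MiB in total), and that the timings cover only the copy itself; the function and variable names are just for illustration.

```python
# Assumption: "4M floats" = 4 * 2**20 values, 4 bytes each (single precision).
NUM_FLOATS = 4 * 2**20
BYTES_PER_FLOAT = 4


def effective_bandwidth_mb_s(num_bytes, elapsed_ms):
    """Return the effective transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return num_bytes / (elapsed_ms / 1000.0) / 1e6


total_bytes = NUM_FLOATS * BYTES_PER_FLOAT

# Timings taken from the measurements quoted above.
cal_mb_s = effective_bandwidth_mb_s(total_bytes, 36.0)
cuda_mb_s = effective_bandwidth_mb_s(total_bytes, 11.0)

print(f"payload: {total_bytes} bytes")
print(f"CAL:  {cal_mb_s:.0f} MB/s")
print(f"CUDA: {cuda_mb_s:.0f} MB/s")
print(f"CUDA/CAL ratio: {cuda_mb_s / cal_mb_s:.2f}x")
```

Under that assumption, CAL is moving the data at a bit under 0.5 GB/s and CUDA at roughly 1.5 GB/s, so both are well below what a PCIe 2.0 x16 link can carry in theory.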
I remember there is something called memory pinning in Brook+. What does it do under the hood? Could I use memory pinning with CAL too?
Thanks for your ideas.