When I use this,
it took 0.409 s.
When I use CAL:
func (float* dataOut)
calMemCopy (copy data from localRes to remoteRes)
memcpy (from remoteRes to dataOut)
it took 1.255 s.
I don't know how to reduce this copy time in CAL.
The main bottleneck in your implementation is the CPU memcpy. Brook+ uses cached remote resources to get better CPU memcpy performance.
You can try to do the same. Of course, far less cached remote memory is available than non-cached memory. For large sizes, you can try to implement the data transfer tile by tile.
Let's say you have a resource of size 1024x1024 and you are not able to allocate a cached resource of this size. Break it into 16 tiles of 256x256 and use a copy kernel to transfer each tile from device memory to the cached remote resource, one by one.
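To make the tiling idea concrete, here is a minimal host-side sketch in plain C. It does not call the CAL API at all: the small `staging` buffer only stands in for the cached remote resource, and `fetch_tile`, `store_tile` and `copy_tiled` are hypothetical names made up for this illustration. In real code the `fetch_tile` step would be the copy kernel (or calMemCopy) that fills the cached resource, and `store_tile` would be the CPU memcpy into dataOut.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the device-to-cached-resource copy: pack the tile whose
 * top-left corner is (x0, y0) from a row-major source of the given width
 * into a dense tile x tile staging buffer. */
static void fetch_tile(const float *src, size_t width,
                       size_t x0, size_t y0, size_t tile, float *staging)
{
    for (size_t row = 0; row < tile; ++row)
        memcpy(staging + row * tile,
               src + (y0 + row) * width + x0,
               tile * sizeof(float));
}

/* Stand-in for the CPU memcpy out of the cached resource: unpack the
 * staging buffer into the matching tile of the destination image. */
static void store_tile(const float *staging, size_t width,
                       size_t x0, size_t y0, size_t tile, float *dst)
{
    for (size_t row = 0; row < tile; ++row)
        memcpy(dst + (y0 + row) * width + x0,
               staging + row * tile,
               tile * sizeof(float));
}

/* Transfer the whole image one tile at a time through the small staging
 * buffer. Assumes width and height are multiples of tile.
 * Returns the number of tiles transferred. */
static size_t copy_tiled(const float *src, float *dst,
                         size_t width, size_t height, size_t tile,
                         float *staging)
{
    size_t count = 0;
    assert(tile > 0 && width % tile == 0 && height % tile == 0);
    for (size_t y0 = 0; y0 < height; y0 += tile) {
        for (size_t x0 = 0; x0 < width; x0 += tile) {
            fetch_tile(src, width, x0, y0, tile, staging);
            store_tile(staging, width, x0, y0, tile, dst);
            ++count;
        }
    }
    return count;
}
```

The number of transfers is (width / tile) * (height / tile), so the staging buffer only ever needs to hold one tile. Whether this beats a single large copy through a non-cached resource depends on the hardware and driver, so it is worth timing both.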