When I use this Brook+ version:

func(float* dataOut)
{
    stream out;
    // kernel calculation
    out.write(dataOut);
}

it takes 0.409 s.
When I use CAL:

func(float* dataOut)
{
    CALmem localRes;
    CALmem remoteRes;
    // calculations
    // calMemCopy: copy data from localRes to remoteRes
    // memcpy: from remoteRes to dataOut
}

it takes 1.255 s.
I don't know how to reduce this copy time in CAL.
The main bottleneck in your implementation is the CPU memcpy. Brook+ uses cached remote resources for better CPU memcpy performance. You can try to do the same. Of course, the amount of cached memory available is much smaller than for non-cached resources, so for big sizes you can implement the data transfer in a tile-by-tile manner.
What do you mean by "tile-by-tile manner"?
Say you have a resource of size 1024x1024 and you are not able to allocate a cached resource of that size. Break it into 16 tiles of 256x256 and use a copy kernel to transfer each tile from local (device) memory to the cached remote resource one by one.
Thank you
Do you know where the source code for stream read/write is in Brook+?
$(BROOKROOT)\platform\runtime\CAL\Managers\CALBufferMgr.cpp
CALBufferMgr::setBufferData
CALBufferMgr::getBufferData
Thank you.
Thank you for the tiled example.