My question is this: how to achieve the fastest device to host transfer speed. The short answer is pinned memory, however my problem is a bit more complex.
I have a piece of device memory which I have to transfer to a varying address of host memory. So the host memory cannot be prepinned. I use this code:
void clMemcpyDeviceToHost(void * dst,cl_mem src,int size)
{
cl_mem cl_output = clCreateBuffer(m_context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, size, dst, NULL);
void* p_map_output = clEnqueueMapBuffer(m_commandQueue, cl_output, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION , 0, size, 0, NULL, NULL, NULL);
clEnqueueReadBuffer(m_commandQueue,src, CL_TRUE,0,size,p_map_output,0,NULL,NULL);
clEnqueueUnmapMemObject(m_commandQueue, cl_output, p_map_output, 0, NULL, NULL);
clReleaseMemObject(cl_output);
}
It seems that the slowest part is the clEnqueueMapBuffer, so my guess is that it actually copies something which I would not want it to do. I tried to set the block flag to CL_FALSE and put the first two lines before a good amount of computation code so that it could do the mapping while I do something else, but the call still blocks for a good amount of time (twice as long then the copy afterwards).
Am I doing it wrong? Is there a faster way?
Thanks