3 Replies Latest reply on Jul 23, 2013 1:18 AM by himanshu.gautam

    Fastest device to host transfer


      My question is this: how can I achieve the fastest device-to-host transfer speed? The short answer is pinned memory; however, my problem is a bit more complex.


      I have a piece of device memory which I have to transfer to a host memory address that varies from call to call, so the host memory cannot be pre-pinned. I use this code:



      void clMemcpyDeviceToHost(void * dst,cl_mem src,int size)
      {
          cl_mem cl_output = clCreateBuffer(m_context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, size, dst, NULL);

          void* p_map_output = clEnqueueMapBuffer(m_commandQueue, cl_output, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION , 0, size, 0, NULL, NULL, NULL);

          clEnqueueReadBuffer(m_commandQueue,src, CL_TRUE,0,size,p_map_output,0,NULL,NULL);

          clEnqueueUnmapMemObject(m_commandQueue, cl_output, p_map_output, 0, NULL, NULL);
      }





      It seems that the slowest part is the clEnqueueMapBuffer call, so my guess is that it actually copies something, which I do not want it to do. I tried setting the blocking flag to CL_FALSE and moving the first two lines ahead of a good amount of computation code, so that the mapping could proceed while I do something else, but the call still blocks for a good amount of time (twice as long as the copy afterwards).


      Am I doing it wrong? Is there a faster way?



        • Re: Fastest device to host transfer

          How do you find the slowest part? What size of memory are you reading? Are you running it in iterations?

          The API does not seem to be using any cl_events. (EDITED)

          Your code looks reasonable and should give good read performance.

            • Re: Fastest device to host transfer

              I measured the speed using clFinish() calls before and after the OpenCL code.
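
              For reference, the measurement described above can be sketched roughly as follows. This is only an illustration under assumptions: `elapsed_ms` and `time_section_ms` are hypothetical helper names, the timestamps come from the C11 `timespec_get` wall clock rather than from OpenCL profiling events, and the `section` callback stands in for the OpenCL calls being timed. The caller is assumed to have drained the queue with clFinish() beforehand, and `section` is assumed to end with clFinish() so that all enqueued work is included.

```c
#include <time.h>

/* Hypothetical helper: milliseconds elapsed between two timestamps. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e3
         + (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Hypothetical helper: wall-clock time of one call to `section`, in ms.
 * Assumption: `section` wraps the OpenCL calls under test and finishes
 * with clFinish() on the command queue, so pending work is counted. */
double time_section_ms(void (*section)(void *), void *arg)
{
    struct timespec start, end;
    timespec_get(&start, TIME_UTC);  /* C11 wall clock, not monotonic */
    section(arg);
    timespec_get(&end, TIME_UTC);
    return elapsed_ms(start, end);
}
```

              Timing each call separately this way (map, read, unmap) is what would show which of them accounts for the blocking time.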

              The memory size I was testing is 32 MB.

              Yes, I am running iterations, 1-2000.

              I tried blocking with events, but it did not help.


              For me, clEnqueueMapBuffer seems to block for 5.5 ms even when it should not block. Is there any way to make it not block, or not copy? Do you perhaps have sample code where it does not block?
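
              To put the 5.5 ms in context, the arithmetic below converts the numbers quoted in this thread (32 MB, 5.5 ms) into an effective rate. It is a sketch only: `effective_gib_per_s` is a hypothetical helper name, and 32 MB is assumed to mean 32 * 1024 * 1024 bytes. A rate of this order would be consistent with the map call touching the whole buffer (for example copying or pinning it), although nothing in the thread confirms that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: effective rate in GiB/s for `bytes` handled
 * in `milliseconds`. Requires a strictly positive duration. */
double effective_gib_per_s(size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    const double gib = 1024.0 * 1024.0 * 1024.0;
    return ((double)bytes / gib) / (milliseconds / 1000.0);
}
```

              With the figures above, `effective_gib_per_s(32u * 1024u * 1024u, 5.5)` comes out at roughly 5.7 GiB/s.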