

Asynchronous pinned transfers?

I'm trying to modify BufferBandwidth from the AMDAPP SDK so that I can run multiple threads concurrently transferring data either to the same device, or a second device in the system.

With a single thread, host-to-device transfers work fine (I basically do the same thing as the SDK code).

When I use two threads, both sending data host to device, I seem to get full speed with one thread and half speed (twice as long to transfer) for the second thread.

Is there an issue using two pinned buffers simultaneously? I have a tracing tool from my company that I use and am able to see that the threads start within microseconds of each other, but one finishes in twice as long. Expectation of course would be that both finish around the same time. Are mapped writes not a DMA?

Each thread has its own context and queue.

Each thread does the following:


  // create host buffer

  mem_host = clCreateBuffer(context, CL_MEM_READ_ONLY, data_bytes, host_ptr, &ret);

  // Create scratch memory for the mapped device memory

  void* memscratch;

  posix_memalign( &memscratch, 4096, data_bytes ); //Create a buffer aligned at 4096 byte blocks

  // Create the Device buffer

  mem_dev = clCreateBuffer( context,

                            CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,


                            memscratch, &ret);

  // Map the device buffer (pre-pin it)

  void* dev_ptr;

  dev_ptr=(void*)clEnqueueMapBuffer(  command_queue,



                                      CL_MAP_READ | CL_MAP_WRITE,



                                      0, NULL,

                                      NULL, &ret);

  // Flush/finish the command





    clEnqueueWriteBuffer( command_queue,


                                CL_FALSE, 0,



                                0, NULL, &ev);


    cl_int param_value;

    size_t param_value_size_ret;



      ret |= clGetEventInfo( ev,


          sizeof( cl_int ),


          &param_value_size_ret );

      if( param_value == CL_COMPLETE )



    clReleaseEvent( ev );

    tod_res = stop_tod_timer(&start_timer);

8 Replies

It does get serialized. You should use the profiling information from events rather than your own timing functions, because OpenCL calls are asynchronous.


Can you please explain a little further? What I have shown is what happens in each thread, and I even set the affinity of each thread to a different CPU core. My timing functions will not affect the asynchronous behavior, as they run in different threads.

I think all it comes down to is whether or not pinned memory transactions are asynchronous in their transfer, or if I must use non-pinned for this.



Your transfers get serialized because there is only one bus. You start measuring time in threads A and B and enqueue a write in each. The write from thread A is transferred first, and the write from thread B only after that. So thread A measures the real transfer time, but thread B measures double that, because its transfer has to wait for A's to finish; the two cannot be transferred at the same time. In thread B you are measuring the transfer of A plus B.
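To make the arithmetic concrete, here is a minimal sketch in plain C (no OpenCL calls) that models a single copy engine servicing transfers strictly in order. The `transfer_t` struct and the `serialize_transfers` and `observed_time` helpers are hypothetical names made up for this illustration, and the in-order, one-at-a-time servicing is an assumption of the model, not something read back from the driver.

```c
#include <stddef.h>

/* Hypothetical model of one host-to-device transfer as seen by a thread. */
typedef struct {
    double start_s;    /* when the thread starts its timer and enqueues the write */
    double duration_s; /* time the copy needs once it has the engine to itself */
    double finish_s;   /* filled in by serialize_transfers() */
} transfer_t;

/* Assumption: a single engine services the transfers one at a time,
 * in array order. A transfer begins when it has been enqueued AND the
 * engine is free, whichever happens later. */
void serialize_transfers(transfer_t *t, size_t n)
{
    double engine_free_at = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double begin = (t[i].start_s > engine_free_at) ? t[i].start_s
                                                       : engine_free_at;
        t[i].finish_s = begin + t[i].duration_s;
        engine_free_at = t[i].finish_s;
    }
}

/* Wall-clock time the owning thread would measure with its own timer. */
double observed_time(const transfer_t *t)
{
    return t->finish_s - t->start_s;
}
```

With two equal transfers that both start at time 0 and each need 1 second of engine time, the first thread observes 1 second and the second observes 2 seconds, which matches the "full speed in one thread, half speed in the other" symptom.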

Also, to get a pinned memory buffer you need to create it with CL_MEM_ALLOC_HOST_PTR.

And your code is wrong, because you first map the buffer and then enqueue a write to it. You should not write to or read from a buffer while it is mapped.


The behavior of OpenCL function calls that enqueue commands that write or copy to regions of a memory object that are mapped is undefined.

And one last thing: use clWaitForEvents() instead of that busy loop, and clGetEventProfilingInfo() to get proper timing.
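As a rough sketch of what to do with the profiling values once you have them: clGetEventProfilingInfo() with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END returns device timestamps in nanoseconds as cl_ulong, provided the queue was created with CL_QUEUE_PROFILING_ENABLE. The plain C helpers below only do the arithmetic on two such values. The names `elapsed_seconds` and `bandwidth_gb_per_s` are hypothetical, and `uint64_t` stands in for cl_ulong so the snippet builds without the OpenCL headers.

```c
#include <stdint.h>

/* Seconds between two nanosecond timestamps, e.g. the START and END
 * profiling counters of a single write event. Returns 0.0 if the
 * timestamps are not in increasing order. */
double elapsed_seconds(uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns <= start_ns) {
        return 0.0;
    }
    return (double)(end_ns - start_ns) / 1e9;
}

/* Transfer rate in GB/s (1 GB = 1e9 bytes). Bytes per nanosecond is
 * numerically the same as GB per second. Returns 0.0 for an empty or
 * inverted interval instead of dividing by zero. */
double bandwidth_gb_per_s(uint64_t bytes, uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns <= start_ns) {
        return 0.0;
    }
    return (double)bytes / (double)(end_ns - start_ns);
}
```

Because the timestamps come from the command itself rather than from the host thread, time spent waiting behind another transfer in the queue is not counted as part of this command's execution.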

Interesting. I copied my method of transfer from the BufferBandwidth sample in the AMDAPP SDK; if you pass the -dma or -pcie flag to it to test bandwidth, this is the process it uses. I will verify, and if so I will ask AMDAPP support for their opinion.


PCIe is bi-directional, however most AMD GPUs only have a single DMA engine to do async copies.  One thing you can do is schedule a copy from the GPU to the CPU and then perform a CPU copy to the GPU via persistent memory (assuming VM is enabled for your device and OS).


Thanks Jeff. I believe it's the single DMA engine that I was hitting.

About mapping the buffer: I don't believe the Khronos OpenCL spec's statement about undefined behavior when a buffer is mapped is accurate for AMD, nor for NVIDIA for that matter. Both AMD's and NVIDIA's bandwidth samples map the buffer to a host pointer and use that pointer to target the transfer. I believe this is how you get "pinned memory" with OpenCL. The spec may not have defined this behavior, but it appears NVIDIA and AMD have.

One curious thing, which maybe I need a separate thread for, is that the AMD samples map the device buffer to a pointer on the host (called dev_ptr in my example) and do the host-to-GPU transfer as a write from the host buffer to the mapped pointer. The NVIDIA samples map the host buffer to a pointer and do the write from the host pointer to the device buffer. Is there any significance to this? Can it be done either way arbitrarily?


Maybe I am misunderstanding you, but using a pinned host pointer as the source or destination of a clEnqueueReadBuffer/clEnqueueWriteBuffer command is much different than dispatching a kernel.

I guess I am confused by your second question.  When you do a Map operation, you always get a host pointer back.  Doesn't the AMD sample use CL_MEM_ALLOC_HOST_PTR for this buffer?  If you can point me to the code you have a question about, that would be helpful.


Sorry for the delay, it took me a bit to find time to write out my thoughts, and then I realized that I was incorrect in my reading of the AMD code. I thought the BufferBandwidth input buffer was host side and the copy buffer device side, but it is the reverse. The only difference between AMD's code and NVIDIA's code is that AMD uses posix_memalign to create a host buffer and creates the buffer using CL_MEM_USE_HOST_PTR, where NVIDIA uses CL_MEM_ALLOC_HOST_PTR.

Is there any functional difference here, except that using CL_MEM_USE_HOST_PTR with posix_memalign would require you to change the code if 4096 byte alignment became inefficient in future architectures?
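On the alignment concern specifically, one way to avoid hard-coding 4096 in the CL_MEM_USE_HOST_PTR path is to ask the OS for its page size at run time. This is only a sketch: `alloc_page_aligned` is a hypothetical helper, and it assumes that page alignment is what the runtime wants for a host pointer, which the thread does not confirm. It also checks the return value of posix_memalign, which the snippet in the original post ignores.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign under strict C modes */
#endif
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocates `bytes` bytes aligned to the system page size.
 * Returns NULL on failure. The caller releases the buffer with free(). */
void *alloc_page_aligned(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        page = 4096; /* fallback assumption if the page size is unavailable */
    }

    void *p = NULL;
    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&p, (size_t)page, bytes) != 0) {
        return NULL;
    }
    return p;
}
```

The returned pointer would take the place of memscratch in the clCreateBuffer call that uses CL_MEM_USE_HOST_PTR. With CL_MEM_ALLOC_HOST_PTR the runtime allocates the host memory itself, so the application does not choose an alignment at all.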