
Journeyman III

Different methods to create a memory buffer

Which is the most effective?

The OpenCL specification gives a choice of three methods for creating a memory buffer. I wonder which of them will work faster.


1. call clCreateBuffer(), then clEnqueueWriteBuffer(), then run kernel

2. call clCreateBuffer() with CL_MEM_USE_HOST_PTR flag

3. call clCreateBuffer(), then map the buffer, then manually write data to it, then unmap and run the kernel
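The three options above can be sketched as follows. This is a minimal illustration, not a benchmark: it assumes an already-created context and queue, and the names (ctx, queue, host_data, nbytes) are placeholders; error handling is trimmed for brevity.

```c
#include <CL/cl.h>
#include <string.h>

void upload_examples(cl_context ctx, cl_command_queue queue,
                     void *host_data, size_t nbytes)
{
    cl_int err;

    /* 1. Plain device buffer + explicit blocking write. */
    cl_mem buf1 = clCreateBuffer(ctx, CL_MEM_READ_ONLY, nbytes, NULL, &err);
    clEnqueueWriteBuffer(queue, buf1, CL_TRUE, 0, nbytes, host_data,
                         0, NULL, NULL);

    /* 2. Let the runtime use our host allocation directly. */
    cl_mem buf2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                 nbytes, host_data, &err);

    /* 3. Allocate, map for writing, fill through the mapped pointer, unmap. */
    cl_mem buf3 = clCreateBuffer(ctx, CL_MEM_READ_ONLY, nbytes, NULL, &err);
    void *p = clEnqueueMapBuffer(queue, buf3, CL_TRUE, CL_MAP_WRITE,
                                 0, nbytes, 0, NULL, NULL, &err);
    memcpy(p, host_data, nbytes);
    clEnqueueUnmapMemObject(queue, buf3, p, 0, NULL, NULL);

    clReleaseMemObject(buf1);
    clReleaseMemObject(buf2);
    clReleaseMemObject(buf3);
}
```

With CL_MEM_USE_HOST_PTR (method 2) the application must keep host_data valid for the lifetime of the buffer, and the implementation may still cache it on the device.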


What alignment is required for the host memory pointer in methods 1 and 2? I suppose it will significantly influence data transfer speed.

9 Replies
Adept II

I have a similar question. Between kernel invocations I need to transfer some data to a GPU buffer, but I'm unsure which transfer method is best.

It would be nice to see benchmarks covering all those methods of memory transferring in OpenCL.


Off the top of my head I don't know the answer; it wouldn't be too hard for you to write a benchmark that tests it, though.


A benchmark would be nice but theoretically clEnqueueMapBuffer should be faster.

I don't see how it could be slower, as it gives you memory directly accessible from the GPU, while clEnqueueRead/WriteBuffer probably involves a second copy.

The flags passed to clCreateBuffer (CL_MEM_READ/WRITE_ONLY) and clEnqueueMapBuffer (CL_MAP_READ/WRITE) are hopefully used to avoid unneeded transfers.


I've conducted some benchmarks. Mapped buffers work satisfactorily - like local memory. The memset() function can send data at nearly 3 GB/s. More optimized functions should reach full speed, which is around 6-7 GB/s for a Radeon 4850. Data transfers from GPU to host perform very poorly in all modes; I get 600 MB/s. This has been repeatedly reported in another thread. The problem was partially fixed (for Linux), but on my system (Windows XP x64 + Catalyst 10.10 + AMD 790FX + Radeon 4850) the speed is still low.



Could you share your benchmarks?


The test isn't in good shape; it can only run from within my bigger project. However, it is easy to make such a test from the PCIeSpeed sample included in the SDK.


Originally posted by: tanq I've conducted some benchmarks. Mapped buffers work satisfactorily - like local memory. The memset() function can send data at nearly 3 GB/s.


Data transfers from GPU to host work very poor in all modes.

Sadly, the reason they work like local memory is probably that they are local memory. It is my impression, based purely on timing tests from about half a year ago, that mapping and unmapping do not actually map, but copy the buffer to host memory and back again, respectively. If you are going to do timing tests I highly recommend timing the entire transfer, including the time it takes to map/unmap. Most likely the mapping and unmapping functions will take more time than whatever it is you do to write data to the buffer.

There were a few threads about this a while ago, and I believe the consensus was that mapping was a waste of time, with several people finding that mapping took more time than reads and writes (roughly a factor of 2, if I remember correctly). That said, I agree with iya that mapping has potential. Such mechanisms usually work great with write gathering etc. Overall, if transfer speeds are important to you, I suggest wrapping transfer logic in your own buffer wrapper and implementing both mapping and read/write. Do a quick speed test when your app starts and use whichever is faster for the devices you're using. That way you don't have to guess what the specific OpenCL implementation your application is running on has done to implement these operations.
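Timing the entire map/fill/unmap sequence, as suggested above, can be sketched with OpenCL profiling events. This is a rough sketch under the assumption that the queue was created with CL_QUEUE_PROFILING_ENABLE; queue, buf, host_data, and nbytes are placeholders, and error checks are omitted.

```c
#include <CL/cl.h>
#include <string.h>

/* Returns the wall time in ms from the start of the map command to the
 * end of the unmap command, which also spans the host-side memcpy. */
double mapped_upload_ms(cl_command_queue queue, cl_mem buf,
                        const void *host_data, size_t nbytes)
{
    cl_event map_ev, unmap_ev;
    cl_int err;

    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, nbytes, 0, NULL, &map_ev, &err);
    memcpy(p, host_data, nbytes);                 /* host-side fill */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, &unmap_ev);
    clWaitForEvents(1, &unmap_ev);

    cl_ulong start, end;
    clGetEventProfilingInfo(map_ev, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(unmap_ev, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);

    clReleaseEvent(map_ev);
    clReleaseEvent(unmap_ev);
    return (double)(end - start) * 1e-6;          /* ns -> ms */
}
```

The same function body with clEnqueueWriteBuffer substituted for the map/memcpy/unmap triple gives the comparison point for the app-startup speed test described above.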



Hopefully AMD will implement DMA transfers in the next release of the SDK.


From a benchmark of my current program, it looks like mapping is rather slow.

But there may be other points to consider:

You can work directly with the mapped buffer, and not just do a memcpy.

You can also keep working on the CPU while the data is being transferred. There is no asynchronous version of clCreateBuffer, so at best you can resort to manual multithreading.
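Overlapping CPU work with a transfer doesn't strictly require extra threads if the buffer already exists: a non-blocking write returns immediately and the host can work until the event completes. A minimal sketch, with queue, buf, host_data, and nbytes as placeholders:

```c
#include <CL/cl.h>

void overlapped_upload(cl_command_queue queue, cl_mem buf,
                       const void *host_data, size_t nbytes)
{
    cl_event write_done;

    /* CL_FALSE = non-blocking: the call returns immediately. host_data
     * must stay valid and unmodified until the event completes. */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes, host_data,
                         0, NULL, &write_done);

    /* ... do independent CPU work here while the transfer proceeds ... */

    clWaitForEvents(1, &write_done);  /* sync before touching host_data again */
    clReleaseEvent(write_done);
}
```

This only helps for transfers into an existing buffer; the initial clCreateBuffer with CL_MEM_COPY_HOST_PTR remains synchronous, as noted above.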

Total transfer time should still be small compared to kernel execution time, so it's not a huge issue.

In this case, the GPU memory bandwidth is >100x the initial transfer bandwidth, but with 2560 threads, the kernel time is still >25x the transfer time. That's why I didn't benchmark the reads, which were pretty small anyway.


CPU: Intel E5200 (200 MHz FSB)

RAM: DDR2-667


Catalyst 10.11

Windows 7 x64

The OpenCL program is a 32 bit dll.

----------------------------------------------------------------------
With Initialization: 104504 KiB

clCreateBuffer(CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR) = 104.1534863 ms = 1.0274497 GB/s upload

1st mapping (not necessary since we used CL_MEM_COPY_HOST_PTR):
clEnqueueMapBuffer(CL_MAP_WRITE) = 3.1084368 ms
memcpy = 75.3103040 ms
clEnqueueUnmapMemObject = 258.8472810 ms = 0.4134193 GB/s upload

2nd mapping:
clEnqueueMapBuffer(CL_MAP_WRITE) = 146.6805099 ms  <- This looks like an implementation bug, as we don't want to read from the GPU!?
memcpy = 76.7829847 ms
clEnqueueUnmapMemObject = 252.9093370 ms = 0.4231258 GB/s upload

Kernel time = 20.1328 s, GPU performance = 543 GFLOP/s, bandwidth = 54 GB/s

----------------------------------------------------------------------
Without Initialization: 104504 KiB

clCreateBuffer(CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR) = 0.0566297 ms

1st mapping:
clEnqueueMapBuffer(CL_MAP_WRITE) = 1.4221241 ms
memcpy = 99.5423234 ms
clEnqueueUnmapMemObject = 260.4663920 ms = 0.4108494 GB/s upload

2nd mapping:
clEnqueueMapBuffer(CL_MAP_WRITE) = 1.6227609 ms
memcpy = 114.6823302 ms
clEnqueueUnmapMemObject = 261.0660110 ms = 0.4099058 GB/s upload