Thanks, I've tried that and it seems a little faster, but I think I found the best solution after reading the AMD APP SDK manual. It says that for small updates it's best to use CL_MEM_USE_PERSISTENT_MEM_AMD and map only the part of the buffer you want to update.
It basically goes like this:
// create context, queue and kernels
...
// create main buffer and a buffer for reading back data to cpu
cl_mem mainBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD, mainBufSize, NULL, NULL);
cl_mem cpuBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, cpuBufSize, NULL, NULL);
char *cpuPtr = (char *)clEnqueueMapBuffer(queue, cpuBuf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, cpuBufSize, 0, NULL, NULL, NULL);
clEnqueueUnmapMemObject(queue, cpuBuf, cpuPtr, 0, NULL, NULL);
// note: using cpuPtr after the unmap relies on AMD's zero-copy host allocation;
// strictly, the OpenCL spec only guarantees the pointer while the buffer is mapped
// some other code
...
while (running) {
// change some data in mainBuf
for (int i = 0; i < nUpdates; ++i) {
void *ptr = clEnqueueMapBuffer(queue, mainBuf, CL_TRUE, CL_MAP_WRITE, updates[i].offset, updates[i].size, 0, NULL, NULL, NULL);
memcpy(ptr, updates[i].source, updates[i].size); // note that source can be based on cpuPtr or simply a user malloc'ed area
clEnqueueUnmapMemObject(queue, mainBuf, ptr, 0, NULL, NULL);
}
// run kernel
...
// read back results
clEnqueueCopyBuffer(queue, mainBuf, cpuBuf, gpuOff, cpuOff, resultSize, 0, NULL, NULL); // resultSize = number of result bytes to copy back
clFinish(queue);
// now (cpuPtr + cpuOff) holds the results, directly usable by the CPU; just cast it to whatever pointer type you need
}
According to the AMD APP Profiler, time spent on memory transfer operations is reduced roughly tenfold with this approach compared to my previous code. It is also faster for large updates.
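For clarity, here is what I assume each updates[i] record looks like (the field names offset/size/source are just how I use them above); stripped of the OpenCL calls, the update loop is nothing more than scattering small regions into the large buffer:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical update record mirroring updates[i] in the snippet above. */
typedef struct {
    size_t offset;      /* destination offset inside the main buffer */
    size_t size;        /* bytes to copy */
    const char *source; /* host-side data (can be cpuPtr-based or malloc'ed) */
} Update;

/* Same data flow as the map/memcpy/unmap loop, on plain host memory. */
static void apply_updates(char *buf, const Update *updates, int nUpdates) {
    for (int i = 0; i < nUpdates; ++i)
        memcpy(buf + updates[i].offset, updates[i].source, updates[i].size);
}
```

The whole point of CL_MEM_USE_PERSISTENT_MEM_AMD is that the mapped pointer makes these memcpy's land in device-visible memory directly, instead of going through a staging copy.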