Thanks, I've tried that and it seems a little faster, but I think I found the best solution after reading the AMD APP SDK manual. It says that for small updates it's best to use CL_MEM_USE_PERSISTENT_MEM_AMD and map only the part of the buffer you want to update.
It basically goes like this:
// create context, queue and kernels
...
// create main buffer and a buffer for reading back data to cpu
cl_mem mainBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD, mainBufSize, NULL, NULL);
cl_mem cpuBuf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, cpuBufSize, NULL, NULL);
char *cpuPtr = (char *)clEnqueueMapBuffer(queue, cpuBuf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, cpuBufSize, 0, NULL, NULL, NULL);
clEnqueueUnmapMemObject(queue, cpuBuf, cpuPtr, 0, NULL, NULL);
// note: using cpuPtr after the unmap relies on AMD's zero-copy host allocation;
// strictly, the OpenCL spec only guarantees the pointer while the buffer is mapped
// some other code
...
while (running) {
// change some data in mainBuf
for (int i = 0; i < nUpdates; ++i) {
void *ptr = clEnqueueMapBuffer(queue, mainBuf, CL_TRUE, CL_MAP_WRITE, updates[i].offset, updates[i].size, 0, NULL, NULL, NULL);
memcpy(ptr, updates[i].source, updates[i].size); // note that source can be based on cpuPtr or simply a user malloc'ed area
clEnqueueUnmapMemObject(queue, mainBuf, ptr, 0, NULL, NULL);
}
// run kernel
...
// read back results
clEnqueueCopyBuffer(queue, mainBuf, cpuBuf, gpuOff, cpuOff, resultSize, 0, NULL, NULL); // resultSize = number of result bytes to copy back
clFinish(queue);
// now (cpuPtr + cpuOff) holds the results, directly usable by the CPU; just cast it to whatever pointer type you need
}
According to the AMD APP Profiler, time spent on memory transfer operations is reduced roughly tenfold with this approach compared to my previous code. It is also faster for large updates.
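For clarity, here is what I assume each updates[i] record looks like (the field names offset/size/source are just how I use them above); stripped of the OpenCL calls, the update loop is nothing more than scattering small regions into the large buffer:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical update record mirroring updates[i] in the snippet above. */
typedef struct {
    size_t offset;      /* destination offset inside the main buffer */
    size_t size;        /* bytes to copy */
    const char *source; /* host-side data (can be cpuPtr-based or malloc'ed) */
} Update;

/* Same data flow as the map/memcpy/unmap loop, on plain host memory. */
static void apply_updates(char *buf, const Update *updates, int nUpdates) {
    for (int i = 0; i < nUpdates; ++i)
        memcpy(buf + updates[i].offset, updates[i].source, updates[i].size);
}
```

The whole point of CL_MEM_USE_PERSISTENT_MEM_AMD is that the mapped pointer makes these memcpy's land in device-visible memory directly, instead of going through a staging copy.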