I would like to make use of zero-copy in an APU environment for legacy code.
I intend to use the following code for data transfer:
// Create Buffers, somewhere else in the application
inBuf = clCreateBuffer(context, CL_MEM_READ_ONLY, bufSize, NULL, &err); //input
outBuf = clCreateBuffer(context, CL_MEM_WRITE_ONLY |
CL_MEM_ALLOC_HOST_PTR, bufSize, NULL, &err); //output
// get direct pointer to buffer
inPtr = (unsigned char *) clEnqueueMapBuffer(commands, inBuf, CL_TRUE, CL_MAP_WRITE, 0, bufSize, 0, NULL, NULL, &err);
// do something with the data pointed to by inPtr
clEnqueueUnmapMemObject(commands, inBuf, inPtr, 0, NULL, NULL); //unMap inPtr
// ... enqueue and run the kernel here (omitted) ...
// access result
outPtr = (unsigned char *) clEnqueueMapBuffer(commands, outBuf, CL_TRUE, CL_MAP_READ, 0, bufSize, 0, NULL, NULL, &err);
clEnqueueUnmapMemObject(commands, outBuf, outPtr, 0, NULL, NULL); //unMap outPtr
Is this the correct way to perform data transfer?
Also, for me low invocation/map overhead is more important than peak throughput on the GPU: the OpenCL kernels will be executed as part of a legacy application where there is no way to do double-buffered data transfers, so all the map/unmap calls should be fast. Do the parameters chosen for buffer creation in the code above make sense for this scenario?
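For reference, the variant I am comparing against also requests the input buffer as pre-pinned host memory, since my understanding (an assumption on my part, based on AMD's OpenCL programming guide, not something I have verified on this hardware) is that the runtime can only give zero-copy access on an APU when it allocates the buffer itself:

```c
/* Sketch (assumption): both buffers requested with CL_MEM_ALLOC_HOST_PTR
 * so the runtime can place them in pinned host memory for zero-copy.
 * context, bufSize and err are the same as in the snippet above. */
cl_mem inBuf  = clCreateBuffer(context,
        CL_MEM_READ_ONLY  | CL_MEM_ALLOC_HOST_PTR, bufSize, NULL, &err);
cl_mem outBuf = clCreateBuffer(context,
        CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, bufSize, NULL, &err);
```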
I've created a trace using CodeXL, and map/unmap with code very similar to the snippet above (only with 3 input/output buffers) has quite high overhead compared to the actual kernel invocation:
As you can see, the kernel itself executes in ~1.5 ms (the first buffer map is slow because it has to wait for the kernel to finish). However, mapping the input buffers (CL_MAP_WRITE) is horribly slow, taking 0.18-0.25 ms each.
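For what it's worth, to time the map call itself without CodeXL, I believe one can attach a profiling event to the map command (a sketch, assuming `commands` was created with CL_QUEUE_PROFILING_ENABLE; the variable names like mapEvt are mine):

```c
/* Sketch: measure how long the blocking map takes on the device timeline.
 * Assumes the queue `commands` was created with CL_QUEUE_PROFILING_ENABLE. */
cl_event mapEvt;
unsigned char *p = (unsigned char *) clEnqueueMapBuffer(
        commands, inBuf, CL_TRUE, CL_MAP_WRITE,
        0, bufSize, 0, NULL, &mapEvt, &err);

cl_ulong t0 = 0, t1 = 0;
clGetEventProfilingInfo(mapEvt, CL_PROFILING_COMMAND_START,
                        sizeof t0, &t0, NULL);
clGetEventProfilingInfo(mapEvt, CL_PROFILING_COMMAND_END,
                        sizeof t1, &t1, NULL);
printf("map took %.3f ms\n", (t1 - t0) * 1e-6); /* timestamps are in ns */
clReleaseEvent(mapEvt);
```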
Isn't there anything I can do to reduce this overhead?
The APU I used is an AMD A10-7800 (Spectre) running CentOS 7 with the latest Catalyst drivers.
Thank you in advance, Clemens