Hello all,
While working on my problem, I came across an interesting phenomenon which I'm trying to understand. Basically, I create a pinned memory buffer and transfer data between device and host using clEnqueueWriteBuffer. I get a data rate of about 6 GB/s on a Kaveri CPU with a Hawaii GPU connected over a PCIe 3 bus. This is the maximum, as verified by AMD's BufferBandwidth sample. To illustrate the measurement, here is the pseudocode:
// Create device and pinned host memory
cl_mem dmem = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * size, NULL, &err); // Error checks are done, but not shown here
cl_mem pinned_hmem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size * sizeof(cl_float), NULL, &err);
cl_float *transfer_data = (cl_float*) clEnqueueMapBuffer(commands, pinned_hmem, CL_TRUE, CL_MAP_WRITE, 0, size * sizeof(cl_float), 0, NULL, NULL, &err);
memcpy(transfer_data, data, sizeof(cl_float) * size); // "data" consists of pre-defined stuff
clEnqueueUnmapMemObject(commands, pinned_hmem, (void*) transfer_data, 0, NULL, NULL);
// map again as read only
transfer_data = (cl_float*) clEnqueueMapBuffer(commands, pinned_hmem, CL_TRUE, CL_MAP_READ, 0, size * sizeof(cl_float), 0, NULL, NULL, &err);
clFinish(commands);
startTimer();
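// This is repeated for a few iterations and the average is calculated
err = clEnqueueWriteBuffer(commands, dmem, CL_FALSE, 0, sizeof(cl_float) * size, transfer_data, 0, NULL, NULL);
clFinish(commands); // The write is non-blocking, so wait for it to complete before stopping the timer
endTimer(); // Calculate the transfer rate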
However, if I map the device memory and copy the data instead of calling clEnqueueWriteBuffer, I get a data rate of close to 2.2 GB/s. I'm trying to understand the reason for this discrepancy. Here is the pseudocode:
// Creation of device and pinned host memory remains same as above
startTimer();
// This too is averaged over a few iterations
void *mapped_dmem = clEnqueueMapBuffer(commands, dmem, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * size, 0, NULL, NULL, &err);
memcpy(mapped_dmem, transfer_data, sizeof(cl_float) * size);
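clEnqueueUnmapMemObject(commands, dmem, mapped_dmem, 0, NULL, NULL);
clFinish(commands); // The unmap is asynchronous, so wait for it to complete before stopping the timer
endTimer(); // Calculate the transfer rate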
Could someone explain why the transfer rate drops to less than half?
Thanks for reading
Edit: Updated first pseudocode and put memcpy in right place
A related topic: Pinned memory makes driver very happy
There is a little discussion about why this might be. If that answers your question, then please mark this reply as "correct" so anyone finding this knows to go look at the related topic.
Hope this helps.
When you map a buffer, the OpenCL runtime must copy it from device memory to host memory, even if you specify 'CL_MAP_WRITE'. When you unmap it, the runtime copies the buffer back to device memory. The throughput is roughly halved because the data is copied in both directions, whereas 'clEnqueueWriteBuffer' copies it only one way.
The runtime must copy both ways because, according to the spec, with 'CL_MAP_WRITE' the user is not obliged to overwrite the buffer completely, so the runtime must fill the mapped region with the current contents in case of a partial update.
Only with 'CL_MAP_READ' can the runtime skip the copy back to device memory.
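As a rough illustration of the reasoning above (a back-of-the-envelope model, not a measurement): if the same payload passes through several stages one after another, the per-byte times add up, so the stage rates combine harmonically. The helper name `effective_rate_gbps` is hypothetical, and the model assumes the stages do not overlap.

```c
#include <stddef.h>

/* Effective rate (GB/s) when the same payload passes through several
 * stages back to back: the per-GB times add, so the rates combine
 * harmonically. Returns 0 for an empty list or a non-positive rate. */
double effective_rate_gbps(const double *stage_rates_gbps, size_t n_stages)
{
    double seconds_per_gb = 0.0;
    for (size_t i = 0; i < n_stages; ++i) {
        if (stage_rates_gbps[i] <= 0.0)
            return 0.0;
        seconds_per_gb += 1.0 / stage_rates_gbps[i];
    }
    return n_stages ? 1.0 / seconds_per_gb : 0.0;
}
```

With two stages of 6 GB/s each (copy to host on map, copy back on unmap, assuming both directions reach the write rate from the question) this gives 3 GB/s. The timed region in the question also contains the host memcpy; adding a third stage with an assumed, purely illustrative memcpy rate of 8 GB/s gives about 2.18 GB/s, which is in the neighbourhood of the reported 2.2 GB/s.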
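That double copy is why CL_MAP_WRITE_INVALIDATE_REGION was introduced in OpenCL 1.2: it tells the runtime that the mapped region will be overwritten, so the previous contents do not have to be copied from device to host on map.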
I appreciate your quick response. Thank you very much!