OpenCL

skanur
Journeyman III

Mapping device memory

Hello all,

While working on my problem, I came across an interesting phenomenon that I'm trying to understand. I create a pinned host buffer and transfer data from host to device using clEnqueueWriteBuffer. I get a data rate of about 6 GB/s on a Kaveri CPU with a Hawaii GPU connected over a PCIe 3 bus. This is the maximum, as verified by AMD's BufferBandwidth sample. To illustrate the measurement, here is the pseudocode:


// Create device buffer and pinned host buffer (error checks are done, but not shown here)
cl_mem dmem = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * size, NULL, &err);
cl_mem pinned_hmem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(cl_float) * size, NULL, &err);

// Map the pinned buffer for writing and fill it; "data" consists of pre-defined stuff
cl_float *transfer_data = (cl_float*) clEnqueueMapBuffer(commands, pinned_hmem, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * size, 0, NULL, NULL, &err);
memcpy(transfer_data, data, sizeof(cl_float) * size);
clEnqueueUnmapMemObject(commands, pinned_hmem, (void*) transfer_data, 0, NULL, NULL);

// Map again as read-only
transfer_data = (cl_float*) clEnqueueMapBuffer(commands, pinned_hmem, CL_TRUE, CL_MAP_READ, 0, sizeof(cl_float) * size, 0, NULL, NULL, &err);
clFinish(commands);

startTimer();
// This is repeated for a few iterations and the average is calculated
err = clEnqueueWriteBuffer(commands, dmem, CL_FALSE, 0, sizeof(cl_float) * size, transfer_data, 0, NULL, NULL);
clFinish(commands); // Wait for the non-blocking write to complete before stopping the timer
endTimer(); // Calculate the transfer rate

However, if instead of calling clEnqueueWriteBuffer I map the device memory and copy the data into it, I get a data rate of close to 2.2 GB/s. I'm trying to understand the reason for this discrepancy. Here is the pseudocode:


// Creation of device and pinned host memory remains the same as above

startTimer();
// This too is averaged over a few iterations
void *mapped_dmem = clEnqueueMapBuffer(commands, dmem, CL_TRUE, CL_MAP_WRITE, 0, sizeof(cl_float) * size, 0, NULL, NULL, &err);
memcpy(mapped_dmem, transfer_data, sizeof(cl_float) * size);
clEnqueueUnmapMemObject(commands, dmem, mapped_dmem, 0, NULL, NULL);
clFinish(commands); // Wait for the unmap to complete before stopping the timer
endTimer(); // Calculate the transfer rate

Could someone explain why the transfer rate is almost half?

Thanks for reading

Edit: Updated the first pseudocode and put the memcpy in the right place.


Accepted Solutions
tzachi_cohen
Staff

Re: Mapping device memory

When you map a buffer, the OpenCL runtime must copy it from device memory to host memory, even if you specify 'CL_MAP_WRITE'. When you unmap, the runtime copies the buffer back to device memory. The halved throughput is due to copying in both directions, whereas 'clEnqueueWriteBuffer' copies only one way.

The runtime must copy both ways because, according to the spec, with 'CL_MAP_WRITE' the user is not obliged to overwrite the whole buffer, so the runtime must fill the mapped region with the current contents in case of a partial update.

Only with 'CL_MAP_READ' can the runtime skip the copy back to device memory.
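
As a rough sanity check on the numbers in this thread, the explanation above can be turned into a back-of-the-envelope model. This is only a sketch: it assumes the device-to-host copy on map, the host memcpy, and the host-to-device copy on unmap run strictly one after another with no overlap. The function name `mapped_write_gbps` and the memcpy bandwidth are hypothetical; only the 6 GB/s link rate comes from the question.

```c
#include <assert.h>

/* Hypothetical model, not part of any OpenCL API.
 * Estimates the effective throughput (GB/s) of map + memcpy + unmap,
 * assuming three fully serialized stages:
 *   1. device -> host copy when the buffer is mapped
 *   2. host memcpy into the mapped region
 *   3. host -> device copy when the buffer is unmapped */
double mapped_write_gbps(double link_gbps, double memcpy_gbps)
{
    assert(link_gbps > 0.0 && memcpy_gbps > 0.0);

    /* Seconds needed to push 1 GB through all three stages. */
    double seconds_per_gb = 2.0 / link_gbps + 1.0 / memcpy_gbps;

    return 1.0 / seconds_per_gb;
}
```

With a 6 GB/s link, the two bus copies alone cap the rate at 3 GB/s. Adding an assumed host memcpy rate of 8 GB/s gives `mapped_write_gbps(6.0, 8.0)`, roughly 2.18 GB/s, which is in the neighborhood of the 2.2 GB/s reported in the question.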

4 Replies
jtrudeau
Staff

Re: Mapping device memory

A related topic: Pinned memory makes driver very happy

There is a little discussion about why this might be. If that answers your question, then please mark this reply as "correct" so anyone finding this knows to go look at the related topic.

Hope this helps.

nou
Exemplar

Re: Mapping device memory

That is why CL_MAP_WRITE_INVALIDATE_REGION was introduced.
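
A minimal sketch of that approach, assuming an OpenCL 1.2 or later runtime and an existing command queue and device buffer like `commands` and `dmem` in the question. The helper name `upload_via_map` is hypothetical. With CL_MAP_WRITE_INVALIDATE_REGION the host promises to overwrite the entire mapped region, so the runtime is allowed to skip the device-to-host copy on map; whether a given implementation actually skips it is not guaranteed.

```c
#include <string.h>
#include <CL/cl.h>

/* Hypothetical helper: upload `bytes` bytes from `src` into `dmem`
 * by mapping the buffer, copying on the host, and unmapping. */
cl_int upload_via_map(cl_command_queue commands, cl_mem dmem,
                      const void *src, size_t bytes)
{
    cl_int err;

    /* Blocking map; the previous contents of the region are declared
     * invalid, so the host must write every byte of it. */
    void *dst = clEnqueueMapBuffer(commands, dmem, CL_TRUE,
                                   CL_MAP_WRITE_INVALIDATE_REGION,
                                   0, bytes, 0, NULL, NULL, &err);
    if (err != CL_SUCCESS)
        return err;

    memcpy(dst, src, bytes);

    /* Unmap is non-blocking, so wait for it before returning. */
    err = clEnqueueUnmapMemObject(commands, dmem, dst, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        return err;

    return clFinish(commands);
}
```

In the question's second timing loop this would be called as `upload_via_map(commands, dmem, transfer_data, sizeof(cl_float) * size)`.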

skanur
Journeyman III

Re: Mapping device memory

I appreciate your quick response. Thank you very much!
