I would like to know if it's possible to get profiling information about data transfer times when using buffers allocated with the CL_MEM_USE_HOST_PTR flag. Currently, I can't seem to find anything in the profiler output (sprofile 2.4 from AMD APP 2.6) about the data upload or download times.
My code is quite simple: it creates the buffers (with the USE_HOST_PTR flag), unmaps them (I don't think this is even necessary), and then, after kernel execution, maps the destination buffer to read the data back from the GPU.
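For reference, the setup I'm describing looks roughly like this (a minimal sketch: buffer size, kernel setup, and error handling are placeholders, not my actual code):

```c
#include <CL/cl.h>
#include <stdlib.h>

int main(void) {
    cl_platform_id plat;
    cl_device_id dev;
    if (clGetPlatformIDs(1, &plat, NULL) != CL_SUCCESS) return 0;  /* no OpenCL runtime */
    if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL) != CL_SUCCESS) return 0;

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, NULL);

    size_t n = 1 << 20;
    float *src = malloc(n * sizeof(float));   /* pre-existing host data */
    float *dst = malloc(n * sizeof(float));

    /* Buffers wrap the existing host allocations (zero-copy candidates). */
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), src, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), dst, NULL);

    /* ... build program, set kernel args, clEnqueueNDRangeKernel ... */

    /* After the kernel: map the destination buffer to read the results. */
    cl_event ev;
    float *p = clEnqueueMapBuffer(q, out, CL_TRUE, CL_MAP_READ, 0,
                                  n * sizeof(float), 0, NULL, &ev, NULL);
    /* ... read p[0..n-1] ... */
    clEnqueueUnmapMemObject(q, out, p, 0, NULL, NULL);
    clFinish(q);
    return 0;
}
```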
In the API trace from sprofile I can see that the map command does report a memory size, although the runtimes are exceptionally short; for the unmappings, I don't even get to see a memory transfer size. Is this information simply unavailable, or is there some other way to extract it?
When you allocate a buffer with CL_MEM_ALLOC_HOST_PTR, there is no transfer going on (zero-copy). If you create a buffer without that flag, there is a data transfer from host to device (except when the device is the CPU), but timestamp information is not available.
When you map the data for host access, the memory object already resides in host memory, so there is no transfer and the map time is small. The transfer size is not shown in the unmap block because it is the same as the one shown in the map block.
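If sprofile does not surface the numbers you want, you can also read the per-command timestamps yourself through event profiling. This is a sketch, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and `ev` is the event returned by the enqueue call (it works for map/unmap events just like for kernels):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Print how long an enqueued command took on the device, in milliseconds. */
static void print_event_ms(cl_event ev, const char *label) {
    cl_ulong start = 0, end = 0;
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);
    printf("%s: %.3f ms\n", label, (end - start) * 1e-6);
}
```

For a zero-copy map you would expect the start-to-end delta to be tiny, since no copy is actually performed.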
Thank you for your reply. I understand that even with zero-copy, when the device is neither the CPU nor an APU and therefore has its own device memory, the data must still get transferred to the device somehow; I was expecting that this transfer could somehow be benchmarked.
The reason for the question is the following: while the GPU kernel runtime with CL_MEM_USE_HOST_PTR is (as expected) higher than when the buffers are copied to the GPU (either with COPY_HOST_PTR or by hand), the longer runtime in the USE_HOST_PTR case is still lower than the total upload + runtime + download time in the copy case.
Maybe the question should be geared more towards what exactly happens in the zero-copy case with a GPU. Does the GPU (a Radeon HD 6970 in our case) access the host memory directly via DMA, or is the buffer still cached in its device memory?
The reason this is relevant in our case is that we are doing one-shot processing, and the data transfers account for 99% of the total program runtime (100-150 ms for transfers against 2-4 ms for processing). With zero-copy, the kernel runtimes are 3-4 times worse, but still much better than the 150 ms with explicit copies. So our question is: are we actually getting a benefit by using zero-copy, or is the transfer still taking the same time, just in a way we can't measure because the data exchange is handled automatically?
BufferBandwidth from the APP SDK benchmarks all types of transfer.
A buffer created with the CL_MEM_USE_HOST_PTR flag is transferred to the device implicitly for discrete GPUs.
In order for a discrete GPU to access host memory directly, you need to use the CL_MEM_ALLOC_HOST_PTR flag.
Table 4.2 from the Programming Guide gives you a general idea of the peak bandwidth for the different types of data transfer.
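For the direct-access scheme, the usual pattern is to let the runtime allocate the backing store and then map it to get a host pointer. A sketch (the function name and the read-write flags are illustrative; `ctx` and `q` are assumed to be a valid context and queue):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Sketch: host-visible buffer via CL_MEM_ALLOC_HOST_PTR. The runtime, not
 * the application, allocates the memory; on AMD the intent is that this
 * lands in host memory a discrete GPU can read directly over the bus. */
static float *create_host_visible_buffer(cl_context ctx, cl_command_queue q,
                                         size_t bytes, cl_mem *buf_out) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                bytes, NULL, &err);
    /* Map to obtain a host pointer into the runtime-allocated memory;
     * fill it, then unmap before the kernel touches the buffer. */
    float *host = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                              0, bytes, 0, NULL, NULL, &err);
    *buf_out = buf;
    return host;
}
```

After writing the data through the returned pointer, call clEnqueueUnmapMemObject before enqueuing the kernel.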
Thank you, I will also have a look at the BufferBandwidth example. I also noticed that the programming guide (at least up to version 1.3f) mentions that zero-copy for discrete GPUs is not supported in Linux. Is this still the case?
It seems it is not really true that ALLOC_HOST_POINTER makes the driver allocate memory on the host. On the HD 5870, the bandwidth for linearly reading data from such a buffer is over 150 GB/s, as I reported here: http://devgurus.amd.com/thread/158589
This makes me think that the ALLOC_HOST_POINTER flag is only a hint: depending on the driver, the buffer can be allocated either on the host or on the device.