6 Replies Latest reply on Feb 11, 2012 7:33 PM by cadorino

    Getting memory transfer times when using CL_MEM_USE_HOST_PTR




      I would like to know if it's possible to get profiling information about data transfer times when using buffers allocated with the CL_MEM_USE_HOST_PTR flag. Currently, I can't find anything in the profiler output (sprofile 2.4 from AMD APP 2.6) about the data upload or download times.

      My code is quite simple: it creates the buffers (with the USE_HOST_PTR flag), unmaps them (I don't think this is even necessary), and then maps the destination buffer to read the data from the GPU after kernel execution.

      In the API trace from sprofile I see that the map command does specify a memory size, although the runtimes are exceptionally short; for the unmappings, I don't even get to see the memory transfer size. Is this information simply unavailable, or is there some other way to extract it?

        • Re: Getting memory transfer times when using CL_MEM_USE_HOST_PTR

          When you allocate a buffer with CL_MEM_ALLOC_HOST_PTR, there is no transfer going on (zero copy). If you create a buffer without a flag, there is a data transfer from host to device (except when the device is the CPU), but timestamp information is not available.


          When you map the data for host access, the memory object resides in host memory, so there is no transfer and the map time is therefore small. The transfer size is not shown in the unmap block because it's the same as the one shown in the map block.

            • Re: Getting memory transfer times when using CL_MEM_USE_HOST_PTR

              Thank you for your reply. I understand that even with zero-copy, when the device is neither the CPU nor an APU and therefore has its own device memory, the data will still get transferred to the device somehow; I was expecting that this transfer could be benchmarked.


              The reason for the question is the following: while the GPU kernel runtime with CL_MEM_USE_HOST_PTR is (as expected) higher than when buffers are copied to the GPU (either with COPY_HOST_PTR or by hand, separately), the longer runtime in the USE_HOST_PTR case is still lower than the total upload + runtime + download time in the copy case.


              Maybe the question should be geared more towards what exactly happens in the zero-copy case with a GPU. Does the GPU (a Radeon HD 6970 in our case) access the host memory directly via DMA, or are the buffers still cached in its device memory?


              The reason why this is relevant in our case is that we are doing one-shot processing, and the data transfer times account for 99% of the total program runtime (100-150ms for data transfers against 2-4ms for processing). With zero-copy, the kernel runtimes are 3-4 times worse, but still much better than the 150ms with the copy transfers. So our question is: are we actually getting a benefit in this case by using zero-copy, or is it still taking the same time, in a way that we can't measure because data exchange is handled automatically?
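
              To make the comparison concrete, here is a minimal sketch of the arithmetic using only the figures quoted above. The type range_ms and the functions copy_path_total and zero_copy_total are hypothetical names introduced for illustration. The sketch also assumes that the 3-4x slowdown applies across the whole 2-4ms kernel range, and that the zero-copy figure contains no cost beyond the measured kernel time, which is exactly the point in question.

```c
#include <assert.h>

/* A time range in milliseconds, taken from the figures in the post. */
typedef struct {
    double lo_ms;
    double hi_ms;
} range_ms;

/* Copy path: explicit upload/download time plus kernel runtime. */
range_ms copy_path_total(range_ms transfer, range_ms kernel)
{
    range_ms total;
    total.lo_ms = transfer.lo_ms + kernel.lo_ms;
    total.hi_ms = transfer.hi_ms + kernel.hi_ms;
    return total;
}

/*
 * Zero-copy path: the copy-case kernel runtime scaled by the observed
 * slowdown factor. ASSUMPTION: no separately measurable transfer time
 * is added; any hidden transfer cost is not modelled here.
 */
range_ms zero_copy_total(range_ms kernel, double slow_lo, double slow_hi)
{
    range_ms total;
    assert(slow_lo > 0.0 && slow_lo <= slow_hi);
    total.lo_ms = kernel.lo_ms * slow_lo;
    total.hi_ms = kernel.hi_ms * slow_hi;
    return total;
}
```

              With transfers of 100-150ms, a kernel of 2-4ms and a slowdown of 3-4x, this gives 102-154ms for the copy path against 6-16ms for the zero-copy kernel alone. Even the worst zero-copy case stays well below the best copy case, provided the measured kernel time really does include all the data movement.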