3 Replies Latest reply on Apr 15, 2011 5:04 AM by himanshu.gautam

    Overlap memory transfer/computation and latency hiding

    erman_amd

      Hi, I want to ask some basic questions. I'm new to AMD GPU and OpenCL programming, and I want to use them in my thesis.

      1. How can I overlap computation with memory transfer, so that I can compute on one buffer while transferring another? Is this supported in the AMD APP SDK? I use AMD APP SDK v2.3.

      2. It is said that data transfer from CPU memory to GPU memory is executed using DMA. Is it executed automatically by the GPU hardware? Is there a way to access the DMA engine or issue commands to it directly in code (as in IBM Cell processor programming)?

      3. I read in the programming guide about memory latency hiding for a kernel with little ALU activity (point 4 of the Parallel Min() function example).

      In the code:

      global_work_size = compute_units * 7 * ws (=64) // 7 wavefronts per SIMD 

      Where does the value '7' come from?

      I have a 5870 card, on which 1 wavefront = 64 work-items. For this card, what is the minimum number of wavefronts needed to hide the memory latency?

      What metric indicates (or how can I measure) that memory latency is being hidden successfully? Can the SKA or the Profiler tell whether or not the latency hiding has succeeded?

      My kernel has little ALU activity, only 21-30% (as shown in the profiler).

      Thank you.

       

       

       

       

        • Overlap memory transfer/computation and latency hiding
          himanshu.gautam

          erman_amd,

          You can perform memory transfer and computation in parallel using the DMA engines.

          Refer to the BufferBandwidth and TransferOverlap samples for details on how to get the best memory access patterns. There is also a detailed description in Chapter 4 of the OpenCL programming guide.
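As a rough illustration of the double-buffering idea (not taken from the SDK samples), here is a plain-C model of the schedule. It makes no OpenCL calls; `Step` and `plan_double_buffer` are hypothetical names invented for this sketch. In a real host program, each upload would typically be a non-blocking `clEnqueueWriteBuffer` on one command queue and each compute a `clEnqueueNDRangeKernel` on a second queue, with events ordering each kernel after its own upload. Whether the two actually run concurrently depends on the driver and hardware.

```c
/* One pipeline step. A value of -1 means that engine is idle in this step. */
typedef struct {
    int upload_chunk;   /* chunk being copied to the device in this step */
    int upload_buffer;  /* device buffer (0 or 1) receiving the copy     */
    int compute_chunk;  /* chunk the kernel processes in this step       */
    int compute_buffer; /* device buffer (0 or 1) the kernel reads       */
} Step;

/* Fill `steps` with a double-buffered schedule for `n_chunks` chunks.
 * Chunk i always lives in buffer i % 2, so within one step the upload
 * and the kernel never touch the same buffer.
 * `steps` must have room for n_chunks + 1 entries.
 * Returns the number of steps written (0 if n_chunks <= 0). */
int plan_double_buffer(int n_chunks, Step *steps)
{
    if (n_chunks <= 0)
        return 0;

    int n_steps = n_chunks + 1; /* one extra step to drain the last compute */
    for (int s = 0; s < n_steps; ++s) {
        int up   = (s < n_chunks) ? s : -1; /* upload runs one chunk ahead */
        int comp = (s >= 1) ? s - 1 : -1;   /* compute trails by one chunk */

        steps[s].upload_chunk   = up;
        steps[s].upload_buffer  = (up >= 0) ? up % 2 : -1;
        steps[s].compute_chunk  = comp;
        steps[s].compute_buffer = (comp >= 0) ? comp % 2 : -1;
    }
    return n_steps;
}
```

For three chunks this gives four steps: the first only uploads chunk 0, the middle two upload one chunk while computing the previous one, and the last only computes chunk 2.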

          Generally, the magic number for any algorithm depends on the algorithm itself. The best way is to use the Profiler and try to get the ALU Busy value as high as possible for compute-intensive kernels. 20-30% is not very good and you should try to improve it. Again, refer to Chapter 4 of the programming guide to see what suits your case.
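To make the work-size formula from the question concrete, here is a minimal sketch of the arithmetic. The helper name `global_work_size` and the parameter `waves_per_cu` are hypothetical; the wavefront size of 64 is the value stated in the question for the 5870.

```c
#include <stddef.h>

/* Wavefront size in work-items (64 on the HD 5870, per the question). */
#define WAVEFRONT_SIZE 64

/* Global work size that places `waves_per_cu` wavefronts on each
 * compute unit. `waves_per_cu` is the tunable "magic number"
 * (7 in the programming guide's example). */
size_t global_work_size(size_t compute_units, size_t waves_per_cu)
{
    return compute_units * waves_per_cu * WAVEFRONT_SIZE;
}
```

Assuming the device reports 20 compute units (which is what I would expect for a 5870; query `CL_DEVICE_MAX_COMPUTE_UNITS` with `clGetDeviceInfo` to confirm), `global_work_size(20, 7)` gives 20 * 7 * 64 = 8960 work-items. Keeping the wavefront count as a parameter makes it easy to try other values and compare the Profiler results.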

           

          Lastly, get AMD APP SDK 2.4 and also install the latest driver.

          Thanks