Overlap memory transfer/computation and latency hiding

Discussion created by erman_amd on Apr 14, 2011
Latest reply on Apr 15, 2011 by himanshu.gautam

Hi, I want to ask some basic questions. I'm new to AMD GPUs and OpenCL programming, and I want to use them in my thesis.

1. How do I overlap computation with memory transfer, so that I can compute on one buffer while transferring data to/from another buffer? Is this supported in the AMD APP SDK? I use AMD APP v2.3.

2. It is said that data transfer from CPU memory to GPU memory is performed using DMA. Is it handled automatically by the GPU hardware? Is there a way to access or issue commands to the DMA engine directly in code (as in IBM Cell processor programming)?

3. I read in the programming guide about hiding memory latency for a kernel with little ALU activity (point 4 of the Parallel Min() function example).

In the code:

global_work_size = compute_units * 7 * ws   // ws = 64; 7 wavefronts per SIMD

How is the value '7' obtained?

I have a Radeon HD 5870 card, on which 1 wavefront = 64 work-items. For this card, what is the minimum number of wavefronts needed to hide the memory latency?

What is the metric (or how do I measure) that indicates we have succeeded in hiding the memory latency? Can the SKA or the Profiler tell whether or not memory latency hiding is successful?

In the case of my kernel, it has little ALU activity, only 21-30% (as shown in the profiler).

Thank you.