
erman_amd
Journeyman III

Overlap memory transfer/computation and latency hiding

Hi, I'd like to ask some basic questions. I'm new to AMD GPU and OpenCL programming and want to use them in my thesis.

1. How can I overlap computation with memory transfer, so that I can compute on one buffer while transferring data to or from another buffer? Is this supported in the AMD APP SDK? I use AMD APP SDK v2.3.

2. It is said that data transfer from CPU memory to GPU memory is done using DMA. Is this handled automatically by the GPU hardware? Is there a way to access or issue commands to the DMA engine directly in code (as in IBM Cell processor programming)?

3. I read in the programming guide about memory latency hiding for a kernel with little ALU activity (point 4 of the Parallel Min() function example).

In the code:

global_work_size = compute_units * 7 * ws (=64) // 7 wavefronts per SIMD 

How is the value '7' obtained?

I have a 5870 card, on which 1 wavefront = 64 work-items. For this card, what is the minimum number of wavefronts needed to hide the memory latency?

What is the measure (or how do I measure) that indicates we have succeeded in hiding the memory latency? Can the SKA or the Profiler tell whether the memory latency hiding is successful?

My kernel has little ALU activity, only 21-30% (as shown in the Profiler).

Thank you.

3 Replies
himanshu_gautam
Grandmaster

erman_amd,

You can do memory transfer and computation in parallel using the DMA engines.

Refer to the BufferBandwidth and TransferOverlap samples for details on how to get the best memory access patterns. There is also a detailed description in Chapter 4 of the OpenCL Programming Guide.

Generally, the magic number for any algorithm depends on the algorithm itself. The best way is to use the Profiler and try to get the ALU Busy value as high as possible for compute-intensive kernels. 20-30% is not very good and you should try to improve it. Again, refer to Chapter 4 of the Programming Guide to check what suits your case.
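The launch-size formula quoted in the question can be sketched as a small helper, so the wavefront count becomes a parameter to tune rather than a fixed constant. This is only an illustration: the function name `global_work_size` and the `WAVEFRONT_SIZE` macro are hypothetical, and the compute-unit count is assumed to come from `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS` (or to be known for the card in use).

```c
#include <assert.h>
#include <stddef.h>

/* Work-items per wavefront; 64 on the HD 5870, as stated in the question. */
#define WAVEFRONT_SIZE 64

/*
 * Hypothetical helper: total number of work-items to launch so that every
 * compute unit is given `wavefronts_per_cu` wavefronts.
 * `compute_units` is assumed to be the value reported by
 * clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...).
 */
static size_t global_work_size(size_t compute_units, size_t wavefronts_per_cu)
{
    assert(compute_units > 0 && wavefronts_per_cu > 0);
    return compute_units * wavefronts_per_cu * WAVEFRONT_SIZE;
}
```

For a device with 20 compute units (assumed here for the HD 5870), `global_work_size(20, 7)` gives 8960 work-items. Timing the kernel while varying the second argument is one possible way to look for the smallest wavefront count beyond which run time stops improving.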

One last thing: get AMD APP SDK 2.4 and also install the latest driver.

Thanks


himanshu,

I cannot find the bufferbandwidth and transfer overlap examples. Can you point me to the specific folder in the samples? Or are they in SDK 2.4? I use SDK 2.3 now and will change to 2.4 as you suggest. I hope it works once I upgrade to SDK 2.4.


erman_amd,

Yes, those samples are in SDK 2.4.

Also refer to the latest AMD APP OpenCL Programming Guide for the special optimization section on buffer transfers. A lot has also been added to the FAQ document, which might interest you.

Thanks
