
lukas_erlinghagen
Journeyman III

How to do concurrent data transfers while a kernel is executing?

Hi everyone,

I'm currently working on optimizing an OpenCL C++ application. The platform is an A8-3870 APU with an additional Radeon HD7750 GPU (Capeverde). I'm using the Catalyst 12.10 driver.

The application has several kernels and buffer reads, set up as a task graph that uses events for synchronization, running on command queues created with the out-of-order execution property. The kernels are scheduled on the HD7750 GPU queue, while the memory transfers are scheduled on the CPU queue. The buffer objects in question are created with CL_MEM_READ_WRITE | CL_MEM_HOST_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD. All command queues are flushed before the final call to clFinish().
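For reference, here is a minimal sketch of the setup described above, assuming an already-created context, queues, and kernel (all names, sizes, and the helper function are illustrative; CL_MEM_USE_PERSISTENT_MEM_AMD comes from AMD's cl_ext.h extension header):

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* CL_MEM_USE_PERSISTENT_MEM_AMD */

/* Sketch: enqueue a kernel on the GPU queue and a dependent,
 * non-blocking read on the CPU queue, linked by an event. */
cl_int overlap_read(cl_context ctx, cl_command_queue gpu_q,
                    cl_command_queue cpu_q, cl_kernel k,
                    size_t n, void *host_dst)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx,
        CL_MEM_READ_WRITE | CL_MEM_HOST_READ_ONLY |
        CL_MEM_USE_PERSISTENT_MEM_AMD, n, NULL, &err);
    if (err != CL_SUCCESS) return err;

    clSetKernelArg(k, 0, sizeof(buf), &buf);

    cl_event kernel_done;
    err = clEnqueueNDRangeKernel(gpu_q, k, 1, NULL, &n, NULL,
                                 0, NULL, &kernel_done);
    if (err != CL_SUCCESS) return err;

    /* The read waits on the kernel's event, not on the whole queue. */
    err = clEnqueueReadBuffer(cpu_q, buf, CL_FALSE, 0, n, host_dst,
                              1, &kernel_done, NULL);
    clFlush(gpu_q);
    clFlush(cpu_q);
    return err;
}
```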

I expect to see the memory transfers happen immediately after the kernel they depend on has finished. CodeXL's timeline (1.0.2409.0), however, indicates that the memory transfers in the CPU command queue only happen after all kernels in the GPU queue have finished.

When the memory transfers are scheduled on the GPU queue instead, they execute immediately after the kernel they depend on has finished, but independent kernels are not executed in parallel: the command queue behaves like an in-order queue.

What am I missing here? Thanks in advance for any hints.

1 Solution
himanshu_gautam
Grandmaster

Hi Lukas,

Here is my guess:

1. You don't really need 2 command queues to overlap memory transfers with kernel execution.

2. Use only 1 command queue.

3. Use pinned memory for your buffer.

4. Queue the kernel and the memory transfer one after another.

I am in the process of writing a program to verify this. I will post the results next week.
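The steps above can be sketched as follows, assuming an already-created context, in-order queue, kernel, and device buffer (all names here are illustrative; CL_MEM_ALLOC_HOST_PTR is the usual way to request pinned host memory, and AMD's runtime can use a mapped pointer from such a buffer as a pre-pinned transfer destination):

```c
#include <CL/cl.h>

/* Sketch: one in-order queue; pinned host-side staging buffer;
 * kernel and non-blocking read enqueued back to back so the DMA
 * engine can overlap the read with subsequent kernels.
 * Unmap/release calls are omitted for brevity. */
cl_int run_step(cl_context ctx, cl_command_queue q, cl_kernel k,
                size_t n, cl_mem dev_buf)
{
    cl_int err;
    /* Pinned (page-locked) host-accessible buffer. */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                   n, NULL, &err);
    if (err != CL_SUCCESS) return err;

    void *host_ptr = clEnqueueMapBuffer(q, pinned, CL_TRUE,
                                        CL_MAP_READ, 0, n,
                                        0, NULL, NULL, &err);
    if (err != CL_SUCCESS) return err;

    err = clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL,
                                 0, NULL, NULL);
    if (err != CL_SUCCESS) return err;

    /* Non-blocking read into the pinned allocation. */
    err = clEnqueueReadBuffer(q, dev_buf, CL_FALSE, 0, n, host_ptr,
                              0, NULL, NULL);
    clFlush(q);
    return err;
}
```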

9 Replies
nou
Exemplar

IIRC, if profiling is enabled, DMA transfers are disabled.

Thanks for the hint. Is this documented somewhere?


Only as a comment from an AMD developer. You must measure it with your own timers and compare against the CodeXL results. If your measured time is shorter, then the transfers are running concurrently.

himanshu_gautam
Grandmaster

Hi Lukas,

AFAIK, out-of-order queues are not supported by AMD's OpenCL runtime, so the behaviour you are reporting is to be expected.

Alternatively, you can use cl_event objects to make sure your commands execute in the correct order.

I'm already using cl_events to create a task-graph.

Since the runtime doesn't support out-of-order queues: is there a way to get/confirm the behaviour I'm looking for, i.e. transferring kernel results back to the host while another kernel is executing? I'll investigate whether a blocking clEnqueueMapBuffer before the next batch of kernels might help me out here.

himanshu_gautam
Grandmaster

Can you please tell us:

1. Devices on the OpenCL "context"

2. Some sample code on how you are creating the buffers on this context

Also, you could try "mapping" the buffers from the CPU queue. It is possible that you may get zero-copy behaviour - if you have the right configuration.
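Mapping from the CPU queue as suggested above can be sketched like this (the helper name and parameters are illustrative):

```c
#include <CL/cl.h>

/* Sketch: map the result buffer from the CPU queue; with the right
 * buffer flags this can be a zero-copy access rather than a copy.
 * Release the mapping later with clEnqueueUnmapMemObject. */
void *map_result(cl_command_queue cpu_q, cl_mem buf, size_t n,
                 cl_event *kernel_done, cl_int *err)
{
    /* Blocking map: returns once the data is host-visible. */
    return clEnqueueMapBuffer(cpu_q, buf, CL_TRUE, CL_MAP_READ,
                              0, n,
                              kernel_done ? 1 : 0, kernel_done,
                              NULL, err);
}
```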

Check Table 4.2 (in AMD APP Programming Guide) -- that helps you with the location placement of various OpenCL memory objects for different flags.

However, I am not too sure what "VM" means in the table. If somebody could throw some light, it will be useful.


My guess is that VM stands for Virtual Memory.


Hello Nou,

VM refers to Virtual Memory support in the graphics driver, which is used by the OpenCL runtime.

You can check this by querying the driver version, e.g.: Driver version: 1112.0 (VM)
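The query mentioned above is a straightforward clGetDeviceInfo call (the function name is illustrative):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: print the driver version string; on VM-enabled AMD
 * drivers it carries a "(VM)" suffix, e.g. "1112.0 (VM)". */
void print_driver_version(cl_device_id dev)
{
    char ver[256];
    if (clGetDeviceInfo(dev, CL_DRIVER_VERSION, sizeof(ver),
                        ver, NULL) == CL_SUCCESS)
        printf("Driver version: %s\n", ver);
}
```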

I think this is related to zero-copy support.

I will talk to the AMD engineering team about including this in the AMD Programming Guide.

I am not too sure if one can write portable OpenCL code by taking advantage of buffer-placements with or without VM enabled.

But you can definitely write optimal code for AMD platforms.

Thanks,
