
himanshu_gautam
Grandmaster

Asynchronous DMA + Kernel Execution using AMD GPUs

Hi all,

We have recently worked on this code to showcase asynchronous DMA + kernel execution on AMD GPUs. Please go through it and give us feedback. We hope it helps developers and students achieve better performance on AMD hardware.

Courtesy: Andryeyev German


7 Replies
himanshu_gautam
Grandmaster

Very good work AMD!

kd2
Adept II

Many thanks for this. For me, the main point is to use two command queues simultaneously.

But using this code, I get at most 4 GB/s of total throughput. I believe that number, and it shows I'm not getting the most out of the hardware. If the graphics card has GDDR5, my system has PC3-8500, and the two are connected by an x16 PCIe 2.0 link, shouldn't this code give throughput closer to 8 GB/s (and double that if the code can be changed so that the reads and writes are pinned to different memory sticks)?
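For reference, the raw link arithmetic (PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding, i.e. 500 MB/s per lane, per direction):

 x8 link:   8 x 500 MB/s = 4 GB/s per direction
 x16 link: 16 x 500 MB/s = 8 GB/s per direction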

Part of the slowdown may be that even with this program pushing half a gigabyte of memory back and forth, the card (a Tahiti) doesn't seem to want to kick its performance up to the full 16 PCIe lanes. It seems to be stuck at 8 lanes. For example, during this AsyncDMA program's run, aticonfig still shows 8 lanes being used:

# aticonfig --pplib-cmd "get activity"
Current Activity is Core Clock: 950MHZ
Memory Clock: 1425MHZ
VDDC: 1170
Activity: 53 percent
Performance Level: 2
Bus Speed: 5000
Bus Lanes: 8
Maximum Bus Lanes: 16

(And if you're wondering: I do have the system BIOS configured so that 16 lanes go directly to this card's PCIe slot.)

# lspci -vv -s 04:00.0 | grep LnkCap
                LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
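(Note that LnkCap only reports what the port is capable of; the width the link actually negotiated shows up in the LnkSta line, e.g. lspci -vv -s 04:00.0 | grep LnkSta.)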

Is there any way to use the full 16 lanes in a program such as this?


This looks more like an OS issue. Can you check performance under Windows?


You're absolutely right. It turned out that even though my BIOS allowed me to select x16, the PCIe riser card in the machine was only x8-capable. I swapped in a true x16 riser, and the AsyncDMA program now reports the 8.0 GB/s I had expected in the allocated-host-pointer case (so the system's RAM is now my bottleneck):

Write/Read operation 2 queue; profiling disabled using AHP: 8.01996 GB/s

----------- Time frame 16569.652 (us), scale 1:207
BufferWrite - W>; KernelExecution - X#; BufferRead - R<;
CommandQueue #0
<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<
CommandQueue #1
------W>>>>>>>>>>>>------W>>>>>>>>>>>>------W>>>>>>>>>>>-------W>>>>>>>>>>>>----

Write/Read operation 2 queue; profiling enabled using AHP: 7.76999 GB/s

# aticonfig --pplib-cmd "get activity"
Current Activity is Core Clock: 950MHZ
Memory Clock: 1425MHZ
VDDC: 1170
Activity: 63 percent
Performance Level: 2
Bus Speed: 5000
Bus Lanes: 16
Maximum Bus Lanes: 16


Hi forum,

I am going through the attached source code, and the following snippet is not clear to me. Inside ProfileQueue::findMinMax(...), you calculate the times taken by each operation (read, write, or kernel execution):


clGetEventProfilingInfo(events_[op][0], CL_PROFILING_COMMAND_START,
                        sizeof(cl_long), &time, NULL);
if (0 == *min_time)
{
    *min_time = time;
}
else
{
    *min_time = std::min<cl_long>(*min_time, time);
}

clGetEventProfilingInfo(events_[op][events_[op].size() - 1],
                        CL_PROFILING_COMMAND_END, sizeof(cl_long), &time, NULL);
if (0 == *max_time)
{
    *max_time = time;
}
else
{
    *max_time = std::max<cl_long>(*max_time, time);
}

As you can see, you are considering two separate events; should it not be the same event?

Some explanation would be helpful.

Regards

Sajjad


events_ is a 2D array of cl_events. findMinMax() needs to find the total time taken by all the events together across all the command queues, so we take the minimum of the START times of the first command in each queue (events_[op][0]) and the maximum of the END times of the last command in each queue (events_[op][events_[op].size() - 1]).
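In other words, something like this sketch (assuming events_ is a std::vector of per-queue std::vector<cl_event>s in submission order; the variable names here are illustrative, not the ones in the attached code):

cl_long min_start = 0, max_end = 0;
for (size_t op = 0; op < events_.size(); ++op)
{
    cl_long t;
    // Earliest START across all queues: the first event of each queue.
    clGetEventProfilingInfo(events_[op][0], CL_PROFILING_COMMAND_START,
                            sizeof(cl_long), &t, NULL);
    min_start = (0 == min_start) ? t : std::min<cl_long>(min_start, t);
    // Latest END across all queues: the last event of each queue.
    clGetEventProfilingInfo(events_[op].back(), CL_PROFILING_COMMAND_END,
                            sizeof(cl_long), &t, NULL);
    max_end = (0 == max_end) ? t : std::max<cl_long>(max_end, t);
}
cl_long total_ns = max_end - min_start;  // profiling timestamps are nanoseconds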


My two cents: don't get too tied up with the profiling setup and the display of the profiling results in that original code. The profiling was not the major point of the code, but it takes a lot of effort to get past it. The point is that on the GCN chips it is really simple to use three things asynchronously at once: the two DMA engines and kernel execution. I had to strip the code down before I understood that there's not much to it -- I'll attach.

For example,

./a.out
...
TEST use_kernel 1, n_bufs 3
Write/Kernel/Read  3 queues,  ALLOC_HOST,  6.34 GB/s.
   0: W   0.2-  2.6  2.5 ms, X   8.5-  9.3  0.7 ms, R   9.5- 12.3  2.8 ms,
   1: W   2.6-  5.1  2.4 ms, X   9.4- 10.1  0.7 ms, R  12.4- 17.5  5.1 ms,
   2: W   5.1-  7.6  2.6 ms, X  10.3- 11.0  0.7 ms, R  18.0- 22.7  4.8 ms,
   3: W  12.4- 17.5  5.1 ms, X  18.2- 19.0  0.7 ms, R  22.8- 27.9  5.1 ms,
   4: W  17.6- 22.7  5.1 ms, X  22.9- 23.7  0.7 ms, R  27.9- 33.0  5.0 ms,
...

The printout gives the start and end time in milliseconds of each Write, kernel eXecution, and Read, measured from the queueing of the first event. So, for example, using 3 command queues, you can see that while one DMA engine is reading back the results of dataset #2 (18.0-22.7 ms), the kernel is executing on dataset #3 (18.2-19.0 ms) and the second DMA engine is writing dataset #4 (17.6-22.7 ms).
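For anyone who wants the skeleton without digging through either program, the pattern boils down to something like the sketch below. This is a minimal illustration, not the attached code: the kernel, buffer count/size, and names are made up, and error handling is reduced to a CHECK macro. One in-order queue per stage, with per-dataset events chaining write -> kernel -> read, is enough to keep both DMA engines and the compute units busy at once.

#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define CHECK(e) do { if ((e) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d\n", (e)); exit(1); } } while (0)

static const char* kSrc =
    "__kernel void scale(__global float* buf) {\n"
    "    buf[get_global_id(0)] *= 2.0f;\n"
    "}\n";

int main()
{
    const size_t N_BUFS = 4;            // datasets kept in flight
    const size_t N_ELEMS = 1 << 22;     // 16 MB per dataset
    const size_t BYTES = N_ELEMS * sizeof(float);
    cl_int err;

    cl_platform_id plat; CHECK(clGetPlatformIDs(1, &plat, NULL));
    cl_device_id dev; CHECK(clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL));
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err); CHECK(err);

    // One in-order queue per pipeline stage.
    cl_command_queue writeQ = clCreateCommandQueue(ctx, dev, 0, &err); CHECK(err);
    cl_command_queue execQ  = clCreateCommandQueue(ctx, dev, 0, &err); CHECK(err);
    cl_command_queue readQ  = clCreateCommandQueue(ctx, dev, 0, &err); CHECK(err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, &err); CHECK(err);
    CHECK(clBuildProgram(prog, 1, &dev, "", NULL, NULL));
    cl_kernel kern = clCreateKernel(prog, "scale", &err); CHECK(err);

    // ALLOC_HOST_PTR buffers are pre-pinned on AMD's runtime, which is what
    // lets the DMA engines run at full speed (the "AHP" case in the output).
    std::vector<std::vector<float> > host(N_BUFS, std::vector<float>(N_ELEMS, 1.0f));
    std::vector<cl_mem> bufs(N_BUFS);
    for (size_t i = 0; i < N_BUFS; ++i) {
        bufs[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                 BYTES, NULL, &err);
        CHECK(err);
    }

    std::vector<cl_event> wrote(N_BUFS), ran(N_BUFS), readBack(N_BUFS);
    for (size_t i = 0; i < N_BUFS; ++i) {
        // Stage 1: host->device DMA on the write queue.
        CHECK(clEnqueueWriteBuffer(writeQ, bufs[i], CL_FALSE, 0, BYTES,
                                   host[i].data(), 0, NULL, &wrote[i]));
        // Stage 2: kernel waits only on this dataset's write.
        CHECK(clSetKernelArg(kern, 0, sizeof(cl_mem), &bufs[i]));
        CHECK(clEnqueueNDRangeKernel(execQ, kern, 1, NULL, &N_ELEMS, NULL,
                                     1, &wrote[i], &ran[i]));
        // Stage 3: device->host DMA waits only on this dataset's kernel.
        CHECK(clEnqueueReadBuffer(readQ, bufs[i], CL_FALSE, 0, BYTES,
                                  host[i].data(), 1, &ran[i], &readBack[i]));
    }
    clFlush(writeQ); clFlush(execQ); clFlush(readQ);
    // In-order queues mean the last read finishing implies everything finished.
    CHECK(clWaitForEvents(1, &readBack[N_BUFS - 1]));

    printf("pipeline complete\n");
    return 0;  // (releases omitted for brevity)
}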