Hi all,
We have recently worked on this code to showcase asynchronous DMA + kernel execution on AMD GPUs. Please go through it and give us your feedback. We hope it helps many developers and students achieve better performance on AMD hardware.
Courtesy: Andryeyev German
Message was edited by: Himanshu Gautam
Very good work AMD!
Many thanks for this. For me, the main point is to use two command queues simultaneously.
But running this code, I get at most 4 GB/s of total throughput. I believe that number, and it shows I'm not getting the most out of the hardware. If the graphics card has GDDR5, my system has PC3-8500, and they're connected by an x16 PCIe 2.0 link, shouldn't this code report throughput closer to 8 GB/s (and double that if the code could be changed so that reads and writes are pinned on different memory sticks)?
Part of the slowdown may be that even with this program pushing half a gigabyte of memory back and forth, the card (a Tahiti) doesn't seem to want to kick its performance up to using the full 16 PCIe lanes. It seems to be stuck at 8 lanes. For example, during this AsyncDMA program's run, aticonfig still shows 8 lanes being utilized...
# aticonfig --pplib-cmd "get activity"
Current Activity is Core Clock: 950MHZ
Memory Clock: 1425MHZ
VDDC: 1170
Activity: 53 percent
Performance Level: 2
Bus Speed: 5000
Bus Lanes: 8
Maximum Bus Lanes: 16
(and if you're wondering, I do have the system bios configured so that 16 lanes go directly to this card's PCI slot)
# lspci -vv -s 04:00.0 | grep LnkCap
LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us
Is there any way to use the full 16 lanes in a program such as this?
This looks more like an OS issue. Can you check the performance under Windows?
You're absolutely right. It turned out that even though my BIOS allowed me to select x16, the PCIe riser card in the machine was only x8 capable. I swapped in a true x16 riser, and now the AsyncDMA program reports the 8.0 GB/s I had expected in the allocated-host-pointer case (so the system's RAM is now my bottleneck).
Write/Read operation 2 queue; profiling disabled using AHP: 8.01996 GB/s
----------- Time frame 16569.652 (us), scale 1:207
BufferWrite - W>; KernelExecution - X#; BufferRead - R<;
CommandQueue #0
<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<
CommandQueue #1
------W>>>>>>>>>>>>------W>>>>>>>>>>>>------W>>>>>>>>>>>-------W>>>>>>>>>>>>----
Write/Read operation 2 queue; profiling enabled using AHP: 7.76999 GB/s
# aticonfig --pplib-cmd "get activity"
Current Activity is Core Clock: 950MHZ
Memory Clock: 1425MHZ
VDDC: 1170
Activity: 63 percent
Performance Level: 2
Bus Speed: 5000
Bus Lanes: 16
Maximum Bus Lanes: 16
Hi forum,
I am going through the attached source code, and the following snippet is not clear to me.
Inside ProfileQueue::findMinMax(...), you are calculating the times taken by each operation (read, write or kernel execution):

clGetEventProfilingInfo(events_[op][0], CL_PROFILING_COMMAND_START,
                        sizeof(cl_long), &time, NULL);
if (0 == *min_time)
{
    *min_time = time;
}
else
{
    *min_time = std::min<cl_long>(*min_time, time);
}
clGetEventProfilingInfo(events_[op][events_[op].size() - 1],
                        CL_PROFILING_COMMAND_END, sizeof(cl_long), &time, NULL);
if (0 == *max_time)
{
    *max_time = time;
}
else
{
    *max_time = std::max<cl_long>(*max_time, time);
}
As you can see, you are considering two separate events; should it not be the same event?
Some explanation would be helpful.
Regards
Sajjad
events_ is a 2D array of cl_events. The function findMinMax needs to find the total time taken by all the events together across all the command queues, so we take the min of the START time of the first command in each queue (events_[op][0]) and the max of the END time of the last command in each queue (events_[op][events_[op].size() - 1]).
My two cents: don't get too tied up with the profiling setup and the display of profiling results in that original code. The profiling was not the major point of the code, yet it takes a lot of effort to get past it. The point is that on the GCN chips it is really simple to use three things asynchronously at once: the dual DMA engines and kernel execution. I had to strip the code down before I understood that there's not much to it -- I'll attach it.
For example,
./a.out
...
TEST use_kernel 1, n_bufs 3
Write/Kernel/Read 3 queues, ALLOC_HOST, 6.34 GB/s.
0: W 0.2- 2.6 2.5 ms, X 8.5- 9.3 0.7 ms, R 9.5- 12.3 2.8 ms,
1: W 2.6- 5.1 2.4 ms, X 9.4- 10.1 0.7 ms, R 12.4- 17.5 5.1 ms,
2: W 5.1- 7.6 2.6 ms, X 10.3- 11.0 0.7 ms, R 18.0- 22.7 4.8 ms,
3: W 12.4- 17.5 5.1 ms, X 18.2- 19.0 0.7 ms, R 22.8- 27.9 5.1 ms,
4: W 17.6- 22.7 5.1 ms, X 22.9- 23.7 0.7 ms, R 27.9- 33.0 5.0 ms,
...
The printout is the time in milliseconds of each Write, eXecution and Read (measured from the queueing of the first event). So for example, using 3 command queues, you see that while one DMA engine is reading the results of dataset #2 (18.0-22.7 ms), the kernel is executing on dataset #3 (18.2-19.0 ms), and the second DMA engine is writing dataset #4 (17.6-22.7 ms).