7 Replies Latest reply on May 22, 2013 6:50 AM by kd2

    Asynchronous DMA  + Kernel Execution using AMD GPUs

    himanshu.gautam

      Hi all,

      We have recently put together this code to showcase asynchronous DMA + kernel execution on AMD GPUs. Please go through it and give feedback. We hope it helps many developers and students achieve better performance on AMD hardware.

      Courtesy: Andryeyev German

       


        • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
          kd2

          Many thanks for this. For me, the main point is to use two command queues simultaneously.

           

          But running this code, it reports that I get at most 4 GB/s of total throughput. I believe that number, and it shows I'm not getting the most out of the hardware. If the graphics card has GDDR5, my system has PC3-8500, and they're connected by an x16 PCIe 2.0 link, shouldn't this code give throughput closer to 8 GB/s (and double that if the code can be changed so that the reads and writes are pinned on different memory sticks)?

           

          Part of the slowdown may be that even with this program pushing half a gigabyte of memory back and forth, the card (a Tahiti) doesn't seem to want to ramp up to the full 16 PCIe lanes. It seems to be stuck at 8 lanes. For example, during this AsyncDMA program's run, aticonfig still shows 8 lanes being utilized...

           

          # aticonfig --pplib-cmd "get activity"

          Current Activity is Core Clock: 950MHZ

          Memory Clock: 1425MHZ

          VDDC: 1170

          Activity: 53 percent

          Performance Level: 2

          Bus Speed: 5000

          Bus Lanes: 8

          Maximum Bus Lanes: 16

           

          (and if you're wondering, I do have the system BIOS configured so that 16 lanes go directly to this card's PCIe slot)

          # lspci -vv -s 04:00.0 | grep LnkCap

                          LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Latency L0 <64ns, L1 <1us

           

          Is there any way to use the full 16 lanes in a program such as this?

            • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
              himanshu.gautam

              This looks more like an OS issue. Can you check the performance under Windows?

                • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
                  kd2

                  You're absolutely right. It turned out that even though my BIOS allowed me to select x16, the PCIe riser card in the machine was only x8-capable. I swapped in a true x16 riser, and the AsyncDMA program now reports the 8.0 GB/s I had expected in the allocated-host-pointer case (so the system's RAM is now my bottleneck)...

                   

                  Write/Read operation 2 queue; profiling disabled using AHP: 8.01996 GB/s

                  ----------- Time frame 16569.652 (us), scale 1:207

                  BufferWrite - W>; KernelExecution - X#; BufferRead - R<;

                  CommandQueue #0

                  <<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<<R<<<<<<<<<<<<<<<<<

                  CommandQueue #1

                  ------W>>>>>>>>>>>>------W>>>>>>>>>>>>------W>>>>>>>>>>>-------W>>>>>>>>>>>>----

                  Write/Read operation 2 queue; profiling enabled using AHP: 7.76999 GB/s

                   

                  # aticonfig --pplib-cmd "get activity"

                  Current Activity is Core Clock: 950MHZ

                  Memory Clock: 1425MHZ

                  VDDC: 1170

                  Activity: 63 percent

                  Performance Level: 2

                  Bus Speed: 5000

                  Bus Lanes: 16

                  Maximum Bus Lanes: 16

                    • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
                      sajis997

                      Hi forum,

                       

                      I am going through the attached source code. The following snippet is not clear to me.

                       

                      Inside ProfileQueue::findMinMax(...), you are calculating the time taken by each operation (read, write, or kernel execution):

                       

                      [code]
                      clGetEventProfilingInfo(events_[op][0], CL_PROFILING_COMMAND_START,
                                              sizeof(cl_long), &time, NULL);
                      if (0 == *min_time)
                      {
                          *min_time = time;
                      }
                      else
                      {
                          *min_time = std::min<cl_long>(*min_time, time);
                      }

                      clGetEventProfilingInfo(events_[op][events_[op].size() - 1],
                                              CL_PROFILING_COMMAND_END,
                                              sizeof(cl_long), &time, NULL);
                      if (0 == *max_time)
                      {
                          *max_time = time;
                      }
                      else
                      {
                          *max_time = std::max<cl_long>(*max_time, time);
                      }
                      [/code]

                       

                       

                      As you can see, you are considering two separate events; should it not be the same event?

                       

                      Some explanation would be helpful.

                       

                       

                      Regards

                      Sajjad

                        • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
                          himanshu.gautam

                          events_ is a 2D array of cl_events. findMinMax() needs to find the total time taken by all the events together across all the command queues, so we take the min of the START time of the first command in each queue (events_[op][0]) and the max of the END time of the last command in each queue (events_[op][events_[op].size() - 1]).

                          • Re: Asynchronous DMA  + Kernel Execution using AMD GPUs
                            kd2

                            My two cents: don't get too tied up with the profiling setup and the display of profiling results in that original code. The profiling was not the major point of the code, but it takes a lot of effort to get past it. The point is that on GCN chips it is really simple to use three things asynchronously at once: the two DMA engines and kernel execution. I had to strip the code down before I understood that there's not much to it -- I'll attach it.

                             

                            For example,

                            ./a.out

                            ...

                            TEST use_kernel 1, n_bufs 3

                            Write/Kernel/Read  3 queues,  ALLOC_HOST,  6.34 GB/s.

                               0: W   0.2-  2.6  2.5 ms, X   8.5-  9.3  0.7 ms, R   9.5- 12.3  2.8 ms,

                               1: W   2.6-  5.1  2.4 ms, X   9.4- 10.1  0.7 ms, R  12.4- 17.5  5.1 ms,

                               2: W   5.1-  7.6  2.6 ms, X  10.3- 11.0  0.7 ms, R  18.0- 22.7  4.8 ms,

                               3: W  12.4- 17.5  5.1 ms, X  18.2- 19.0  0.7 ms, R  22.8- 27.9  5.1 ms,

                               4: W  17.6- 22.7  5.1 ms, X  22.9- 23.7  0.7 ms, R  27.9- 33.0  5.0 ms,

                            ...

                            The printout shows the time in milliseconds of each Write, eXecution, and Read (measured from the queueing of the first event). So, for example, using 3 command queues, you can see that while one DMA engine is reading back the results of dataset #2 (18.0-22.7 ms), the kernel is executing on dataset #3 (18.2-19.0 ms), and the second DMA engine is writing dataset #4 (17.6-22.7 ms).