6 Replies Latest reply on Mar 9, 2015 9:13 AM by markuswirth

    Enqueue from device (Dynamic parallelism) seem to be slower than kernel invoking from host?

    kevinmanhn

      Dear all,

      I have been testing the new OpenCL 2.0 feature, dynamic parallelism (enqueuing kernels directly from the device). I ran the test on Ubuntu 14.04 on an APU 7850K machine, which is compatible with OpenCL 2.0. I have two test versions: version 1 has several kernels that are invoked from the host; version 2 is similar, but the kernels are invoked directly from the GPU device. Everything worked fine, and I measured the execution time (wall-clock time) of those kernels.

      I expected version 2 to perform better than version 1. However, version 2 (device enqueue) turned out to run slower than version 1.

      Here are my results:

      CL_QUEUE_SIZE is 131072
      Using platform: Advanced Micro Devices, Inc. and device: Hawaii.
      OpenCL Device info:
      CL_DEVICE_VENDOR            :Advanced Micro Devices, Inc.
      CL_DEVICE_NAME                :Hawaii
      CL_DRIVER_VERSION            :1642.5 (VM)
      CL_DEVICE_PROFILE            :FULL_PROFILE
      CL_DEVICE_VERSION            :OpenCL 2.0 AMD-APP (1642.5)
      CL_DEVICE_OPENCL_C_VERSION        :OpenCL C 2.0
      CL_DEVICE_MAX_COMPUTE_UNITS        :      44
      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS    :       3
      CL_DEVICE_MAX_WORK_ITEM_SIZES        :    (  256,   256,   256)%
      CL_DEVICE_MAX_WORK_GROUP_SIZE        :     256
      CL_DEVICE_MEM_BASE_ADDR_ALIGN        :    2048
      CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE    :     128
      CL_DEVICE_MAX_CLOCK_FREQUENCY        :    1030
      CL_DEVICE_LOCAL_MEM_SIZE        :     32768.0
      CL_DEVICE_MAX_MEM_ALLOC_SIZE        :2503999488.0


      Version 1 - Host Enqueue

      Execution Time = 0.001694 s of Matrix size = 128x128
      Execution Time = 0.001822 s of Matrix size = 256x256
      Execution Time = 0.004094 s of Matrix size = 512x512
      Execution Time = 0.010669 s of Matrix size = 1024x1024
      Execution Time = 0.017005 s of Matrix size = 2048x2048
      Execution Time = 0.031004 s of Matrix size = 4096x4096


      Version 2 - Device Enqueue
      Execution Time = 0.005847 s of Matrix size = 128x128
      Execution Time = 0.002678 s of Matrix size = 256x256
      Execution Time = 0.006043 s of Matrix size = 512x512
      Execution Time = 0.012492 s of Matrix size = 1024x1024
      Execution Time = 0.038608 s of Matrix size = 2048x2048
      Execution Time = 0.071338 s of Matrix size = 4096x4096
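To make the gap between the two lists above easier to read, the timings can be reduced to a device/host ratio per matrix size. The following is only a small sketch in C: the function and array names (`slowdown`, `host_times`, `device_times`, `print_slowdowns`) are made up for illustration, and the numbers are copied from the output posted above.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of device-enqueue time to host-enqueue time.
 * A value above 1.0 means the device-enqueue version was slower. */
double slowdown(double host_s, double device_s)
{
    assert(host_s > 0.0);
    return device_s / host_s;
}

/* Timings in seconds, copied from the two result lists above. */
static const int    sizes[]        = {128, 256, 512, 1024, 2048, 4096};
static const double host_times[]   = {0.001694, 0.001822, 0.004094,
                                      0.010669, 0.017005, 0.031004};
static const double device_times[] = {0.005847, 0.002678, 0.006043,
                                      0.012492, 0.038608, 0.071338};

/* Print one ratio per matrix size. */
void print_slowdowns(void)
{
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; ++i)
        printf("%4dx%-4d  device/host = %.2fx\n",
               sizes[i], sizes[i], slowdown(host_times[i], device_times[i]));
}
```

Calling `print_slowdowns()` from a `main` shows that, for these numbers, device enqueue is between roughly 1.2x and 3.5x slower depending on the matrix size.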


      Can anyone please explain why?

      Thank you.

      Kevin.


        • Re: Enqueue from device (Dynamic parallelism) seem to be slower than kernel invoking from host?
          dipak

          Hi Kevin,

          In general, applications are expected to perform better with device-side enqueue than with multiple host-side enqueues. Could you please share your test case (host + device code) with us?

          BTW, what was your test environment? You mentioned an APU 7850K, but the clinfo output above shows Hawaii. Do you see similar behaviour on other setups, say, with another GPU, another version of the OpenCL 2.0 driver, or on Windows?

           

          Regards,

            • Re: Enqueue from device (Dynamic parallelism) seem to be slower than kernel invoking from host?
              kevinmanhn
              Hi,

               

              I wrote a simple program to test the device enqueue performance.

              I ran it on Ubuntu 14.04 with OpenCL 2.0 on an A10-7850K APU (Spectre GPU) with 16 GB of RAM.

              The test adds two arrays, C = A + B. Each kernel launch computes one element of the array: C[iteration] = A[iteration] + B[iteration].

              The results are as follows:

              CL_QUEUE_SIZE is 131072
              Using platform: Advanced Micro Devices, Inc. and device: Spectre.
              OpenCL Device info:
              CL_DEVICE_VENDOR            :Advanced Micro Devices, Inc.
              CL_DEVICE_NAME                :Spectre
              CL_DRIVER_VERSION            :1642.5 (VM)
              CL_DEVICE_PROFILE            :FULL_PROFILE
              CL_DEVICE_VERSION            :OpenCL 2.0 AMD-APP (1642.5)
              CL_DEVICE_OPENCL_C_VERSION        :OpenCL C 2.0
              CL_DEVICE_MAX_COMPUTE_UNITS        :       8
              CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS    :       3
              CL_DEVICE_MAX_WORK_ITEM_SIZES        :    (  256,   256,   256)%
              CL_DEVICE_MAX_WORK_GROUP_SIZE        :     256
              CL_DEVICE_MEM_BASE_ADDR_ALIGN        :    2048
              CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE    :     128
              CL_DEVICE_MAX_CLOCK_FREQUENCY        :     720
              CL_DEVICE_LOCAL_MEM_SIZE        :     32768.0
              CL_DEVICE_MAX_MEM_ALLOC_SIZE        : 211550208.0

              Test host enqueue
              Execution time = 0.005345 s for 1000 host enqueue times
              The result is correct
              ====================================================================
              Test device enqueue
              Execution time = 0.247045 s for 1000 device enqueue times
              The result is correct

              End of program
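The two totals above can be turned into an average cost per enqueue, which makes the difference easier to compare. This is only a sketch in C; the helper name `per_enqueue_us` is invented for illustration, and the average includes everything that falls inside the timed region (kernel execution and any synchronisation), not just the enqueue overhead itself.

```c
#include <assert.h>

/* Average cost of one launch in microseconds,
 * given the total wall-clock time in seconds and the number of launches. */
double per_enqueue_us(double total_s, int launches)
{
    assert(launches > 0);
    return total_s * 1e6 / launches;
}
```

With the figures posted above, `per_enqueue_us(0.005345, 1000)` gives about 5.3 microseconds per host enqueue and `per_enqueue_us(0.247045, 1000)` about 247 microseconds per device enqueue, a factor of roughly 46.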

               

              Please see the attached code and explain why the device enqueue version is slower than the host enqueue version.

              Thank you.

              Kevin.

                • Re: Enqueue from device (Dynamic parallelism) seem to be slower than kernel invoking from host?
                  kevinmanhn

                  Hi,

                  I also tested AMD's RegionGrowingSegmentation sample on my computer (Ubuntu 14.04, OpenCL 2.0, A10-7850K APU with Spectre GPU, 16 GB of RAM).

                  The result shows that device enqueue again runs slower than host enqueue. I am doing research on OpenCL dynamic parallelism, so please let me know why the new feature runs slower than the old one.

                  Thank you,

                  Kevin

                   

                  Here is the result:

                  Selected Platform Vendor : Advanced Micro Devices, Inc.
                  Device 0 : Spectre Device ID is 0x15a8a20
                  Build Options are : -I. -cl-std=CL2.0
                  Device enqueue of kernels...

                  | Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
                  |-------|--------|------------------------|------------|
                  | 480   | 480    | 0.529234               | 435346     |

                  kevin@kevinPC:~/RegionGrowingSegmentation$ ./RegionGrowingSegmentation -t -o
                  Platform 0 : Advanced Micro Devices, Inc.
                  Image Size [Width = 480 , Height = 480]
                  Platform found : Advanced Micro Devices, Inc.

                  Selected Platform Vendor : Advanced Micro Devices, Inc.
                  Device 0 : Spectre Device ID is 0x1b76a20
                  Build Options are : -I. -cl-std=CL2.0
                  Host enqueue of kernels...

                  | Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
                  |-------|--------|------------------------|------------|
                  | 480   | 480    | 0.486141               | 473936     |

                    • Re: Enqueue from device (Dynamic parallelism) seem to be slower than kernel invoking from host?
                      kevinmanhn

                      Hi,
                      I also tested the Region Growing sample on Windows 8.1, on a computer with an A10-7850K (3.7GHz) processor and 16GB of RAM. There, device enqueue and host enqueue showed similar performance. However, the article (http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/) reports device enqueue to be 3 times faster.

                      Could you please explain this?
                      Thank you,

                      Kevin.

                       

                      Here are the results:

                      Platform 0 : Advanced Micro Devices, Inc.
                      Image Size [Width = 480 , Height = 480]
                      Platform found : Advanced Micro Devices, Inc.

                      Selected Platform Vendor : Advanced Micro Devices, Inc.
                      Device 0 : Spectre Device ID is 000000DAC47C3A10
                      Build Options are : -I. -cl-std=CL2.0
                      Host enqueue of kernels…

                      | Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
                      |-------|--------|------------------------|------------|
                      | 480   | 480    | 0.524794               | 439029     |

                      ////////////////////////////////////////////////////////////////////////////////////////////////////////

                      Platform 0 : Advanced Micro Devices, Inc.
                      Image Size [Width = 480 , Height = 480]
                      Platform found : Advanced Micro Devices, Inc.

                      Selected Platform Vendor : Advanced Micro Devices, Inc.
                      Device 0 : Spectre Device ID is 00000081609E3A00
                      Build Options are : -I. -cl-std=CL2.0
                      Device enqueue of kernels…

                      | Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
                      |-------|--------|------------------------|------------|
                      | 480   | 480    | 0.524605               | 439187     |

                    • Re: Enqueue from device (Dynamic parallelism) seem to be slower than kernel invoking from host?
                      dipak

                      Thanks for sharing the code. I'll check and get back to you.

                       

                      Regards,

                        • Re: Enqueue from device (Dynamic parallelism) seem to be slower than kernel invoking from host?
                          markuswirth

                          Hi,

                          I think I have a very similar problem.

                          Over the last week, I tried to optimize an OpenCL application. The application (Connected Component Labeling) has to repeatedly enqueue two kernels (one to do some work and another to check for changes) until a status bit is set.

                          I therefore tried the device-side enqueue feature introduced in OpenCL 2.0, to let the GPU (Radeon R9 290 on Win7, with the current Omega drivers and AMD APP SDK 3.0 beta) enqueue the kernels itself until the work is finished.

                          The device-side enqueue version now works and produces correct results, but while profiling it I noticed that it is much slower than the version in which the CPU enqueues the kernels.

                          I also noticed in CodeXL that the occupancy of the new kernel in the device-side enqueue version is much lower (50%) than that of the two separate kernels scheduled in a loop by the CPU (100% each). CodeXL tells me that this is caused by the number of waves being limited by high VGPR usage.
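The link between VGPR usage and the occupancy figures mentioned above can be sketched with a simple model. The constants below are assumptions about GCN hardware such as Hawaii (256 VGPRs available per SIMD lane, at most 10 wavefronts per SIMD, VGPRs allocated in blocks of 4); they are not taken from this thread. The function names are invented for illustration, and the model ignores the other limits a profiler may take into account (SGPRs, LDS, work-group size).

```c
#include <assert.h>

/* Assumed GCN limits, not confirmed by this thread. */
enum {
    VGPRS_PER_LANE     = 256, /* vector registers available per SIMD lane */
    MAX_WAVES_PER_SIMD = 10,  /* hardware cap on resident wavefronts      */
    VGPR_GRANULE       = 4    /* registers are allocated in blocks of 4   */
};

/* Wavefronts that fit on one SIMD when only VGPR usage is the limit. */
int waves_per_simd(int vgprs)
{
    assert(vgprs > 0 && vgprs <= VGPRS_PER_LANE);
    /* Round the request up to the allocation granularity. */
    int allocated = (vgprs + VGPR_GRANULE - 1) / VGPR_GRANULE * VGPR_GRANULE;
    int waves = VGPRS_PER_LANE / allocated;
    return waves < MAX_WAVES_PER_SIMD ? waves : MAX_WAVES_PER_SIMD;
}

/* Occupancy as a percentage of the maximum wavefront count. */
int occupancy_percent(int vgprs)
{
    return 100 * waves_per_simd(vgprs) / MAX_WAVES_PER_SIMD;
}
```

Under these assumptions, a kernel needs at most 24 VGPRs to reach 100% occupancy, while 41 to 48 VGPRs gives 50%. If the model holds, merging the two kernels into one device-enqueuing kernel may have roughly doubled the register count.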

                           

                          Are there any new insights into why device-side enqueue is so much slower?

                           

                          Best Regards,

                          markus