
Enqueue from device (Dynamic Parallelism) seems to be slower than invoking kernels from the host?

Question asked by kevinmanhn on Jan 7, 2015
Latest reply on Mar 9, 2015 by markuswirth

Dear all,

I have been testing the new feature of OpenCL 2.0 - Dynamic Parallelism (kernel enqueue directly from the device). I ran the tests on Ubuntu 14.04 on a machine with an APU 7850K, which is compatible with OpenCL 2.0. I have two test versions: in version 1 several kernels are invoked from the host; version 2 is the same, except that the kernels are enqueued directly from the GPU device. Everything worked fine, and I measured the execution time (wall clock time) of those kernels.

I expected version 2 to perform better than version 1. However, it turned out that version 2 (device enqueue) ran slower than version 1 at every matrix size.

Here are my results:

CL_QUEUE_SIZE is 131072
Using platform: Advanced Micro Devices, Inc. and device: Hawaii.
OpenCL Device info:
CL_DEVICE_VENDOR            :Advanced Micro Devices, Inc.
CL_DEVICE_NAME                :Hawaii
CL_DRIVER_VERSION            :1642.5 (VM)
CL_DEVICE_PROFILE            :FULL_PROFILE
CL_DEVICE_VERSION            :OpenCL 2.0 AMD-APP (1642.5)
CL_DEVICE_OPENCL_C_VERSION        :OpenCL C 2.0
CL_DEVICE_MAX_COMPUTE_UNITS        :      44
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS    :       3
CL_DEVICE_MAX_WORK_ITEM_SIZES        :    (  256,   256,   256)%
CL_DEVICE_MAX_WORK_GROUP_SIZE        :     256
CL_DEVICE_MEM_BASE_ADDR_ALIGN        :    2048
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE    :     128
CL_DEVICE_MAX_CLOCK_FREQUENCY        :    1030
CL_DEVICE_LOCAL_MEM_SIZE        :     32768.0
CL_DEVICE_MAX_MEM_ALLOC_SIZE        :2503999488.0


Version 1 - Host Enqueue

Execution Time = 0.001694 s of Matrix size = 128x128
Execution Time = 0.001822 s of Matrix size = 256x256
Execution Time = 0.004094 s of Matrix size = 512x512
Execution Time = 0.010669 s of Matrix size = 1024x1024
Execution Time = 0.017005 s of Matrix size = 2048x2048
Execution Time = 0.031004 s of Matrix size = 4096x4096


Version 2 - Device Enqueue
Execution Time = 0.005847 s of Matrix size = 128x128
Execution Time = 0.002678 s of Matrix size = 256x256
Execution Time = 0.006043 s of Matrix size = 512x512
Execution Time = 0.012492 s of Matrix size = 1024x1024
Execution Time = 0.038608 s of Matrix size = 2048x2048
Execution Time = 0.071338 s of Matrix size = 4096x4096
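
To compare the two tables directly, here is a small C sketch that computes the device/host time ratio for each matrix size. The timings are copied from the results above; the array and function names are only illustrative and are not taken from the test program.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_SIZES 6

/* Matrix edge lengths and wall-clock timings in seconds,
 * copied from the two result tables above. */
static const int matrix_size[NUM_SIZES] = {128, 256, 512, 1024, 2048, 4096};
static const double host_time[NUM_SIZES] = {
    0.001694, 0.001822, 0.004094, 0.010669, 0.017005, 0.031004};
static const double device_time[NUM_SIZES] = {
    0.005847, 0.002678, 0.006043, 0.012492, 0.038608, 0.071338};

/* Ratio of device-enqueue time to host-enqueue time for entry i.
 * A value above 1.0 means device enqueue was slower. */
static double slowdown(int i)
{
    assert(i >= 0 && i < NUM_SIZES);
    return device_time[i] / host_time[i];
}

/* Print one line per matrix size with the device/host ratio. */
static void print_slowdowns(void)
{
    for (int i = 0; i < NUM_SIZES; ++i) {
        printf("%4dx%-4d  device/host = %.2f\n",
               matrix_size[i], matrix_size[i], slowdown(i));
    }
}
```

Calling print_slowdowns() from main prints the ratio for each size. With the numbers above, the ratio is greater than 1 in every case, from about 1.17 at 1024x1024 up to about 3.45 at 128x128.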


Can anyone please explain why?

Thank you.

Kevin.

