Dear all,
I have been testing the new OpenCL 2.0 feature, dynamic parallelism (enqueuing kernels directly from the device). I ran the tests on Ubuntu 14.04 on a machine with a 7850K APU, which is compatible with OpenCL 2.0. I have two test versions: in version 1, several kernels are enqueued from the host; version 2 is the same except that the kernels are enqueued directly from the GPU device. Everything worked fine, and I measured the execution time (wall-clock time) of those kernels in both versions.
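For anyone who wants to reproduce the measurement, wall-clock timing of this kind can be sketched as below. This is only an illustration under assumptions, not the exact code behind the numbers in this post: the helper names `now_monotonic` and `elapsed_seconds` are hypothetical, and it assumes a POSIX system that provides `clock_gettime` with `CLOCK_MONOTONIC`.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: read the monotonic clock (assumes POSIX clock_gettime). */
static struct timespec now_monotonic(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts;
}

/* Hypothetical helper: difference between two readings, in seconds. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

The idea is to take one reading before the first kernel is enqueued and a second one only after the host has waited for all work to complete (for example with `clFinish` on the host command queue), so that the interval covers the child kernels launched from the device as well.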
I expected version 2 to perform better than version 1. However, it turned out that version 2 (device enqueue) ran slower than version 1.
Here are my results:
CL_QUEUE_SIZE is 131072
Using platform: Advanced Micro Devices, Inc. and device: Hawaii.
OpenCL Device info:
CL_DEVICE_VENDOR :Advanced Micro Devices, Inc.
CL_DRIVER_VERSION :1642.5 (VM)
CL_DEVICE_VERSION :OpenCL 2.0 AMD-APP (1642.5)
CL_DEVICE_OPENCL_C_VERSION :OpenCL C 2.0
CL_DEVICE_MAX_COMPUTE_UNITS : 44
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES : ( 256, 256, 256)
CL_DEVICE_MAX_WORK_GROUP_SIZE : 256
CL_DEVICE_MEM_BASE_ADDR_ALIGN : 2048
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE : 128
CL_DEVICE_MAX_CLOCK_FREQUENCY : 1030
CL_DEVICE_LOCAL_MEM_SIZE : 32768.0
Version 1 - Host Enqueue
Execution Time = 0.001694 s of Matrix size = 128x128
Execution Time = 0.001822 s of Matrix size = 256x256
Execution Time = 0.004094 s of Matrix size = 512x512
Execution Time = 0.010669 s of Matrix size = 1024x1024
Execution Time = 0.017005 s of Matrix size = 2048x2048
Execution Time = 0.031004 s of Matrix size = 4096x4096
Version 2 - Device Enqueue
Execution Time = 0.005847 s of Matrix size = 128x128
Execution Time = 0.002678 s of Matrix size = 256x256
Execution Time = 0.006043 s of Matrix size = 512x512
Execution Time = 0.012492 s of Matrix size = 1024x1024
Execution Time = 0.038608 s of Matrix size = 2048x2048
Execution Time = 0.071338 s of Matrix size = 4096x4096
Can anyone please explain why device enqueue is slower here?