
kevinmanhn
Journeyman III

Enqueue from device (dynamic parallelism) seems to be slower than kernel invocation from the host?

Dear all,

I have been testing the new OpenCL 2.0 feature dynamic parallelism (kernel enqueue directly from the device). I ran the test on Ubuntu 14.04 on an APU 7850K machine, which is compatible with OpenCL 2.0. I have two test versions: version 1 has several kernels that are enqueued from the host; version 2 is similar, but the kernels are enqueued directly from the GPU device. Everything worked fine and I obtained the execution time (measured as wall-clock time) for those kernels.

I expected version 2 to perform better than version 1. However, it turned out that version 2 (device enqueue) ran slower than version 1.
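For reference, version 2 follows the usual OpenCL 2.0 device-enqueue shape sketched below. This is a simplified sketch, not my actual kernels: the kernel name, the placeholder element-wise work inside the child block, and the parameter names are illustrative only. Version 1 issues the equivalent launch with clEnqueueNDRangeKernel from the host.

/* Simplified sketch of version 2: a parent kernel, launched with a
 * single work-item, submits the real work from the device itself.   */
kernel void parent(global float *M, int n)    /* n = width * height */
{
    if (get_global_id(0) != 0)
        return;                               /* one work-item does the launch */

    queue_t   q  = get_default_queue();
    ndrange_t nd = ndrange_1D((size_t)n);

    int err = enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, nd,
        ^{
            /* placeholder element-wise work on the matrix */
            size_t i = get_global_id(0);
            M[i] = M[i] * 2.0f;
        });
    /* err should be CLK_SUCCESS; anything else means the child launch was rejected */
}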

Here is my result:

CL_QUEUE_SIZE is 131072
Using platform: Advanced Micro Devices, Inc. and device: Hawaii.
OpenCL Device info:
CL_DEVICE_VENDOR            :Advanced Micro Devices, Inc.
CL_DEVICE_NAME                :Hawaii
CL_DRIVER_VERSION            :1642.5 (VM)
CL_DEVICE_PROFILE            :FULL_PROFILE
CL_DEVICE_VERSION            :OpenCL 2.0 AMD-APP (1642.5)
CL_DEVICE_OPENCL_C_VERSION        :OpenCL C 2.0
CL_DEVICE_MAX_COMPUTE_UNITS        :      44
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS    :       3
CL_DEVICE_MAX_WORK_ITEM_SIZES        :    (  256,   256,   256)%
CL_DEVICE_MAX_WORK_GROUP_SIZE        :     256
CL_DEVICE_MEM_BASE_ADDR_ALIGN        :    2048
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE    :     128
CL_DEVICE_MAX_CLOCK_FREQUENCY        :    1030
CL_DEVICE_LOCAL_MEM_SIZE        :     32768.0
CL_DEVICE_MAX_MEM_ALLOC_SIZE        :2503999488.0


Version 1 - Host Enqueue

Execution Time = 0.001694 s of Matrix size = 128x128
Execution Time = 0.001822 s of Matrix size = 256x256
Execution Time = 0.004094 s of Matrix size = 512x512
Execution Time = 0.010669 s of Matrix size = 1024x1024
Execution Time = 0.017005 s of Matrix size = 2048x2048
Execution Time = 0.031004 s of Matrix size = 4096x4096


Version 2 - Device Enqueue
Execution Time = 0.005847 s of Matrix size = 128x128
Execution Time = 0.002678 s of Matrix size = 256x256
Execution Time = 0.006043 s of Matrix size = 512x512
Execution Time = 0.012492 s of Matrix size = 1024x1024
Execution Time = 0.038608 s of Matrix size = 2048x2048
Execution Time = 0.071338 s of Matrix size = 4096x4096


Can anyone please explain why?

Thank you.

Kevin.


dipak
Big Boss

Hi Kevin,

In general, it is expected that applications should see better performance using device-side enqueue than multiple host-side enqueues. Could you please share your test case (host + device code) with us?

By the way, what exactly is your test environment? You mentioned an APU 7850K, but the clinfo output above shows Hawaii. Do you see a similar observation on another setup, say with another GPU, another version of the OpenCL 2.0 driver, or on Windows?

Regards,

Hi,

I have a simple piece of code to test device enqueue performance.

I ran it on Ubuntu 14.04, Spectre, OpenCL 2.0, APU 7850K, 16 GB RAM.

The test adds two arrays, C = A + B. Each kernel computes one element of the array: C[iteration] = A[iteration] + B[iteration].
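While the attachment is being reviewed, the device-enqueue side follows roughly the pattern sketched below (a simplified sketch, not the attached code; the kernel name add_parent and the argument names are placeholders): a single parent work-item launches one single-work-item child per element, 1000 times. The host-enqueue version issues the corresponding 1000 launches with clEnqueueNDRangeKernel from a CPU loop.

/* Sketch of the device-enqueue test: 1000 child launches, each child
 * being a single work-item that adds one element.                    */
kernel void add_parent(global const float *A,
                       global const float *B,
                       global float       *C,
                       int                 count)    /* 1000 in this test */
{
    if (get_global_id(0) != 0)
        return;

    queue_t q = get_default_queue();

    for (int i = 0; i < count; ++i) {
        int err = enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT,
                                 ndrange_1D(1),       /* one work-item per child */
                                 ^{ C[i] = A[i] + B[i]; });
        if (err != CLK_SUCCESS)
            break;                 /* device queue full or launch rejected */
    }
}

Note that each child here is a one-work-item dispatch, so the per-launch overhead dominates the tiny amount of arithmetic.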

The results are as follows:

CL_QUEUE_SIZE is 131072
Using platform: Advanced Micro Devices, Inc. and device: Spectre.
OpenCL Device info:
CL_DEVICE_VENDOR            :Advanced Micro Devices, Inc.
CL_DEVICE_NAME                :Spectre
CL_DRIVER_VERSION            :1642.5 (VM)
CL_DEVICE_PROFILE            :FULL_PROFILE
CL_DEVICE_VERSION            :OpenCL 2.0 AMD-APP (1642.5)
CL_DEVICE_OPENCL_C_VERSION        :OpenCL C 2.0
CL_DEVICE_MAX_COMPUTE_UNITS        :       8
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS    :       3
CL_DEVICE_MAX_WORK_ITEM_SIZES        :    (  256,   256,   256)%
CL_DEVICE_MAX_WORK_GROUP_SIZE        :     256
CL_DEVICE_MEM_BASE_ADDR_ALIGN        :    2048
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE    :     128
CL_DEVICE_MAX_CLOCK_FREQUENCY        :     720
CL_DEVICE_LOCAL_MEM_SIZE        :     32768.0
CL_DEVICE_MAX_MEM_ALLOC_SIZE        : 211550208.0

Test host enqueue
Execution time = 0.005345 s for 1000 host enqueue times
The result is correct
====================================================================
Test device enqueue
Execution time = 0.247045 s for 1000 device enqueue times
The result is correct
End of program

Please see the attached code and explain why the device-enqueue version is slower than the host-enqueue version.

Thank you.

Kevin.


Hi,

I also tested the RegionGrowingSegmentation sample from AMD on my computer: Ubuntu 14.04, Spectre, OpenCL 2.0, APU 7850K, 16 GB RAM.

The result shows that device enqueue runs slower than host enqueue. I am doing research related to OpenCL dynamic parallelism, so please let me know why the new feature runs slower than the old one.

Thank you,

Kevin

Here is the result:

Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Spectre Device ID is 0x15a8a20
Build Options are : -I. -cl-std=CL2.0
Device enqueue of kernels...

| Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
|-------|--------|------------------------|------------|
| 480   | 480    | 0.529234               | 435346     |

kevin@kevinPC:~/RegionGrowingSegmentation$ ./RegionGrowingSegmentation -t -o

Platform 0 : Advanced Micro Devices, Inc.
Image Size [Width = 480 , Height = 480]
Platform found : Advanced Micro Devices, Inc.
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Spectre Device ID is 0x1b76a20
Build Options are : -I. -cl-std=CL2.0
Host enqueue of kernels...

| Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
|-------|--------|------------------------|------------|
| 480   | 480    | 0.486141               | 473936     |


Hi,
I also tested the Region Growing sample on Windows 8.1, on an A10-7850K (3.7 GHz) processor with 16 GB of RAM. It turned out that device enqueue and host enqueue have similar performance. However, in the article (http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/), device enqueue is about 3 times faster.

Could you please explain this?
Thank you,

Kevin.

Here are the results:

Platform 0 : Advanced Micro Devices, Inc.
Image Size [Width = 480 , Height = 480]
Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Spectre Device ID is 000000DAC47C3A10
Build Options are : -I. -cl-std=CL2.0
Host enqueue of kernels…

| Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
|-------|--------|------------------------|------------|
| 480   | 480    | 0.524794               | 439029     |

////////////////////////////////////////////////////////////////////////////////////////////////////////

Platform 0 : Advanced Micro Devices, Inc.
Image Size [Width = 480 , Height = 480]
Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Spectre Device ID is 00000081609E3A00
Build Options are : -I. -cl-std=CL2.0
Device enqueue of kernels…

| Width | Height | Avg. Kernel Time (sec) | Pixels/sec |
|-------|--------|------------------------|------------|
| 480   | 480    | 0.524605               | 439187     |


Thanks for sharing the code. I'll check and get back to you.

Regards,


Hi,

I think I have a very similar problem.

During the last week, I tried to optimize an OpenCL application. The application (Connected Component Labeling) has to enqueue two kernels (one for doing some work and another to check for changes) until a status bit is set.

At this point I tried to use the device-side enqueue feature introduced by OpenCL 2.0 to allow the GPU (Radeon R9 290 on Windows 7, with the current Omega drivers and AMD APP SDK 3.0 beta) to enqueue the kernels by itself until the work is finished.

The device-side enqueue version is now working fine and produces correct results, but while profiling it I noticed that it is much slower than the GPU-CPU version.
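Roughly, the device-side control loop looks like the sketch below (heavily simplified, not my real kernels: label_pass and ccl_controller are stand-ins, and the real code does more per pass). A single-work-item controller enqueues one pass over the image and then re-enqueues itself once that pass has completed, stopping when the change flag stays clear, so the CPU never has to intervene.

/* Simplified sketch of a self-re-enqueuing CCL control loop.        */
void label_pass(global int *labels, global atomic_int *changed, int w, int h)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int idx = y * w + x;

    /* placeholder 4-neighbourhood label propagation */
    int best = labels[idx];
    if (x > 0     && labels[idx - 1] < best) best = labels[idx - 1];
    if (x < w - 1 && labels[idx + 1] < best) best = labels[idx + 1];
    if (y > 0     && labels[idx - w] < best) best = labels[idx - w];
    if (y < h - 1 && labels[idx + w] < best) best = labels[idx + w];

    if (best < labels[idx]) {
        labels[idx] = best;
        atomic_store(changed, 1);        /* another pass will be needed */
    }
}

kernel void ccl_controller(global int *labels, global atomic_int *changed,
                           int w, int h, int iter)
{
    /* stop once a full pass made no changes */
    if (iter > 0 && atomic_load(changed) == 0)
        return;
    atomic_store(changed, 0);

    queue_t q = get_default_queue();
    size_t  gsz[2] = { (size_t)w, (size_t)h };

    /* one full pass over the image */
    clk_event_t pass_done;
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_2D(gsz),
                   0, NULL, &pass_done,
                   ^{ label_pass(labels, changed, w, h); });

    /* re-run this controller once the pass has finished */
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange_1D(1),
                   1, &pass_done, NULL,
                   ^{ ccl_controller(labels, changed, w, h, iter + 1); });
    release_event(pass_done);
}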

I also noticed in CodeXL that the occupancy of the new kernel in the device-side enqueue version is much lower (50%) than the occupancy of the two separate kernels scheduled in a loop by the CPU (100% each). CodeXL tells me this is caused by limited waves due to high VGPR usage.

Are there any new insights into why the device-side enqueue feature is so much slower?

Best Regards,

markus