
ravikeshri
Journeyman III

Command launch overhead

How many commands should be queued in the command queue to minimize the command launch overhead?

Thanks,

Ravi

himanshu_gautam
Grandmaster

There is no limit on the number of commands in a command queue; it is restricted only by system memory.

If you exceed that limit, you are likely to get a CL_OUT_OF_HOST_MEMORY error.

I hope that is clear.


Himanshu, thanks.

I understand we can enqueue as many commands as we want (given that system memory is available). However, I was experiencing a delay in the kernel launch time (Start - Queue). The programming guide says, "To reduce the launch overhead, the AMD OpenCL runtime combines several command submissions into a batch."

I wanted to know this batch size, so that I can enqueue enough kernels to minimize the delay as a workaround.
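For reference, this is roughly how I read the Start - Queue gap from event profiling (a minimal sketch only; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and the kernel is already built):

// Minimal sketch: reading the enqueue-to-start gap from event profiling.
// Assumes 'queue' was created with CL_QUEUE_PROFILING_ENABLE.
#include <CL/cl.hpp>
#include <cstddef>
#include <iostream>

void report_launch_overhead(cl::CommandQueue& queue, cl::Kernel& kernel,
                            std::size_t global_size)
{
    cl::Event evt;
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(global_size), cl::NullRange,
                               NULL, &evt);
    evt.wait();

    cl_ulong queued = evt.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
    cl_ulong start  = evt.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong end    = evt.getProfilingInfo<CL_PROFILING_COMMAND_END>();

    // Launch overhead is the QUEUED-to-START gap; profiling values are nanoseconds.
    std::cout << "Start - Queue: " << (start - queued) / 1000 << " us, "
              << "execution: "     << (end - start)    / 1000 << " us\n";
}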

Meteorhead
Challenger

Again, let me recycle a topic whose name is relevant to my problem. I would test my application with CodeXL to see what is happening, but there is no VS2012 plug-in available yet (eagerly awaiting the update). Therefore I ask for a little input from the community:

I have updated my ongoing simple sample project, which uses multi-device calculations with sub-buffers and single host-thread control via the 2.8 cl.hpp headers, but there are some strange time readings whose cause I cannot grasp. I have guesses, but nothing documented (at least that I know of) explains them:

The sample ought to work with multiple platforms installed, but I have not tested that yet. A single AMD platform works, with the aforementioned glitches. I enumerate all devices, create a single context to hold them all, create a program for the context, compile it, and create the kernel; then I initialize two simple datasets for input, allocate one for output, and create disjoint (non-overlapping) sub-buffers for each device, all of this on a per-platform basis. This should suffice for proper multi-device usage even with Intel+NV compilers installed, or AMD+NV mixed. (Intel+AMD currently makes the mistake of using the CPU twice, hence the simplicity of the sample.) Once the sub-buffers are created, they are migrated to their respective devices to avoid moving data for the initial kernel launch. (I am aware that this is not the intended aim of migration, but this is a test application meant for understanding what happens under the hood.)
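Roughly, the per-device region setup looks like this (a trimmed sketch, not my actual code; make_device_region, bytes_per_device and the like are illustrative names):

#include <CL/cl.hpp>
#include <cstddef>
#include <vector>

cl::Buffer make_device_region(cl::Buffer& parent, cl::CommandQueue& queue,
                              std::size_t device_index, std::size_t bytes_per_device)
{
    // Disjoint, non-overlapping regions; the origin must also respect
    // CL_DEVICE_MEM_BASE_ADDR_ALIGN on the target device.
    cl_buffer_region region;
    region.origin = device_index * bytes_per_device;
    region.size   = bytes_per_device;

    cl::Buffer sub = parent.createSubBuffer(CL_MEM_READ_WRITE,
                                            CL_BUFFER_CREATE_TYPE_REGION,
                                            &region);

    // OpenCL 1.2 migration: move the region to this queue's device up
    // front, so the initial kernel launch finds its data in place.
    std::vector<cl::Memory> objs(1, sub);
    queue.enqueueMigrateMemObjects(objs, 0); // flags == 0: migrate to the queue's device
    return sub;
}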

Devices in my notebook are a Core i5-430M and a Mobility HD5870 GPU.

  • There is an initial kernel launch, which launches kernels in parallel and waits for all of them to finish in a single call to cl::WaitForEvents. (The seemingly multiple calls are due to the fact that cl::Events originating from different platforms most certainly cannot be synchronized; with just the AMD platform, this results in a single call.) Here the GPU falls greatly behind the CPU in execution time, although the data should already be present on the device. I guess this is because the data might be present but not the kernel, which must initially be submitted to the device. Even if that were the case, I would expect this overhead to be invisible when queried through cl::Event END-START times.
  • The second kernel launch is made separately: one device executes, the host waits for completion, and only then does it start kernels on subsequent devices. In this case the timings turn out as expected. (Both launch patterns are sketched in code after this list.)
  • The third kernel launch is identical to the first, but this time the GPU finishes in the same amount of time as the CPU, despite the fact that they should be working concurrently. I cannot tell if this is a sync issue or whether hidden memory movement is involved. The first I highly doubt, and the latter should not happen, as the sub-buffers do not overlap. If I had to guess, I would say that when the kernel functor sets its arguments as it is called, it resets some state of the buffers, which results in unwanted memory movement.
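For clarity, the two launch patterns look roughly like this (a sketch with illustrative identifiers; queues, kernels and work_per_device stand in for my actual variables):

#include <CL/cl.hpp>
#include <cstddef>
#include <vector>

void launch_parallel(std::vector<cl::CommandQueue>& queues,
                     std::vector<cl::Kernel>& kernels,
                     std::size_t work_per_device)
{
    // Launches 1 and 3: enqueue on every queue first, then one wait for all.
    std::vector<cl::Event> events(queues.size());
    for (std::size_t i = 0; i < queues.size(); ++i)
        queues[i].enqueueNDRangeKernel(kernels[i], cl::NullRange,
                                       cl::NDRange(work_per_device), cl::NullRange,
                                       NULL, &events[i]);
    cl::WaitForEvents(events);
}

void launch_separately(std::vector<cl::CommandQueue>& queues,
                       std::vector<cl::Kernel>& kernels,
                       std::size_t work_per_device)
{
    // Launch 2: each device runs to completion before the next one starts.
    for (std::size_t i = 0; i < queues.size(); ++i) {
        queues[i].enqueueNDRangeKernel(kernels[i], cl::NullRange,
                                       cl::NDRange(work_per_device), cl::NullRange);
        queues[i].finish();
    }
}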

Here are the timings from my notebook:

Found platform: Advanced Micro Devices, Inc.

        Found device: Juniper

        Found device: Intel(R) Core(TM) i5 CPU       M 430  @ 2.27GHz

Loading kernel file.

Building kernels... done.

Initializing input1_vector... done.

Initializing input2_vector... done.

Creating region: origin=0       end=4194304

Creating region: origin=4194304 end=8388608

Initial kernel launching.

Finished!

Operation took: 70103 microseconds.

Juniper finished in 67864 microseconds.

Intel(R) Core(TM) i5 CPU       M 430  @ 2.27GHz finished in 21560 microseconds.

Launching kernels seperately.

Finished!

Operation took: 22022 microseconds.

Juniper finished in 1074 microseconds.

Intel(R) Core(TM) i5 CPU       M 430  @ 2.27GHz finished in 19860 microseconds.

Launching kernels simultanously.

Finished!

Operation took: 22021 microseconds.

Juniper finished in 21714 microseconds.

Intel(R) Core(TM) i5 CPU       M 430  @ 2.27GHz finished in 21826 microseconds.

Could someone tell me what is going wrong behind the scenes? Why is the first launch so long (even measuring solely execution time)? Why does the third launch result in near-identical execution times on both devices when it is made the same way as the first launch?


There are several problems with CodeXL and profiling in OpenCL. When I tested two queues on a single device, CodeXL reported that they were executed in parallel, and the reported start and end times of kernel execution were clearly wrong. So profiling may not be reliable.


Thank you, nou. So for now the safest (and easiest) thing to do is not to trust the profiling?


Strangely, my Linux box shows more predictable (and reasonable) runtimes.

[user@gpu001 bin]$ ./main.x86_64

Found platform: Advanced Micro Devices, Inc.

        Found device: Cypress

        Found device: Cypress

        Found device: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz

Loading kernel file.

Building kernels... done.

Initializing input1_vector... done.

Initializing input2_vector... done.

Creating region: origin=0       end=4194304

Creating region: origin=4194304 end=8388608

Creating region: origin=8388608 end=12582912

Initial kernel launching.

Finished!

Operation took: 86537 microseconds.

Cypress finished in 71020 microseconds.

Cypress finished in 71151 microseconds.

Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz finished in 8230 microseconds.

Launching kernels seperately.

Finished!

Operation took: 5885 microseconds.

Cypress finished in 459 microseconds.

Cypress finished in 460 microseconds.

Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz finished in 4293 microseconds.

Launching kernels simultanously.

Finished!

Operation took: 4601 microseconds.

Cypress finished in 462 microseconds.

Cypress finished in 456 microseconds.

Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz finished in 4454 microseconds.

[user@gpu001 bin]$

Both the Windows notebook and the Linux node use Catalyst 12.10 and SDK 2.8.


It looks like the "first" run on an OpenCL device takes more time, possibly due to some initialization.

That's just my guess.

On your Linux run as well, the Intel device takes almost double the time of the other runs; that's the rationale behind my guess.


Hi Meteorhead,

Can you please try these suggestions, which might help:

1. Do not use event profiling info to measure the time taken by kernels; it may be unreliable. Instead, use standard system timers.

2. Run the kernels for several iterations (instead of just once) to get more reliable timings.

3. To nullify the effect of data transfer and GPU warm-up, make a dummy kernel call before starting your tests.
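Something along these lines, for example (a rough sketch only; queue, kernel and global are placeholders, and it uses the C++11 <chrono> clock as the "standard system timer"):

#include <CL/cl.hpp>
#include <chrono>
#include <iostream>

void time_kernel(cl::CommandQueue& queue, cl::Kernel& kernel, cl::NDRange global)
{
    // 3. Dummy call first, so driver initialization and data transfer
    //    do not pollute the measurement.
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, cl::NullRange);
    queue.finish();

    // 2. Average over several iterations for more reliable numbers.
    const int iterations = 100;
    std::chrono::high_resolution_clock::time_point t0 =
        std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, cl::NullRange);
    queue.finish();
    std::chrono::high_resolution_clock::time_point t1 =
        std::chrono::high_resolution_clock::now();

    // 1. A host-side timer instead of cl::Event profiling info.
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << "Average: " << us / iterations << " us per launch" << std::endl;
}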
