How many commands should be queued in the command queue to minimize the command launch overhead?
There is no limit on the number of commands in a command queue; it is only restricted by system memory. If you exceed that limit you are likely to get a CL_OUT_OF_HOST_MEMORY error.
I hope that is clear.
I understand we can enqueue as many commands as we want (given that system memory is available). However, I was experiencing a delay in kernel launch time (Start - Queue). The programming guide says: "To reduce the launch overhead, the AMD OpenCL runtime combines several command submissions into a batch."
I wanted to know this batch size, so that I can enqueue enough kernels to minimize the delay as a workaround.
Again, let me recycle a topic with a name relevant to my problem. I would test my application with CodeXL to see what's happening, but there is no VS2012 plug-in available yet (eagerly awaiting the update). Therefore I ask for a little input from the community:
I have updated my ongoing simple sample project making use of multi-device calculations with sub-buffers and single host thread control with the 2.8 cl.hpp headers, but I am seeing some strange timing readings that I cannot explain. I have guesses, but the behavior is not documented anywhere (at least as far as I know):
The sample ought to work with multiple platforms installed, but I have not tested that yet; a single AMD platform works with the aforementioned glitches. I enumerate all devices, create a single context to hold them all, create a program for the context, compile it, create the kernel, then initialize two simple datasets for input and allocate one for output, and create disjoint (non-overlapping) sub-buffers for each device, all of this on a per-platform basis. All this should suffice for proper multi-device usage even with Intel+NV compilers installed, or AMD+NV mixed. (Intel+AMD right now makes the mistake of using the CPU twice, hence the simplicity of the sample.) Once the sub-buffers are created, they are migrated to their respective devices to avoid moving data for the initial kernel launch. (I am aware that this is not the intended aim of migration, but this is a test application meant for understanding what happens under the hood.)
Devices in my notebook are a Core-i5 430M and a Mobility HD5870 GPU.
Here are the timings from my notebook:
Found platform: Advanced Micro Devices, Inc.
Found device: Juniper
Found device: Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz
Loading kernel file.
Building kernels... done.
Initializing input1_vector... done.
Initializing input2_vector... done.
Creating region: origin=0 end=4194304
Creating region: origin=4194304 end=8388608
Initial kernel launching.
Operation took: 70103 microseconds.
Juniper finished in 67864 microseconds.
Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz finished in 21560 microseconds.
Launching kernels separately.
Operation took: 22022 microseconds.
Juniper finished in 1074 microseconds.
Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz finished in 19860 microseconds.
Launching kernels simultaneously.
Operation took: 22021 microseconds.
Juniper finished in 21714 microseconds.
Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz finished in 21826 microseconds.
Could someone tell me what is going on behind the scenes? Why is the first launch so long (even when measuring execution time alone)? And why does the third launch result in both devices reporting nearly identical, inflated execution times when it is issued the same way as the first?
There are several problems with CodeXL and profiling in OpenCL. When I tested two queues on a single device, CodeXL reported that they were executed in parallel, and the reported start and end times of kernel execution were clearly wrong. So profiling may not be reliable.
Thank you nou. So for now the safest (and easiest) thing to do is not to believe profiling?
Strangely, my Linux box shows more predictable (and reasonable) runtimes.
[user@gpu001 bin]$ ./main.x86_64
Found device: Cypress
Found device: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz
Creating region: origin=8388608 end=12582912
Operation took: 86537 microseconds.
Cypress finished in 71020 microseconds.
Cypress finished in 71151 microseconds.
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz finished in 8230 microseconds.
Operation took: 5885 microseconds.
Cypress finished in 459 microseconds.
Cypress finished in 460 microseconds.
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz finished in 4293 microseconds.
Operation took: 4601 microseconds.
Cypress finished in 462 microseconds.
Cypress finished in 456 microseconds.
Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz finished in 4454 microseconds.
Both the Windows notebook and the Linux node use Catalyst 12.10 and SDK 2.8.
It looks like the "first" run on an OpenCL device takes more time, possibly due to some initialization. That's just my guess. On your Linux run as well, the Intel device takes almost double the time on the first run compared to the other runs; that's the rationale behind my guess.
Can you please try these suggestions, which might help:
1. Do not use event profiling info for measuring the time taken by kernels; it may be unreliable. Use standard system timers instead.
2. Run the kernels for several iterations (instead of just once) to get more reliable timings.
3. To nullify the effect of data transfer and GPU warm-up, issue a dummy kernel call before starting your tests.