12 Replies Latest reply on May 29, 2012 1:09 PM by sh

    Long time between enqueue and execute on GPU

    thesmileman

The profiler shows that it takes about 0.2 ms from the time clEnqueueNDRangeKernel returns until the actual kernel starts. This really adds up. I know you should queue up lots of kernels, but that often isn't an option. Is this going to be fixed anytime soon?

        • Re: Long time between enqueue and execute on GPU
          jeff_golds

There will be a delay unless you flush the commands to the GPU immediately. That means issuing a blocking call such as clFinish(). The actual batch overhead is less than 60 us on most systems; per-dispatch overhead (i.e. each clEnqueueNDRangeKernel call) is less than 8 us.
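Jeff's figures suggest a quick back-of-the-envelope model of why batching matters. A minimal sketch, taking his ~60 us per-batch and ~8 us per-dispatch costs as model constants (illustrative numbers, not a measurement of any particular driver):

```c
#include <assert.h>

/* Model constants from the figures quoted above: ~60 us of overhead
   per batch flushed to the GPU, ~8 us per enqueued kernel. */
#define BATCH_OVERHEAD_US    60L
#define DISPATCH_OVERHEAD_US  8L

/* Total overhead when every kernel is flushed in its own batch
   (e.g. a clFinish() after every clEnqueueNDRangeKernel). */
static long overhead_flush_each_us(long kernels)
{
    return kernels * (BATCH_OVERHEAD_US + DISPATCH_OVERHEAD_US);
}

/* Total overhead when all kernels are queued first, then flushed once. */
static long overhead_single_batch_us(long kernels)
{
    return BATCH_OVERHEAD_US + kernels * DISPATCH_OVERHEAD_US;
}
```

Under this model, 25 kernels cost 1700 us of overhead when flushed one at a time but only 260 us as a single batch, which is why queueing lots of work before a flush is the recommended pattern.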

          • Re: Long time between enqueue and execute on GPU
            krrishnarraj

Just to add to this, I face the same problem. I have a compute-intensive kernel. Previously I profiled it using the clock_gettime() function on Linux; then I started using clGetEventProfilingInfo(). There is a huge timing difference depending on whether I use CL_PROFILING_COMMAND_SUBMIT or CL_PROFILING_COMMAND_START as the start time. The timing dump is as follows:

             

            Using:

            err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &startTime, NULL );

            err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &stopTime, NULL );

             

Iteration 1 takes 8.625875 ms
Iteration 2 takes 10.488933 ms
Iteration 3 takes 8.605963 ms
Iteration 4 takes 8.602357 ms
Iteration 5 takes 12.417702 ms
Iteration 6 takes 8.600140 ms
Iteration 7 takes 8.576233 ms
Iteration 8 takes 12.780842 ms
Iteration 9 takes 8.640809 ms
Iteration 10 takes 8.582085 ms
Iteration 11 takes 14.690402 ms
Iteration 12 takes 8.643072 ms
Iteration 13 takes 8.581202 ms
Iteration 14 takes 14.774370 ms
Iteration 15 takes 8.648169 ms
Iteration 16 takes 17.066158 ms
Iteration 17 takes 8.671902 ms
Iteration 18 takes 8.562560 ms
Iteration 19 takes 14.703254 ms
Iteration 20 takes 8.661131 ms
Iteration 21 takes 8.571542 ms
Iteration 22 takes 13.148671 ms
Iteration 23 takes 8.639865 ms
Iteration 24 takes 8.591887 ms
Iteration 25 takes 14.783643 ms

             

            Using:

            err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &startTime, NULL );

            err = clGetEventProfilingInfo( timeEvent, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &stopTime, NULL );

             

Iteration 1 takes 7.775704 ms
Iteration 2 takes 7.769630 ms
Iteration 3 takes 7.784592 ms
Iteration 4 takes 7.757333 ms
Iteration 5 takes 7.775111 ms
Iteration 6 takes 7.769630 ms
Iteration 7 takes 7.754963 ms
Iteration 8 takes 7.765926 ms
Iteration 9 takes 7.762370 ms
Iteration 10 takes 7.785481 ms
Iteration 11 takes 7.768297 ms
Iteration 12 takes 7.769778 ms
Iteration 13 takes 7.781185 ms
Iteration 14 takes 7.763111 ms
Iteration 15 takes 7.776444 ms
Iteration 16 takes 7.775704 ms
Iteration 17 takes 7.766667 ms
Iteration 18 takes 7.767852 ms
Iteration 19 takes 7.769629 ms
Iteration 20 takes 7.762222 ms
Iteration 21 takes 7.760149 ms
Iteration 22 takes 7.775259 ms
Iteration 23 takes 7.747111 ms
Iteration 24 takes 7.766815 ms
Iteration 25 takes 7.770667 ms

             

             

I know there should be a delay in kernel setup, but I expected it to be in microseconds. Here the difference is almost a millisecond.

Also, what is the reason for the bursty timings in the first case?
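The delay being asked about can be estimated directly from the two dumps posted above. A minimal sketch in plain C, assuming the start-to-end times are stable enough (as the second dump shows) that subtracting them per iteration approximates the submit-to-start queueing delay:

```c
#include <assert.h>

/* The two timing sets posted above, in ms. submit_to_end used
   CL_PROFILING_COMMAND_SUBMIT..END; start_to_end used
   CL_PROFILING_COMMAND_START..END. */
static const double submit_to_end[25] = {
    8.625875, 10.488933,  8.605963,  8.602357, 12.417702,
    8.600140,  8.576233, 12.780842,  8.640809,  8.582085,
   14.690402,  8.643072,  8.581202, 14.774370,  8.648169,
   17.066158,  8.671902,  8.562560, 14.703254,  8.661131,
    8.571542, 13.148671,  8.639865,  8.591887, 14.783643 };
static const double start_to_end[25] = {
    7.775704,  7.769630,  7.784592,  7.757333,  7.775111,
    7.769630,  7.754963,  7.765926,  7.762370,  7.785481,
    7.768297,  7.769778,  7.781185,  7.763111,  7.776444,
    7.775704,  7.766667,  7.767852,  7.769629,  7.762222,
    7.760149,  7.775259,  7.747111,  7.766815,  7.770667 };

/* Approximate submit->start queueing delay for one iteration. */
static double queue_delay_ms(int i)
{
    return submit_to_end[i] - start_to_end[i];
}

/* Smallest delay across all iterations (the steady-state floor). */
static double min_queue_delay_ms(void)
{
    double m = queue_delay_ms(0);
    for (int i = 1; i < 25; i++)
        if (queue_delay_ms(i) < m) m = queue_delay_ms(i);
    return m;
}

/* Largest delay across all iterations (the worst burst). */
static double max_queue_delay_ms(void)
{
    double m = queue_delay_ms(0);
    for (int i = 1; i < 25; i++)
        if (queue_delay_ms(i) > m) m = queue_delay_ms(i);
    return m;
}
```

The floor of the delay comes out at roughly 0.8 ms, with bursts above 9 ms, which matches the "almost a millisecond" observation plus the occasional outliers.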

              • Re: Long time between enqueue and execute on GPU
                notzed

You guys all seem to have totally unrealistic expectations here, which seem to derive from a lack of understanding of fundamental computer and operating-system architecture. E.g. the first timing set is bursty because you're not running on a real-time operating system and other events can get in the way (and, as Jeff enlightened us, clFlush() isn't even a synchronous call in AMD's driver).

                 

And the whole reason these things have queues and buffers and flush calls is that a complete round-trip on its own is known to be expensive, limited by the laws of physics (e.g. RC delays in switching) as well as by operating-system overheads.

                 

                Timing 'enqueue' to 'complete' is like saying it takes 6 months (say) to make every car on the road because it takes that long to dig the oil, iron ore and coal out of the ground, make the steel, manufacture all the parts, and roll it off the production line.  But it doesn't really - it takes about 15 minutes (or whatever) to roll out each new one in turn because there's a whole lot of production 'in the pipeline'.

                 

This pipelining is pervasive throughout every level of every system, in computers from memory fetch to instruction execution to I/O: OpenCL isn't the only API which exposes this to the programmer (other complex systems do too, from manufacturing to biological systems). And even with AMD's plans for unifying the processing space, it will always be somewhat more expensive to move jobs between devices; and since the devices themselves will be faster too, it may still be a significant portion of time. So the problem is always there, even if you suddenly don't really care about it because it seems fast enough for your own problem.

                 

A more telling metric is how long it takes to launch one kernel after the previous one has finished, when both were added to the queue before a flush or synchronised call. If that were in the milliseconds, I'd be worried ...
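That metric can be computed from event timestamps: take CL_PROFILING_COMMAND_END of one kernel and CL_PROFILING_COMMAND_START of the next, both enqueued before a single flush. A minimal helper, with a stand-in typedef so it runs without the OpenCL headers; in real code you would include CL/cl.h and fill the timestamps via clGetEventProfilingInfo:

```c
#include <assert.h>

/* OpenCL profiling counters are cl_ulong values in nanoseconds;
   this typedef stands in for the one from CL/cl.h so the helper
   compiles standalone. */
typedef unsigned long long cl_ulong;

/* Gap between two back-to-back kernels: END timestamp of the first
   to START timestamp of the second, converted to microseconds. */
static double launch_gap_us(cl_ulong end_prev_ns, cl_ulong start_next_ns)
{
    return (double)(start_next_ns - end_prev_ns) / 1000.0;
}
```

If this gap stays in the tens of microseconds for kernels queued in one batch, the dispatch machinery itself is not the bottleneck, and the large submit-to-start numbers are batch/round-trip effects rather than per-kernel cost.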

                 

That's why marketing material for any data-transfer mechanism (e.g. networking) always talks about 'bandwidth' and not 'latency': latency is an issue in some algorithms, but in general bandwidth is much more important. Not every algorithm fits the restricted execution model of the GPU efficiently, and latency is just one restriction software design must accommodate. Many algorithms can be changed to hide it, or other work can be executed concurrently to utilise the otherwise idle time (e.g. running the graphics on a shared device). But others are fully exposed to it.

                 

You can't just replace a single MATLAB routine with a call to OpenCL and expect miracles.

                • Re: Long time between enqueue and execute on GPU
                  drallan

"I know there should be a delay in kernel setup, but I expected it to be in microseconds. Here there is a difference of almost a millisecond. Also, what is the reason for burst-timings in first case?"

                   

The extra burst times are probably from the OS updating the video display, which happens every 30 ms.

I wonder if your extra 800 us is spent transferring data used by your kernel; it's quite uniform. Try using half the data and see what happens.

                • Re: Long time between enqueue and execute on GPU
                  thesmileman

Note for AMD: Why do posts get set to "Assumed Answered" when not a single answer has been marked helpful or correct? In fact there hasn't been an attempt at answering the original issue, except to tell me to flush the queue, which was already happening.

                  • Re: Long time between enqueue and execute on GPU
                    ikuralenok

We are experiencing similar issues with kernel startup time. We've compared an AMD 7970 against a GeForce 590 and a GeForce 680 configuration and found that compute-intensive operations are far faster on the new AMD architecture (which is much more suitable for our tasks than Kepler). But all the time we save on computation is then wasted on startups of cheaper kernels. In our situation this means we have to invest in older NVIDIA Tesla cards instead of waiting for FirePro cards based on the new architecture. In our particular case we are talking about ~200 professional cards; let that number be a lower-bound estimate of how much this problem really costs. Can we expect any concrete feedback from AMD developers on the issue?

                    • Re: Long time between enqueue and execute on GPU
                      sh

I can confirm this. Launch latency is about 200 us and the constant CPU overhead is about 20 us per call.

Can we expect some progress with SI and the new runtime?