5 Replies Latest reply on Jun 11, 2011 1:54 PM by himanshu.gautam

    kernel launch time way too long

    laughingrice

      I'm trying to convert some code to OpenCL on the E-350 (Brazos) platform under Windows 7 64-bit (although the application is 32-bit).

      Timing the following three lines of code takes about 1.2 ms:

      size_t globalSz[2] = {320, 240};

       

      clEnqueueNDRangeKernel(queue, Test, 2, NULL, globalSz, NULL, 0, NULL, NULL);

      clFinish(queue);



      I believe that kernel launch overhead should be closer to 30 us, so I'm guessing that I'm doing something wrong. The kernel itself is empty, so I should only be seeing the kernel launch overhead. The original C code runs for about 1 ms, so 1.2 ms of kernel launch overhead is unacceptable. Any ideas what I may be doing wrong?

       

      Thanks

       

        • kernel launch time way too long
          jeff_golds

          You can't judge performance from a single launch, as there are a lot of one-time costs. For example, on Windows, buffers aren't allocated on the GPU until the GPU uses them, so the first time you touch a resource there is a higher cost.

          Jeff
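One way to keep one-time costs out of a reported figure is to discard the first few measurements before summarizing the rest. The following is a minimal sketch, not code from this thread; `timing_summary` and `summarize_after_warmup` are hypothetical names, and the samples are assumed to be per-launch times in microseconds collected by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: best and average time over the kept samples. */
typedef struct {
    double min_us;
    double mean_us;
} timing_summary;

/* Skip the first `warmup` samples (assumed to carry one-time costs such as
 * lazy buffer allocation or final compilation), then report the minimum
 * and the mean of the remaining samples. */
timing_summary summarize_after_warmup(const double *samples_us,
                                      size_t count, size_t warmup)
{
    assert(count > warmup);  /* at least one sample must remain */

    timing_summary s = { samples_us[warmup], 0.0 };
    double sum = 0.0;
    for (size_t i = warmup; i < count; i++) {
        if (samples_us[i] < s.min_us)
            s.min_us = samples_us[i];
        sum += samples_us[i];
    }
    s.mean_us = sum / (double)(count - warmup);
    return s;
}
```

Reporting both the minimum and the mean makes it easier to tell a consistently slow launch from one that is only slow the first time.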

            • kernel launch time way too long
              laughingrice

              I know that. It's not only buffer allocation; it can also be the final compilation. In this case, though, there are no buffers, and what I posted was the best run time out of 4. All three others were worse.

                • kernel launch time way too long
                  himanshu.gautam

                  laughingrice,

                  I suggest timing the region enclosed between the dashed lines below and computing the average time per launch.

                  clFinish(cmdQueue);
                  --------------------------------------------
                  for (int i = 0; i < 10000; i++)
                  {
                      clEnqueueNDRangeKernel(...);
                  }
                  clFinish(cmdQueue);
                  ---------------------------------------------

                  I hope you will see some improvement.

                  Thanks

                    • kernel launch time way too long
                      laughingrice

                      Tried that

                       

                      Ver1:
                      ------------
                      for (int i = 0; i < 1000; i++)
                      {
                          clEnqueueNDRangeKernel(...);
                      }
                      clFinish(cmdQueue);
                      ------------
                      Ver2:

                      for (int i = 0; i < 1000; i++)
                      {
                          clEnqueueNDRangeKernel(...);
                          clFinish(cmdQueue);
                      }
                      ------------

                       

                      Average run time for version one is ~900 us. For version two it's down to 46 ms / 1000 = 46 us, which is more acceptable, but only if I have a LOT of kernels that don't require synchronization.
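For anyone who wants to reproduce the comparison, the two loop variants can be wrapped in one small harness. This is only a sketch under stated assumptions: `launch_fn` and `finish_fn` are hypothetical callbacks standing in for `clEnqueueNDRangeKernel(...)` and `clFinish(cmdQueue)` so that the code compiles without an OpenCL SDK, the counting stubs exist purely for illustration, and `timespec_get` (C11) is a wall-clock timer rather than a monotonic one.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-ins for clEnqueueNDRangeKernel(...) and clFinish(queue). */
typedef void (*launch_fn)(void *ctx);
typedef void (*finish_fn)(void *ctx);

/* Illustration-only stubs that just count how often they are called. */
typedef struct { int launches; int finishes; } call_counts;
static void stub_launch(void *ctx) { ((call_counts *)ctx)->launches++; }
static void stub_finish(void *ctx) { ((call_counts *)ctx)->finishes++; }

/* Current wall-clock time in microseconds (C11 timespec_get). */
static double now_us(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Average time per launch in microseconds.
 * sync_each_launch == 0: enqueue everything, then finish once.
 * sync_each_launch != 0: finish after every enqueue. */
double average_launch_us(launch_fn launch, finish_fn finish, void *ctx,
                         int launches, int sync_each_launch)
{
    assert(launches > 0);

    finish(ctx);  /* drain pending work so it is not charged to the timer */
    double start = now_us();
    for (int i = 0; i < launches; i++) {
        launch(ctx);
        if (sync_each_launch)
            finish(ctx);
    }
    if (!sync_each_launch)
        finish(ctx);
    return (now_us() - start) / launches;
}
```

In a real measurement the callbacks would wrap the actual OpenCL calls, with the command queue and kernel passed through `ctx`. The initial `finish` call matters for the batched variant in particular, since any work still queued from earlier would otherwise be counted against the first launch.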

                       

                      Running the same code on an NVIDIA Tesla C1060 that I have lying around, ver1 takes about 100 us per launch (compared to 40 us for CUDA) and ver2 takes 36 ms / 1000 = 36 us. Admittedly, though, it's a much more powerful computer.

                       

                      Is anyone else seeing these launch times?

                       

                      Thanks

                        • kernel launch time way too long
                          himanshu.gautam

                          Okay, so it seems you are getting acceptable values in the second case.

                          I was expecting a better result for version 1, though, where kernels would be dispatched in batches rather than one after another.

                          I hope you put a clFinish before starting the timer in the first version.

                          Also, I prefer to use a reliable system timer or a profiler instead of querying time through cl_event objects, as the events themselves add some overhead.

                          Improving kernel launch time is a known feature request, and I think you will find it improved compared to previous AMD APP SDK versions. You can expect further improvements in future releases.

                          Thanks