4 Replies Latest reply on May 19, 2014 2:09 PM by maxdz8

    OpenCL API execution time is too long

    obara

      My system calls the following OpenCL APIs very frequently (10,000 times or more).

      ----------------------------------------------

      clEnqueueTask(command_queue, kernel, 0, NULL,&event);  (1us)

      clWaitForEvents(1, &event);                          (100us)

      ----------------------------------------------

      __kernel void add(__global int* A, __global float* B, __global float* C)

      {

        *C = *A + *B;

      }

      ----------------------------------------------------

      But there is a serious problem: the OpenCL API execution time is too long.

      For example,

      clEnqueueTask takes 1 us per call,

      and the clWaitForEvents call that follows takes 100 us per call.

       

       

      How can I reduce the API execution time?

        • Re: OpenCL API execution time is too long
          gopal

          Can you elaborate a little more? It is not clear from this what you want to ask.

           

          According to your data, the clEnqueueTask() call takes less time than the clWaitForEvents() call. Against what reference are you saying that the OpenCL execution time (in this case the clEnqueueTask() timing) is too long?

            • Re: OpenCL API execution time is too long
              obara

              Hi ratul

              Thank you for your reply.

               

              Let me explain the background of my question.

              I am trying to use an APU (AMD A10-7850A) for database aggregate computations.

              As a result, OpenCL APIs (such as "clEnqueueTask") are called 100,000,000 times.

              A GPGPU has excellent computing capability, but PCIe is a bottleneck.

               

               

              For that reason, I am trying to use an APU (AMD A10-7850A).

              This time, I measured the AMD A10-7850A's performance on database aggregate computations.

              The current result, however, is that the GPU in the APU (AMD A10-7850A) is far slower than the CPU,

              because the OpenCL APIs take a lot of processing time.

               

               

              For example,

                  clCreateBuffer:40us

                  clSetKernelArg:30us

                  clEnqueueTask & clWaitForEvents:100us

               

               

              Our DB system calls these APIs 100,000,000 times.

              Comparing the estimated times for the database aggregate computations,

              the combined CPU/GPU system is 100 times slower than the CPU alone.

              I think this is caused by the OpenCL API processing time.
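
              To put the figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes that every one of the 100,000,000 iterations performs one clCreateBuffer, one clSetKernelArg and one clEnqueueTask/clWaitForEvents pair, which the post does not state explicitly. The function name is made up for illustration.

```c
#include <stdio.h>

/* Per-call costs in microseconds, as quoted in the post. */
#define CREATE_BUFFER_US   40.0
#define SET_KERNEL_ARG_US  30.0
#define ENQUEUE_WAIT_US   100.0

/* Total API overhead in seconds for `iterations` iterations.
   Assumption: each iteration makes exactly one of each call above. */
double total_overhead_seconds(double iterations)
{
    double per_iteration_us =
        CREATE_BUFFER_US + SET_KERNEL_ARG_US + ENQUEUE_WAIT_US;
    return iterations * per_iteration_us / 1e6;
}

/* Usage:
       double s = total_overhead_seconds(100000000.0);
       printf("%.0f s (about %.1f hours)\n", s, s / 3600.0);
*/
```

              Under that assumption the overhead alone comes to 17,000 s, roughly 4.7 hours, before any useful work is counted.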

               

               

              I expected the OpenCL API processing time to be very small on an APU, but it is not.

              I want to know how to reduce the OpenCL API processing time,

              especially on the APU (AMD A10-7850A), where I think many optimizations should be possible.

                • Re: OpenCL API execution time is too long
                  gopal

                  1. @Because OpenCL-APIs takes much processing time.

                  For example,

                      clCreateBuffer:40us

                      clSetKernelArg:30us

                      clEnqueueTask & clWaitForEvents:100us

                   

                  First, how are you measuring the time of these API calls?


                  2. @ "CPU/GPU complex system is 100 times slower than CPU only. I think it's caused by OpenCL-APIs processing time."

                  Second, how are you comparing the CPU and CPU/GPU execution times? I mean, how are you calculating them?
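
                  For reference, one simple way to take such measurements is a host-side wall-clock harness like the sketch below. It is only an illustration: `timed_fn`, `average_call_us` and `count_call` are hypothetical names, and the callback stands in for whatever is being timed (an OpenCL call sequence or the CPU-only path). It assumes a POSIX system with clock_gettime. OpenCL also offers event profiling (clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE), which times the command on the device rather than the host call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#endif
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: wraps the code being timed. */
typedef void (*timed_fn)(void *user_data);

/* Current monotonic time in microseconds. */
static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Average wall-clock time per call in microseconds over `iterations`
   calls. Returns a negative value if the arguments are invalid. */
double average_call_us(timed_fn fn, void *user_data, size_t iterations)
{
    if (fn == NULL || iterations == 0)
        return -1.0;
    fn(user_data);  /* warm-up call, excluded from the measurement */
    double start = now_us();
    for (size_t i = 0; i < iterations; ++i)
        fn(user_data);
    return (now_us() - start) / (double)iterations;
}

/* Example callback: counts invocations; a stand-in for real work. */
void count_call(void *user_data)
{
    ++*(size_t *)user_data;
}
```

                  Timing many iterations and averaging avoids reading too much into a single call, and running the same harness over the CPU-only path gives a like-for-like comparison.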

                   

                  Thanks,

              • Re: OpenCL API execution time is too long
                maxdz8

                Hello Obara, I've also noticed some overhead in kernel dispatch. I have a kernel which takes quite some time to run. It has an internal loop over a known constant, so I tinkered a bit with "unrolling the loop" host-side.

                It turned out that I would saturate a core with about 1k dispatches per second (EnqueueNDRangeKernel).

                You probably cannot observe any high CPU usage, because the Wait forces a full stop-and-wait on the GPU, but this is very inefficient. You should absolutely try to "batch" (graphics jargon) more data into each call. It is my understanding that clEnqueueTask should really not be used (it is deprecated in OpenCL 2.0). If your kernel is a simple add, setting it up is going to take much, much longer than doing the work itself, but I take for granted that this is just an example.

                If you're using an in-order queue, wait only on the last task; you should already see some improvement, with an accompanying CPU spike.
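
                The effect of batching can be sketched with a simple cost model: a fixed overhead per dispatch plus a small cost per element. The function name and the per-element cost used in the note below are assumptions for illustration, not measurements. In OpenCL terms, a larger batch corresponds to one clEnqueueNDRangeKernel over many work-items, with the kernel indexing its buffers by get_global_id(0), instead of one clEnqueueTask per element.

```c
#include <stddef.h>

/* Estimated total time in microseconds to process `n` elements when
   each dispatch handles `batch` elements.
   overhead_us: fixed cost per dispatch (enqueue + wait).
   work_us:     compute cost per element (assumed, not measured).
   Returns a negative value if `batch` is zero. */
double batched_total_us(size_t n, size_t batch,
                        double overhead_us, double work_us)
{
    if (batch == 0)
        return -1.0;
    /* Number of dispatches is ceil(n / batch). */
    size_t dispatches = n / batch + (n % batch != 0);
    return (double)dispatches * overhead_us + (double)n * work_us;
}
```

                With 100,000,000 elements, 100 us of overhead per dispatch and an assumed 0.001 us of work per element, a batch size of 1 gives about 10,000 s, while a batch size of 1,000,000 gives about 0.11 s. In this model the dispatch overhead dominates until the batch is large enough to amortize it.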