3 Replies Latest reply on Sep 17, 2015 6:38 AM by dipak

    What is more efficient ....




      I wrote a simple OpenCL program in which several kernels are executed in a loop. The loop is shown in the code snippet below. The kernel verletStep1 sets the variable *verletNeedsUpdate to true when necessary, which happens roughly every 100 iterations. When it does, a list called verletlist must be updated, which is done by the three kernels erase cells, buildVerlet1 and buildVerlet2. In the solution shown below, memory is mapped from the GPU to host memory in every timestep.

      Alternatively, I tried calling the three kernels from the if branch on every iteration and wrapping the entire kernel body in an if condition, so that the kernels do nothing when *verletNeedsUpdate is false. On my Radeon R9 280 this second approach is slightly faster than branching on the host (but only by approx. 2 percent). On an Intel HD 4000 device (using the Beignet platform on Linux), however, the solution below is significantly faster than the alternative (though approx. 20 times slower than runs on the dedicated Radeon card using AMD's APP).


      Now my questions: Is there a more efficient way to conditionally enqueue kernels, depending on the result of a preceding kernel, than the two approaches I used?
      If not, which approach is better in OpenCL 1.2?


      cl::Buffer bufferVerletNeedsUpdate = cl::Buffer(context,
              CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, verletNeedsUpdateSize, verletNeedsUpdate);

      for (int i = 0; i < maxTimesteps; ++i) {
          queue.enqueueNDRangeKernel(verletStep1Kernel, cl::NullRange,
                  globalp, localp);

          if (*verletNeedsUpdate) {
              queue.enqueueNDRangeKernel(ereaseCellKernel, cl::NullRange,
                      globalc, localc);
              queue.enqueueNDRangeKernel(buildVerlet1, cl::NullRange, globalp,
              queue.enqueueNDRangeKernel(buildVerlet2, cl::NullRange, globalp,

          queue.enqueueNDRangeKernel(verletStep2, cl::NullRange, globalp,

          if (i % snapshot == 0) {
              std::cout << "Verletlistupdates: " << verletupdates << std::endl;

              if (i > 0) {
                  std::cout << "time " << timestep * i << "snapshot "
                          << *verletNeedsUpdate << std::endl;
                  char filename[16]; // string which will contain the number
                  sprintf(filename, "./data/snap%04d", snapnumber++);
                  saveSnapShot(filename, positions, velocities, accelerations,

              } // Write data from last snapshot to HD
                // then read the momentary data
              queue.enqueueReadBuffer(bufferPositions, CL_TRUE, 0, datasize,
              queue.enqueueReadBuffer(bufferVelocities, CL_TRUE, 0, datasize,
              queue.enqueueReadBuffer(bufferAccelarations, CL_TRUE, 0,
                      datasize, accelerations);
              queue.enqueueReadBuffer(bufferVerletNeedsUpdate, CL_TRUE, 0,
                      verletNeedsUpdateSize, verletNeedsUpdate);




        • Re: What is more efficient ....

          I think device-side enqueue would be the perfect solution in this scenario if the application can rely on OpenCL 2.0 support.

          Each approach has its own overhead. The 1st approach has the overhead of memory mapping, whereas the 2nd approach has some unnecessary kernel launch overhead. The cost of these two overheads depends on the particular system setup, so you may observe different performance on different systems. However, in my opinion, the 1st one is more logical and cleaner. The 2nd approach may perform poorly as the number of dependent kernel launches grows.
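As a rough illustration of that trade-off, here is a minimal host-side cost model. It is only a sketch: the struct names, function names and the per-operation costs are hypothetical placeholders, not measurements from either device in this thread. It assumes the cost of a guarded no-op kernel can be lumped into a single per-launch overhead, and that the 1st approach pays one flag readback per timestep.

```cpp
#include <cstdint>

// Hypothetical per-operation overheads in microseconds (placeholders,
// not measured values).
struct Overheads {
    double kernelLaunchUs;  // cost of enqueueing one (possibly no-op) kernel
    double mapReadbackUs;   // cost of making the flag visible on the host once
};

struct Workload {
    std::int64_t timesteps;       // number of loop iterations
    std::int64_t updateInterval;  // a rebuild is needed every N steps
    int rebuildKernels;           // kernels inside the conditional block
};

// 1st approach: branch on the host. One readback per step; the rebuild
// kernels are launched only on the steps that need them.
double hostBranchOverheadUs(const Workload& w, const Overheads& o) {
    const std::int64_t rebuilds = w.timesteps / w.updateInterval;
    return w.timesteps * o.mapReadbackUs
         + rebuilds * w.rebuildKernels * o.kernelLaunchUs;
}

// 2nd approach: guard inside the kernels. No readback, but every rebuild
// kernel is launched on every step.
double deviceGuardOverheadUs(const Workload& w, const Overheads& o) {
    return static_cast<double>(w.timesteps) * w.rebuildKernels * o.kernelLaunchUs;
}
```

With 1000 timesteps, a rebuild every 100 steps and 3 rebuild kernels, an assumed launch cost of 10 µs and readback cost of 20 µs gives 20300 µs for the 1st approach against 30000 µs for the 2nd. Swapping in 5 µs and 50 µs flips the result (50150 µs against 15000 µs), which is the kind of system dependence described above.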

          BTW, are the kernels inside the conditional block independent? If so, you can use multiple queues to launch them. It may improve the overall performance.

          Another point: you may try to rearrange or merge the kernel code in such a way that multiple kernel launches are avoided, if possible. However, in that case, you have to consider the impact of long, conditional kernel code.



            • Re: What is more efficient ....

              I forgot, thanks for your suggestions!


              At the moment I do not have an OpenCL 2.0 enabled device. Meanwhile I realised that, except for very small systems, the verletStep2 kernel is responsible for 80% of the computation time, so I am now using the first approach for the reasons you mentioned. By optimising memory alignment I was able to improve the performance of my "toy simulation" further, and it is now around 70 times faster than the non-parallel CPU version (Radeon R9 280 vs. Intel Core i5-4690), which is a remarkable speed-up in my opinion.