8 Replies Latest reply on Jan 21, 2016 9:41 AM by cgrant78@netzero.com

    OpenCL clSetKernelArg performance issue




      I'm working on a real time ray tracer with OpenCL.

      I have a structure that describes the camera, with position, orientation, field of view.
      Since the camera moves around I send it every frame to the GPU.


      And I'm doing it like this:


      Computedcamera    cm;

      // some code here

      clSetKernelArg(_kernel, 0, sizeof(Computedcamera), &cm);


      and my kernel looks like this:


      kernel void raytracer(Computedcamera const *camera /*, other arguments */) { /* ... */ }


      This method gives the best performance, 48 FPS on my test scene, but it doesn't work on devices from other vendors, such as Intel or Nvidia.


      If I change my kernel declaration to this (remove the *, to pass the argument by value):
      kernel void raytracer(Computedcamera const camera /*, other arguments */) { /* ... */ }

      It now works on my CPU (Intel), but the performance on my GPU drops to 27FPS.


      So I tried to pass this argument through a buffer:


      _camera_mem = clCreateBuffer(_context, CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY, sizeof(Computedcamera), 0, &error);

      each frame:

      clEnqueueWriteBuffer(_queue, _camera_mem, CL_TRUE, 0, sizeof(Computedcamera), (void *)&cm, 0, 0, 0);

      clSetKernelArg(_kernel, 0, sizeof(cl_mem), &_camera_mem);

      This last method works on all devices (AMD GPU, Intel CPU) but the performance on my GPU is around 43FPS (5 FPS less than the first method).


      I don't understand why the first method is faster than the others, or why it works at all.


      My config:

      Win10 64bits, AMD APP SDK 3.0, i7 920, R9 Nano.

        • Re: OpenCL clSetKernelArg performance issue

          Welcome and thanks for posting.

          I have white-listed you, so you should be able to directly post in the relevant forum. As this post is relevant to OpenCL, I am moving it to the OpenCL forum.


          Happy posting.



          • Re: OpenCL clSetKernelArg performance issue

            I would try passing it as a buffer, but using the __constant address space.

            • Re: OpenCL clSetKernelArg performance issue

              Can you run a CodeXL application timeline trace session and check whether the kernel execution time actually increased, or whether the slowdown comes from something else?

                • Re: OpenCL clSetKernelArg performance issue



                  I ran a CodeXL application timeline trace session and yes, the kernel execution time increased.
                  I have made some changes since my previous post: I now use OpenGL interop to draw the texture computed by the kernel, and the clEnqueueWriteBuffer call is non-blocking.


                  From here on I'm only comparing the broken clSetKernelArg method (first) with the clCreateBuffer + clEnqueueWriteBuffer method (second).
                  The camera data is used in only two lines of the kernel; they are always executed, exactly once, and do not depend on any other kernel argument.
                  When the camera doesn't look at the scene I get 500 FPS with the first method and 800 FPS with the second one.

                  When the camera looks at the scene it's 160 FPS with the first one and 130 FPS with the second one.
                  The amount of data transferred to the device is constant (and so is the transfer time); the only thing that changes is the execution time of the kernel.

                    • Re: OpenCL clSetKernelArg performance issue

                      Some historical perspective.

                      Setting shader/kernel arguments has always been a very expensive operation in every API, and OpenCL is no exception.

                      Setting kernel parameters is really supposed to be a one-shot operation, done immediately after creation and then forgotten forever.


                      So it is no surprise that using a buffer and avoiding resetting kernel args is faster, as in your first case. It is much more surprising that it becomes slower when more work is done; it is very disappointing to see such inconsistent behavior.


                      This is especially the case for GCN: if you look at the details, it has no real hardware constant buffer support. Constants are emulated at driver level, so the driver must figure out a layout and push the data into a buffer synthesized for you.


                      Try the following: at the beginning of your kernel, use a block copy operation (or do the copy yourself) to pull the data from global/constant memory into LDS, and then read it from there.

                  • Re: OpenCL clSetKernelArg performance issue

                    Could you get actual clock timings instead of using FPS as a performance metric? Just saying "my algorithm went from 1000 FPS to 100 FPS" gives no indication to anyone trying to work out why performance decreased. This was already suggested above, but FPS is not a valid performance metric, especially for a developer.
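
To illustrate why FPS is hard to reason about, here is a minimal sketch in C that converts frame rates into per-frame times. The helper names `fps_to_ms` and `frame_time_delta_ms` are made up for this example; they are not part of OpenCL, CodeXL, or any other API mentioned in the thread.

```c
#include <assert.h>

/* Hypothetical helper (illustration only): convert a frame rate in
   frames per second into the time spent on one frame, in milliseconds. */
double fps_to_ms(double fps)
{
    assert(fps > 0.0);   /* a zero or negative rate has no frame time */
    return 1000.0 / fps;
}

/* Hypothetical helper: extra milliseconds per frame when the frame rate
   changes from fps_before to fps_after. A positive result means each
   frame got slower; a negative result means each frame got faster. */
double frame_time_delta_ms(double fps_before, double fps_after)
{
    return fps_to_ms(fps_after) - fps_to_ms(fps_before);
}
```

Applied to the numbers quoted in this thread, `frame_time_delta_ms(48.0, 43.0)` is roughly 2.4 ms per frame, `frame_time_delta_ms(160.0, 130.0)` is roughly 1.4 ms, and `frame_time_delta_ms(500.0, 800.0)` is -0.75 ms. The FPS gaps look very different (5, 30 and 300), while the per-frame differences are all within a couple of milliseconds, which is why measured kernel and transfer times are easier to compare than frame rates.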