
    Inexplicable performance drop in OpenCL kernels

    alexb

      I've written an implementation of a 2D Fast Fourier Transform and tested it in a simple test application.

      One 2D FFT cycle consists of three kernel launches: a 1D FFT kernel, a transpose kernel, and the 1D FFT kernel again (the transpose lets both FFT passes run along rows).

      Profiling the test application with CodeXL shows that the 1D FFT takes 11 microseconds and the transpose takes 6 microseconds.

      This is shown in the picture "test timeline" attached to the post.

      11 + 6 + 11 microseconds made me pretty happy.

      But when I wrote the final application that uses these kernels, instead of 11 + 6 + 11 I got 400 + 300 + 400 microseconds!

      This is shown in the picture "app timeline".

      A roughly 40x degradation is completely unacceptable and puts the whole project in jeopardy.
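
      For reference, the CodeXL numbers can be cross-checked with OpenCL profiling events. Below is a minimal C++ sketch of such a check; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE, and the kernel and size names are placeholders rather than my actual code:

      // Minimal sketch: timing a single kernel launch with OpenCL profiling events.
      // Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE.
      // `fft1D`, `globalSize`, `localSize` are placeholder names.
      #include <CL/cl.h>
      #include <cstdio>

      void timeKernel(cl_command_queue queue, cl_kernel fft1D,
                      const size_t globalSize[2], const size_t localSize[2])
      {
          cl_event evt;
          clEnqueueNDRangeKernel(queue, fft1D, 2, nullptr,
                                 globalSize, localSize, 0, nullptr, &evt);
          clWaitForEvents(1, &evt);

          cl_ulong t0 = 0, t1 = 0;
          clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, nullptr);
          clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, nullptr);
          printf("kernel time: %.1f us\n", (t1 - t0) / 1000.0);  // timestamps are in ns
          clReleaseEvent(evt);
      }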

      Additional info:

      -- Both the test and the final app run on the same stand-alone Windows PC (so no change of power supply between runs), with a Radeon R9 280 GPU.

      -- The source text of all kernels is absolutely identical in the test project and the target project.

      -- The test project was written in C# using the Cloo binding for OpenCL; the CPU-side code looks like this:

      clock.Start();
      for (int i = 0; i < nTimeSteps; i++)
      {
          gpu.Queue.CopyBuffer(buffer0, buffer1, null);  // copy buffer0 to buffer1 each iteration
          gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);  // first 1D FFT pass
          gpu.Queue.Execute(transpose, null, new[] { n, (long) 256 }, new[] { 16, (long) 16 }, null);          // transpose in 16x16 work-groups
          gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);  // second 1D FFT pass
          gpu.Queue.Finish();  // block until the whole cycle has completed on the GPU
      }
      clock.Stop();

      -- The target project is written in C++ and is harder to cite, as it is spread across many .cpp files and is more complex; a sketch of what its inner cycle is intended to do is given below.
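
      For concreteness, here is a minimal sketch of the equivalent cycle against the plain OpenCL C API; the queue, kernel, and size names are placeholders, not the actual project code:

      // Sketch of the intended C++ equivalent of the C# loop above.
      // All names (queue, fft1D, transpose, workSize, n) are placeholders.
      #include <CL/cl.h>

      void run2DFFTCycle(cl_command_queue queue,
                         cl_kernel fft1D, cl_kernel transpose,
                         size_t workSize, size_t n)
      {
          const size_t fftGlobal[2] = { workSize, 256 };
          const size_t fftLocal[2]  = { workSize, 1 };
          const size_t trGlobal[2]  = { n, 256 };
          const size_t trLocal[2]   = { 16, 16 };

          // first 1D FFT pass
          clEnqueueNDRangeKernel(queue, fft1D, 2, nullptr, fftGlobal, fftLocal, 0, nullptr, nullptr);
          // transpose so the second pass also runs along rows
          clEnqueueNDRangeKernel(queue, transpose, 2, nullptr, trGlobal, trLocal, 0, nullptr, nullptr);
          // second 1D FFT pass
          clEnqueueNDRangeKernel(queue, fft1D, 2, nullptr, fftGlobal, fftLocal, 0, nullptr, nullptr);
          // block until the whole cycle has completed
          clFinish(queue);
      }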