
Inexplicable performance drop in OpenCL kernels

Question asked by alexb on Apr 20, 2016
Latest reply on Apr 20, 2016 by alexb

I've written an implementation of a 2D Fast Fourier Transform and tested it in a simple test application.

One 2D FFT cycle consists of 3 kernel launches: 1D FFT kernel, transpose kernel, 1D FFT kernel.
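As a sanity check on the math (not the OpenCL code itself), the row-FFT / transpose / row-FFT sequence can be reproduced in numpy; note that after these three steps the result is the transpose of the full 2D FFT, so a final transpose (or transposed indexing downstream) is implied:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)) + 1j * rng.standard_normal((256, 256))

# Step 1: 1D FFT along each row
step1 = np.fft.fft(a, axis=1)
# Step 2: transpose, so columns become rows
step2 = step1.T
# Step 3: 1D FFT along each row again (i.e. along the original columns)
step3 = np.fft.fft(step2, axis=1)

# The three-kernel cycle yields the full 2D FFT, transposed
assert np.allclose(step3, np.fft.fft2(a).T)
```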

Analysis of the test application with the CodeXL profiler shows that the 1D FFT takes 11 microseconds and the transpose takes 6 microseconds.

This is shown in the picture "test timeline" attached to the post.

11 + 6 + 11 microseconds made me pretty happy.

But when I wrote the final application that uses these kernels, instead of 11 + 6 + 11 I got 400 + 300 + 400 microseconds!

This is shown in the picture "app timeline".

A 40x degradation is completely unacceptable and puts the whole project in jeopardy.

Additional info:

-- both the test and the final app run on a stand-alone Windows PC (so no power-profile changes) with a Radeon R9 280 GPU.

-- the text of all kernels is absolutely identical in the test project and the target project.

-- the test project was written in C# using the Cloo OpenCL binding; the CPU-side code looks like:

clock.Start();
for (int i = 0; i < nTimeSteps; i++)
{
    gpu.Queue.CopyBuffer(buffer0, buffer1, null);
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    gpu.Queue.Execute(transpose, null, new[] { n, (long) 256 }, new[] { 16, (long) 16 }, null);
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    gpu.Queue.Finish();
}
clock.Stop();

-- the target project is written in C++ and is harder to quote, as it is spread across many .cpp files and is more complex.
