I've written an implementation of the 2D Fast Fourier Transform and tested it in a simple test application.
One 2D FFT cycle consists of 3 kernel launches: 1D FFT kernel, transpose kernel, 1D FFT kernel.
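For reference, this row–transpose–row decomposition can be checked numerically. Below is a minimal NumPy sketch of the same three steps; NumPy's FFT stands in for the OpenCL kernels, and the 256x256 size is just an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)) + 1j * rng.standard_normal((256, 256))

# Step 1: 1D FFT along each row.
step1 = np.fft.fft(a, axis=1)
# Step 2: transpose, so the former columns become rows.
step2 = step1.T
# Step 3: 1D FFT along each row again (i.e., along the original columns).
step3 = np.fft.fft(step2, axis=1)

# Transposing back gives the full 2D FFT.
assert np.allclose(step3.T, np.fft.fft2(a))
```

Note that the three-kernel cycle leaves the result in transposed layout; whether a fourth (transpose) pass is needed depends on what consumes the output.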
Profiling the test application with the CodeXL profiler shows that the 1D FFT takes 11 microseconds and the transpose takes 6 microseconds.
This is shown in the picture "test timeline" attached to the post.
The total of 11 + 6 + 11 microseconds made me pretty happy.
But when I wrote the final application that uses these same kernels, instead of 11 + 6 + 11 I got 400 + 300 + 400 microseconds!
This is shown in the picture "app timeline".
A 40x degradation is completely unacceptable and puts the whole project in jeopardy.
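The "40x" figure is just the ratio of the two per-cycle totals; a quick sanity check:

```python
# Per-cycle kernel times reported by CodeXL, in microseconds.
test_us = 11 + 6 + 11     # test application: FFT + transpose + FFT
app_us = 400 + 300 + 400  # final application: the same three kernels

ratio = app_us / test_us
print(round(ratio, 1))  # prints 39.3, i.e. roughly the 40x slowdown
```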
Additional info:
-- both the test and the final app run on the same stand-alone Windows PC (so there are no power-supply differences), with a Radeon R9 280 GPU.
-- the text of all kernels is absolutely identical in the test project and the target project.
-- the test project was written in C# using the Cloo binding of OpenCL; the CPU-side code looks like:
clock.Start();
for (int i = 0; i < nTimeSteps; i++)
{
    // Copy the input into the working buffer for this time step.
    gpu.Queue.CopyBuffer(buffer0, buffer1, null);
    // 1D FFT over the rows.
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    // Transpose, so the next pass runs over the original columns.
    gpu.Queue.Execute(transpose, null, new[] { n, (long) 256 }, new[] { 16, (long) 16 }, null);
    // 1D FFT over the rows of the transposed data.
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    gpu.Queue.Finish();
}
clock.Stop();
-- the target project is written in C++ and is harder to quote here, as it is spread across many .cpp files and is more complex.