I've written an implementation of the 2D Fast Fourier Transform and tested it in a simple test application.
One 2D FFT cycle consists of 3 kernel launches: 1D FFT kernel, transpose kernel, 1D FFT kernel.
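For reference, this row–transpose–row decomposition can be checked numerically. Below is a minimal NumPy sketch of the same three steps; NumPy's FFT stands in for the OpenCL kernels, and the 256x256 size is just an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)) + 1j * rng.standard_normal((256, 256))

# Step 1: 1D FFT along each row.
step1 = np.fft.fft(a, axis=1)
# Step 2: transpose, so the former columns become rows.
step2 = step1.T
# Step 3: 1D FFT along each row again (i.e., along the original columns).
step3 = np.fft.fft(step2, axis=1)

# Transposing back gives the full 2D FFT.
assert np.allclose(step3.T, np.fft.fft2(a))
```

Note that the three-kernel cycle leaves the result in transposed layout; whether a fourth (transpose) pass is needed depends on what consumes the output.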
Profiling the test application with the CodeXL profiler shows that the 1D FFT takes 11 microseconds and the transpose takes 6 microseconds.
This is shown in the picture "test timeline" attached to the post.
The total of 11 + 6 + 11 microseconds made me pretty happy.
But when I wrote the final application that uses these same kernels, instead of 11 + 6 + 11 I got 400 + 300 + 400 microseconds!
This is shown in the picture "app timeline".
A 40x degradation is completely unacceptable and puts the whole project in jeopardy.
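The "40x" figure is just the ratio of the two per-cycle totals; a quick sanity check:

```python
# Per-cycle kernel times reported by CodeXL, in microseconds.
test_us = 11 + 6 + 11     # test application: FFT + transpose + FFT
app_us = 400 + 300 + 400  # final application: the same three kernels

ratio = app_us / test_us
print(round(ratio, 1))  # prints 39.3, i.e. roughly the 40x slowdown
```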
Additional info:
-- both the test and the final app run on the same stand-alone Windows PC (so there are no power-supply differences), with a Radeon R9 280 GPU.
-- the text of all kernels is absolutely identical in the test project and the target project.
-- the test project was written in C# using the Cloo binding of OpenCL; the CPU-side code looks like:
clock.Start();
for (int i = 0; i < nTimeSteps; i++)
{
    // Copy the input into the working buffer for this time step.
    gpu.Queue.CopyBuffer(buffer0, buffer1, null);
    // 1D FFT over the rows.
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    // Transpose, so the next pass runs over the original columns.
    gpu.Queue.Execute(transpose, null, new[] { n, (long) 256 }, new[] { 16, (long) 16 }, null);
    // 1D FFT over the rows of the transposed data.
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    gpu.Queue.Finish();
}
clock.Stop();
-- the target project is written in C++ and is harder to quote here, as it is spread across many .cpp files and is more complex.