
alexb
Journeyman III

Inexplicable performance drop in OpenCL kernels

I've written an implementation of a 2D Fast Fourier Transform and tested it in a simple test application.

One 2D FFT cycle consists of 3 kernel launches: a 1D FFT kernel, a transpose kernel, and the 1D FFT kernel again.

Profiling the test application with CodeXL shows that the 1D FFT takes 11 microseconds and the transpose takes 6 microseconds.

This is shown in the picture "test timeline" attached to the post.

11 + 6 + 11 microseconds made me pretty happy.

But when I wrote the final application that uses these kernels, instead of 11 + 6 + 11 I got 400 + 300 + 400 microseconds!

This is shown in the picture "app timeline".

A roughly 40x slowdown is completely unacceptable and puts the whole project in jeopardy.
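For reference, the "40x" figure can be checked directly from the profiler numbers quoted above. The following is only a small arithmetic sketch; the struct and function names are made up for illustration and are not part of either project.

```cpp
#include <cassert>

// Per-kernel durations in microseconds for one 2D FFT cycle
// (1D FFT, transpose, 1D FFT), as quoted in the post.
// CycleTimes is a hypothetical name used only for this sketch.
struct CycleTimes {
    double fft_us;        // one 1D FFT pass
    double transpose_us;  // one transpose
};

// Total time of one cycle: two 1D FFT passes plus one transpose.
double cycle_total_us(const CycleTimes& t) {
    return 2.0 * t.fft_us + t.transpose_us;
}

// How many times slower the application cycle is than the test cycle.
double slowdown(const CycleTimes& test, const CycleTimes& app) {
    assert(cycle_total_us(test) > 0.0);  // avoid dividing by zero
    return cycle_total_us(app) / cycle_total_us(test);
}
```

With `CycleTimes{11, 6}` for the test and `CycleTimes{400, 300}` for the app, the totals are 28 and 1100 microseconds per cycle, which is a slowdown of about 39x.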

Additional info:

-- both the test and the final app are run on a stand-alone PC with Windows (so no issues with a change of power supply) and a Radeon R9 280 GPU.

-- the source text of all kernels is identical in the test project and the target project.

-- the test project was written in C# using the Cloo binding for OpenCL; the CPU-side code looks like this:

clock.Start();
for (int i = 0; i < nTimeSteps; i++)
{
    gpu.Queue.CopyBuffer(buffer0, buffer1, null);
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    gpu.Queue.Execute(transpose, null, new[] { n, (long) 256 }, new[] { 16, (long) 16 }, null);
    gpu.Queue.Execute(fft1D, null, new[] { workSize, (long) 256 }, new[] { workSize, (long) 1 }, null);
    gpu.Queue.Finish();
}
clock.Stop();

-- the target project is written in C++ and is harder to quote, as it is more complex and spread across many .cpp files.
alexb
Journeyman III

I've found the reason: the CL_MEM_ALLOC_HOST_PTR flag was used when creating the buffers.

Sorry for bothering the community.
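For anyone landing here with the same symptom, here is a minimal C++ sketch of the kind of change involved. It is an illustration, not the actual project code. It assumes the buffers in question are ones only the kernels touch inside the loop, and that on this platform CL_MEM_ALLOC_HOST_PTR leaves such buffers in host-accessible memory, so kernel accesses become much slower. The helper name `device_only` is made up. The fallback definitions exist only so the sketch compiles without the OpenCL headers; their values are the ones in cl.h.

```cpp
#include <cassert>

#if __has_include(<CL/cl.h>)
#include <CL/cl.h>
#else
// Fallback so the sketch compiles without the OpenCL SDK.
// The values below match the definitions in cl.h.
typedef unsigned long long cl_mem_flags;
#define CL_MEM_READ_WRITE     (1 << 0)
#define CL_MEM_ALLOC_HOST_PTR (1 << 4)
#endif

// Hypothetical helper: drop CL_MEM_ALLOC_HOST_PTR from a set of buffer
// flags, leaving the access flags (e.g. CL_MEM_READ_WRITE) untouched.
cl_mem_flags device_only(cl_mem_flags requested) {
    cl_mem_flags flags =
        requested & ~static_cast<cl_mem_flags>(CL_MEM_ALLOC_HOST_PTR);
    assert((flags & CL_MEM_ALLOC_HOST_PTR) == 0);
    return flags;
}

// Intended use with the real API (not compiled here, needs a context):
//   cl_mem buf = clCreateBuffer(context,
//                               device_only(CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR),
//                               bytes, nullptr, &err);
```

In practice it is usually simpler to pass plain `CL_MEM_READ_WRITE` for the working buffers and keep CL_MEM_ALLOC_HOST_PTR only for staging buffers that the host maps.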