I have a problem with kernels that occasionally take much longer to execute than they should.
Basically, I'm integrating a PDE using the pseudospectral method, meaning I have a loop, and in each iteration I enqueue a bunch of custom kernels together with some forward/backward FFTs.
When I profile the application (application trace, using CodeXL 1.0.2409.0), I get, for example, the result shown in the attached picture. Every x-th kernel takes much longer to execute than the kernels before it. Take the fft_fwd one, for example. It is followed by several more kernels and forward FFTs, including essentially the same fft_fwd again (but one iteration later), all of which take significantly less time to execute, until another kernel suddenly needs much more time than before (in the example: calc_nonLin_n). After that, several iterations are OK again.
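To pin down the "every x-th kernel" pattern more precisely than by eyeballing the trace, the per-kernel durations could be checked with something like the following. This is only a minimal sketch: the function name `findSlowKernels`, the use of microseconds, and the "factor times the median" criterion are my own assumptions and not anything provided by CodeXL or OpenCL.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CodeXL or OpenCL): returns the indices of
// all entries in durations_us that are larger than factor times the median.
std::vector<std::size_t> findSlowKernels(const std::vector<double>& durations_us,
                                         double factor)
{
    assert(factor > 0.0);

    std::vector<std::size_t> slow;
    if (durations_us.empty())
        return slow;

    // Work on a copy so the original enqueue order is preserved.
    // nth_element places the (upper) median at the middle position.
    std::vector<double> sorted(durations_us);
    std::nth_element(sorted.begin(), sorted.begin() + sorted.size() / 2, sorted.end());
    const double median = sorted[sorted.size() / 2];

    // Collect every kernel whose duration exceeds the threshold.
    for (std::size_t i = 0; i < durations_us.size(); ++i) {
        if (durations_us[i] > factor * median)
            slow.push_back(i);
    }
    return slow;
}
```

Fed with the kernel times exported from the trace and a factor of, say, 3, it returns the positions of the slow kernels in enqueue order, which should show whether the spikes are periodic or tied to particular kernels.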
Any ideas what might be causing this behavior? Any ideas how I could optimize the attached kernels in order to prevent it?
I run everything on an HD5850 with the latest beta driver. Platform version: AMD-APP (1084.2) (according to clinfo). AMD FFT library: 1.8.239. Windows 7 64-bit, Visual Studio 2010 (C++).
My global work size is 256x64 (for the custom kernels), and the local size is set to NULL. The FFT library does real-to-complex and complex-to-real transforms of 256x64 matrices. The custom kernels only do some element-wise matrix multiplications. I attached both the kernel sources and the host source code that enqueues all kernels for one time step (the function is called in a loop).
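For reference, this is roughly what each custom kernel computes, written as a plain C++ loop instead of OpenCL C. It is only a sketch and not the attached kernel source: the function name `elementwiseMultiply`, the real-valued `float` data, and the row-major layout are assumptions made for illustration. The two loops stand in for the 256x64 global work size, with one work-item per matrix element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Matrix dimensions taken from the global work size (256x64).
const std::size_t NX = 256;
const std::size_t NY = 64;

// CPU reference for an element-wise product: out[i] = a[i] * b[i].
// gy and gx play the role of get_global_id(1) and get_global_id(0).
// Assumptions: real-valued float data, row-major storage, hypothetical name.
std::vector<float> elementwiseMultiply(const std::vector<float>& a,
                                       const std::vector<float>& b)
{
    assert(a.size() == NX * NY && b.size() == NX * NY);

    std::vector<float> out(NX * NY);
    for (std::size_t gy = 0; gy < NY; ++gy) {
        for (std::size_t gx = 0; gx < NX; ++gx) {
            const std::size_t idx = gy * NX + gx;  // linear index of element (gx, gy)
            out[idx] = a[idx] * b[idx];
        }
    }
    return out;
}
```

Each element is read once and written once, so I would expect these kernels to be limited by memory bandwidth rather than by arithmetic.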