I'm a researcher developing Particle-in-Cell simulations in plasma physics using OpenCL with AMD's GPUs. Particle-in-Cell is an iterative method (iterating through time), which means we've got a "for" loop in which all kernels are enqueued for every time step.
Part of the algorithm is solving the Poisson equation. In our case this is too simple an operation to be worth running on the GPU, so it is done on the CPU. As a result, every time step we copy a small amount of data (400 floats, 1.6 kB in total) from GPU memory to CPU memory and back using clEnqueueReadBuffer() and clEnqueueWriteBuffer(). Despite the small size, these copies incur a massive overhead (over 90% of the program's runtime), which makes the whole program unusable. Mapping the buffers instead performs somewhat better, but it is still very slow.
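To give a sense of how little work the CPU-side step involves, here is a minimal sketch. It is not our production code: it assumes a 1D uniform grid of 400 points, normalized units (so the equation reads -phi'' = rho), zero potential at both boundaries, and a tridiagonal Thomas solve. The names `solve_poisson_1d` and `N_GRID` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N_GRID 400  /* 400 floats, i.e. 1600 bytes with 4-byte floats */

/* Illustrative helper (assumed, not from the real code base):
   solves -phi'' = rho on n interior grid points with spacing h and
   phi = 0 at both ends, using the Thomas algorithm on the
   (-1, 2, -1) tridiagonal system. */
void solve_poisson_1d(const float *rho, float *phi, size_t n, float h)
{
    float c_prime[N_GRID];  /* modified super-diagonal coefficients */
    assert(n >= 1 && n <= N_GRID);

    /* Forward sweep: eliminate the sub-diagonal. */
    c_prime[0] = -1.0f / 2.0f;
    phi[0] = (h * h * rho[0]) / 2.0f;
    for (size_t i = 1; i < n; ++i) {
        float denom = 2.0f + c_prime[i - 1];           /* b - a * c'[i-1], a = -1 */
        c_prime[i] = -1.0f / denom;
        phi[i] = (h * h * rho[i] + phi[i - 1]) / denom; /* d - a * d'[i-1] */
    }

    /* Back substitution: phi[n-1] is already final. */
    for (size_t i = n - 1; i-- > 0; ) {
        phi[i] -= c_prime[i] * phi[i + 1];
    }
}
```

In a setup like this, the host would call `solve_poisson_1d(rho_host, phi_host, N_GRID, h)` between the read and the write each time step. The solve is a few thousand floating-point operations, which is why the transfer cost dominating the runtime is so surprising.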
I'm developing on Windows. Having discussed this with colleagues who work on Linux, it appears the overhead doesn't occur there. We'd rather not switch to Linux, however, because some of the development tools we use aren't available for it.
I'm using the AMD Radeon Pro WX 9100 GPU, running the newest enterprise drivers. Any idea what could be causing this massive overhead? Could it be a driver-related issue?
Thank you for the help!