We are using OpenCL on Windows as part of a proprietary game-engine where we use the CL-GL interop functionality to communicate between the simulation and the rendering engine. Our core loop currently executes the following steps:
- Acquire GL objects
- Run simulation using OpenCL
- Release GL objects
- Ensure OpenCL operations are finished
- Render using OpenGL
- Swap Buffers
Currently, our main bottleneck is step 4: "Ensure OpenCL operations are finished". In contrast to nVidia's drivers, AMD's do not support implicit OpenCL/OpenGL synchronization so we need to synchronize explicitly by having the CPU wait for the OpenCL kernels to finish before starting to submit our rendering commands. Needless to say, this becomes a severe performance bottleneck if the load on the GPU is increased, causing the wait to be an (unacceptable) 8 - 20 ms on a Radeon R9 290X.
The "official" advice is to use OpenGL's ARB_cl_event extension, but that extension is not supported on any driver we tested with. Is there some other (undocumented) way of achieving this synchronization in a faster way using AMD cards?
If the OpenGL context is bound to the current thread, I think, step 3 can be used for this implicit synchronization. Because clEnqueueReleaseGLObjects says: