We are using OpenCL on Windows as part of a proprietary game-engine where we use the CL-GL interop functionality to communicate between the simulation and the rendering engine. Our core loop currently executes the following steps:
Currently, our main bottleneck is step 4: "Ensure OpenCL operations are finished". In contrast to nVidia's drivers, AMD's do not support implicit OpenCL/OpenGL synchronization so we need to synchronize explicitly by having the CPU wait for the OpenCL kernels to finish before starting to submit our rendering commands. Needless to say, this becomes a severe performance bottleneck if the load on the GPU is increased, causing the wait to be an (unacceptable) 8 - 20 ms on a Radeon R9 290X.
The "official" advice is to use OpenGL's ARB_cl_event extension, but that extension is not supported on any driver we tested with. Is there some other (undocumented) way of achieving this synchronization in a faster way using AMD cards?
If the OpenGL context is bound to the current thread, I think, step 3 can be used for this implicit synchronization. Because clEnqueueReleaseGLObjects says:
If the cl_khr_gl_sharing extension is supported and if an OpenGL context is bound to the current thread, then any OpenGL commands which does:
- affect or access the contents of a memory object listed in the
- are issued on that context after the call to
will not execute until after execution of any OpenCL commands preceding the
clEnqueueReleaseGLObjectswhich affect or access any of those memory objects.
Unfortunately, the behavior I see on Windows 7 x64 with an RX 560 (CL driver version 2580.6) when disabling this CPU-side synchronization is first a corrupted display followed by a "display driver stopped responding" message which generally leads to an unstable system requiring a Windows reboot in order to get something working again.
The CL driver for the device reports CL_TRUE for the CL_DEVICE_PREFERRED_INTEROP_USER_SYNC property which tells me it does not implicitly synchronize and requires user synchronization.
Sorry, it seems I misread the standard. In the above case, application still needs to ensure the synchronization (either by the event object returned by clEnqueueReleaseGLObjects or using clFinish). Otherwise, it may cause an undefined behavior (and that's might be the reason for the driver crash).
Alright, but that leads us back to my original question: is there a faster way to synchronize on AMD than having the CPU wait on the event, because this method is too slow to be usable.
Thank you dipak,
I hope they can find a solution. If they want to see an example of our app, we have a free application available on Steam specifically to test-drive the OpenCL/OpenGL functionality: Military Operations: Benchmark on Steam
I have same problem, my implementation of the Particle System and the Instance Culling more slowly than implementation on CPU side. May be this is the same problem with synchronization between OpenGL/Direct3D11 Vertex/Indirect buffers and OpenCL.
As we mostly discussed this topic via email, I just wanted to post few key notes/suggestions that might be helpful to other users as well.
a. OCL commands
b. OGL commands
d. OCL wait for the interop release
e. OGL commands with the interop object and other
a. Use two interop buffers – one for odd and one for even frames.
b. Use the interop object from the previous frame simulation in the current frame, in that case a wait should be pretty much a nop, because it’s done already. Also OGL and OCL will run completely asynchronously in this case.