CL-GL Interop fastest way to synchronize?
We are using OpenCL on Windows as part of a proprietary game-engine where we use the CL-GL interop functionality to communicate between the simulation and the rendering engine. Our core loop currently executes the following steps:
1. Acquire GL objects
2. Run simulation using OpenCL
3. Release GL objects
4. Ensure OpenCL operations are finished
5. Render using OpenGL
6. Swap Buffers
Currently, our main bottleneck is step 4, "Ensure OpenCL operations are finished". Unlike NVIDIA's drivers, AMD's do not support implicit OpenCL/OpenGL synchronization, so we have to synchronize explicitly by having the CPU wait for the OpenCL kernels to finish before we start submitting our rendering commands. This becomes a severe performance bottleneck as the load on the GPU increases: the wait is an unacceptable 8 to 20 ms on a Radeon R9 290X.
The "official" advice is to use OpenGL's ARB_cl_event extension, but none of the drivers we tested support that extension. Is there some other (possibly undocumented) way to achieve this synchronization faster on AMD cards?
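For reference, the loop described above maps onto API calls roughly as follows. This is only a minimal sketch: `RecordingBackend` and its method names are hypothetical stand-ins that record call order so the sequence can be shown without a GPU, and the comments name the OpenCL/OpenGL call each method is assumed to wrap in a real engine.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an engine's CL/GL wrapper. It only records the
// order of calls; each comment names the real API call the method is assumed
// to wrap in an actual implementation.
struct RecordingBackend {
    std::vector<std::string> calls;
    void acquire_gl_objects() { calls.push_back("acquire"); }   // clEnqueueAcquireGLObjects
    void run_simulation()     { calls.push_back("simulate"); }  // clEnqueueNDRangeKernel
    void release_gl_objects() { calls.push_back("release"); }   // clEnqueueReleaseGLObjects (keep its event)
    void wait_for_release()   { calls.push_back("cpu_wait"); }  // clWaitForEvents or clFinish
    void render()             { calls.push_back("render"); }    // OpenGL draw calls
    void swap_buffers()       { calls.push_back("swap"); }      // SwapBuffers
};

// One iteration of the loop as described in the post: the CPU blocks in
// step 4 before any OpenGL command for the frame is submitted.
template <class Backend>
void run_frame_blocking(Backend& b) {
    b.acquire_gl_objects();  // step 1
    b.run_simulation();      // step 2
    b.release_gl_objects();  // step 3
    b.wait_for_release();    // step 4: the 8 to 20 ms stall reported above
    b.render();              // step 5
    b.swap_buffers();        // step 6
}
```

Calling `run_frame_blocking` on a `RecordingBackend` leaves the six steps in `calls` in order, with the CPU wait directly before the render step.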
If the OpenGL context is bound to the current thread, I think step 3 can be used for this implicit synchronization, because the documentation for clEnqueueReleaseGLObjects says:
If the cl_khr_gl_sharing extension is supported and if an OpenGL context is bound to the current thread, then any OpenGL commands which
- affect or access the contents of a memory object listed in the mem_objects list, and
- are issued on that context after the call to clEnqueueReleaseGLObjects
will not execute until after execution of any OpenCL commands preceding the clEnqueueReleaseGLObjects which affect or access any of those memory objects.
Unfortunately, when I disable this CPU-side synchronization on Windows 7 x64 with an RX 560 (CL driver version 2580.6), I first get a corrupted display and then a "display driver stopped responding" message. This generally leaves the system unstable, and a Windows reboot is needed to get things working again.
The CL driver for the device reports CL_TRUE for the CL_DEVICE_PREFERRED_INTEROP_USER_SYNC property, which tells me it does not synchronize implicitly and requires user synchronization.
Sorry, it seems I misread the standard. In the above case, the application still needs to ensure synchronization (either through the event object returned by clEnqueueReleaseGLObjects or by using clFinish). Otherwise, the behavior may be undefined, and that might be the reason for the driver crash.
Alright, but that leads us back to my original question: is there a faster way to synchronize on AMD than having the CPU wait on the event? That method is too slow to be usable.
I'll check with the engineering team to see if they have any suggestions in this regard.
Thank you dipak,
I hope they can find a solution. If they want to see an example of our app, we have a free application available on Steam specifically to test-drive the OpenCL/OpenGL functionality: Military Operations: Benchmark on Steam
As we mostly discussed this topic via email, I just wanted to post a few key notes/suggestions that might be helpful to other users as well.
- The application needs to wait for OpenCL before it uses the interop object in OpenGL.
- The application can still submit interop-independent OpenGL commands, i.e. commands that do not require the interop object. Below is a typical call sequence:
a. OCL commands
b. OGL commands that do not use the interop object
c. glFlush()
d. OCL wait for the interop release
e. OGL commands that use the interop object, and any others
f. Present
- The application can also use two interop buffers to reduce the bottleneck and improve overall performance. For example:
a. Use two interop buffers, one for odd frames and one for even frames.
b. In the current frame, use the interop object from the previous frame's simulation. In that case the wait should be pretty much a no-op because that work is already done, and OGL and OCL will run completely asynchronously.
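The two suggestions above can be combined, and the following is a rough sketch of how the buffer selection and per-frame call order might look. All names (`FramePlan`, `plan_frame`, `frame_sequence`) and the assignment of even frames to buffer 0 are assumptions made for illustration, and the strings merely label steps a to f rather than issue real API calls. The sketch also leaves out the requirement that OpenGL must be finished with a buffer before OpenCL acquires it again.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Which of the two interop buffers each API touches in a given frame.
struct FramePlan {
    int simulate_into;  // buffer OpenCL writes during this frame
    int render_from;    // buffer OpenGL reads during this frame
};

// Assumed convention: even frames simulate into buffer 0, odd frames into
// buffer 1. Rendering always uses the other buffer, which is the one that
// was simulated during the previous frame.
FramePlan plan_frame(std::uint64_t frame) {
    const int write = static_cast<int>(frame % 2);
    return FramePlan{write, 1 - write};
}

// Labels for one frame's call order, following steps a to f of the suggested
// sequence with the double-buffer choice applied.
std::vector<std::string> frame_sequence(std::uint64_t frame) {
    // Frame 0 has no previously simulated buffer to render from.
    assert(frame >= 1);
    const FramePlan p = plan_frame(frame);
    const std::string w = std::to_string(p.simulate_into);
    const std::string r = std::to_string(p.render_from);
    return {
        "ocl: simulate into buffer " + w,        // a. OCL commands
        "ogl: interop-independent commands",     // b. OGL commands
        "glFlush",                               // c. glFlush()
        "ocl: wait for release of buffer " + r,  // d. wait; work was queued last frame
        "ogl: draw using buffer " + r,           // e. OGL commands with the interop object
        "present",                               // f. Present
    };
}
```

Because the wait in step d targets the buffer released during the previous frame, it is expected to return almost immediately, which is the point of the second suggestion.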
I have the same problem: my implementation of the particle system and instance culling runs more slowly than the implementation on the CPU side. Maybe this is the same synchronization problem between OpenGL/Direct3D 11 vertex/indirect buffers and OpenCL.
