OpenCL

george72 · ‎06-18-2018

We are using OpenCL on Windows as part of a proprietary game-engine where we use the CL-GL interop functionality to communicate between the simulation and the rendering engine. Our core loop currently executes the following steps:

1. Acquire GL objects
2. Run simulation using OpenCL
3. Release GL objects
4. Ensure OpenCL operations are finished
5. Render using OpenGL
6. Swap Buffers

Currently, our main bottleneck is step 4, "Ensure OpenCL operations are finished". Unlike NVIDIA's drivers, AMD's do not support implicit OpenCL/OpenGL synchronization, so we have to synchronize explicitly by making the CPU wait for the OpenCL kernels to finish before we start submitting rendering commands. This becomes a severe performance bottleneck as the GPU load increases: the wait takes an unacceptable 8 to 20 ms on a Radeon R9 290X.

The "official" advice is to use OpenGL's ARB_cl_event extension, but that extension is not supported by any driver we tested. Is there some other (undocumented) way to achieve this synchronization faster on AMD cards?

dipak · ‎06-20-2018

If the OpenGL context is bound to the current thread, I think step 3 can provide this implicit synchronization, because the documentation for clEnqueueReleaseGLObjects says:

If the cl_khr_gl_sharing extension is supported and if an OpenGL context is bound to the current thread, then any OpenGL commands which:
affect or access the contents of a memory object listed in the mem_objects list, and
are issued on that context after the call to clEnqueueReleaseGLObjects
will not execute until after execution of any OpenCL commands preceding the clEnqueueReleaseGLObjects which affect or access any of those memory objects.

george72 · ‎06-29-2018

Unfortunately, the behavior I see on Windows 7 x64 with an RX 560 (CL driver version 2580.6) when disabling this CPU-side synchronization is first a corrupted display followed by a "display driver stopped responding" message which generally leads to an unstable system requiring a Windows reboot in order to get something working again.

The CL driver for the device reports CL_TRUE for the CL_DEVICE_PREFERRED_INTEROP_USER_SYNC property which tells me it does not implicitly synchronize and requires user synchronization.

dipak · ‎06-29-2018

Sorry, it seems I misread the standard. In the above case, the application still needs to ensure synchronization itself (either through the event object returned by clEnqueueReleaseGLObjects or by calling clFinish). Otherwise the behavior is undefined, and that might be the reason for the driver crash.

george72 · ‎06-29-2018

Alright, but that leads us back to my original question: is there a faster way to synchronize on AMD than having the CPU wait on the event? That method is too slow to be usable.

dipak · ‎07-02-2018

I'll check with the engineering team to see whether they have any suggestions in this regard.

george72 · ‎07-03-2018

Thank you dipak,

I hope they can find a solution. If they want to see an example of our app, we have a free application available on Steam specifically to test-drive the OpenCL/OpenGL functionality: Military Operations: Benchmark on Steam

dipak · ‎08-23-2018

As we mostly discussed this topic via email, I just wanted to post a few key notes and suggestions that might be helpful to other users as well.

The application needs to wait for OpenCL to finish before OpenGL uses the interop object.
The application can still submit interop-independent OpenGL commands, i.e. commands that do not require the interop object. Below is a typical call sequence:

a. OpenCL commands
b. OpenGL commands that do not use the interop object
c. glFlush()
d. OpenCL wait for the interop release
e. OpenGL commands that use the interop object, plus any others
f. Present
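As a rough illustration of the ordering in steps a to f, here is a minimal C sketch. Every function in it is a hypothetical stub that only records its own name; none of them comes from the thread. The comments name the real calls each stub might wrap in an application, and that mapping (including the clFlush in step a) is an assumption, not something AMD stated.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical trace buffer: records the order in which a frame issues work. */
static char trace[256];

static void record(const char *step) {
    /* Guard against overflowing the fixed-size trace. */
    assert(strlen(trace) + strlen(step) + 2 < sizeof trace);
    if (trace[0] != '\0') strcat(trace, ">");
    strcat(trace, step);
}

/* Stubs. Comments list the real API calls each one is assumed to wrap. */

/* a. clEnqueueAcquireGLObjects, kernels, clEnqueueReleaseGLObjects (keeping
 *    its event); a clFlush here is assumed so the GPU starts right away. */
static void cl_enqueue_simulation(void) { record("cl_sim"); }

/* b. Draw calls that never touch the shared buffer (UI, terrain, ...). */
static void gl_draw_independent(void) { record("gl_independent"); }

/* c. glFlush: hand the independent work to the GPU before blocking. */
static void gl_flush(void) { record("gl_flush"); }

/* d. clWaitForEvents on the event from clEnqueueReleaseGLObjects. */
static void cl_wait_for_release(void) { record("cl_wait"); }

/* e. Draw calls that read the shared buffer. */
static void gl_draw_interop(void) { record("gl_interop"); }

/* f. SwapBuffers or equivalent. */
static void present(void) { record("present"); }

/* Runs one frame and returns the recorded order of steps. */
const char *run_frame(void) {
    trace[0] = '\0';
    cl_enqueue_simulation();
    gl_draw_independent();
    gl_flush();
    cl_wait_for_release();   /* the only CPU-side block in the frame */
    gl_draw_interop();
    present();
    return trace;
}
```

The point of this ordering is that the CPU-side wait in step d overlaps with GPU work already submitted in steps b and c, instead of sitting in front of all rendering. How much that helps depends on how much of the frame is independent of the interop object.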

The application can also use two interop buffers to reduce the bottleneck and improve overall performance. For example:

a. Use two interop buffers: one for odd frames and one for even frames.
b. In the current frame, render from the interop object that the previous frame's simulation wrote. The wait should then be practically a no-op, because that work has already finished, and OpenGL and OpenCL run completely asynchronously.
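The two-buffer scheme comes down to index bookkeeping, which can be sketched in C as follows. The function names and the -1 convention for the first frame are hypothetical choices made for this example; only the odd/even alternation and the "render the previous frame's result" rule come from the suggestion above.

```c
#include <assert.h>

enum { NUM_INTEROP_BUFFERS = 2 };

/* Index of the interop buffer the OpenCL simulation writes during `frame`. */
int sim_buffer(int frame) {
    assert(frame >= 0);
    return frame % NUM_INTEROP_BUFFERS;
}

/* Index of the interop buffer OpenGL renders from during `frame`: the one
 * the simulation wrote in frame - 1. Returns -1 for frame 0, because no
 * simulated result exists yet (hypothetical convention for this sketch). */
int render_buffer(int frame) {
    assert(frame >= 0);
    if (frame == 0) return -1;
    return (frame - 1) % NUM_INTEROP_BUFFERS;
}
```

For frames 0, 1, 2, 3 the simulation writes buffers 0, 1, 0, 1 while rendering reads none, 0, 1, 0, so the two APIs never touch the same buffer within one frame. The cost is that the image on screen lags the simulation by one frame, and the application still has to wait on the previous frame's release event before rendering, even though that wait is expected to return almost immediately.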

Andrey2007AMD · ‎07-04-2018

I have the same problem: my implementations of the particle system and instance culling are slower than the CPU-side implementation. Maybe this is the same synchronization problem between OpenGL/Direct3D11 vertex/indirect buffers and OpenCL.

CL-GL Interop fastest way to synchronize?