Archives Discussions

kcin · ‎02-08-2012

My program uses OpenGL to draw on the screen. This task is low-intensity but latency-sensitive. Another CPU thread of my program performs calculations that are highly intensive but low priority. I implemented these calculations to run on the GPU using OpenCL. Since then I have observed excessive visual latency in the OpenGL drawing.

Is it possible to manage the GPU load or to schedule tasks on the GPU? I use an AMD Radeon HD 5850. As far as I know, there is a 'Device Fission' extension for OpenCL that is helpful in such cases, but it is supported only for CPUs, isn't it?

Another thing that concerns me is that AMD System Monitor shows GPU occupancy at the 55% level. So, in my opinion, there should be enough resources to run both parts, OpenGL and OpenCL, without stalling.

I made a small experiment with my OpenCL kernel. I restricted the global work size to a number small enough to guarantee that some compute units are idle while the kernel is running. To compensate, each work-item does more work. So the GPU is now less occupied, but the kernel runs longer. As a result, OpenGL becomes slower. Hence, it looks like OpenGL drawing and OpenCL calculations can't be done concurrently. Where can I find information on which cards support such functionality?
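For readers who want to reproduce the experiment, here is a minimal sketch of the index arithmetic involved. The original kernel is not shown in the post, so this assumes a strided loop in which each work-item handles every `global_size`-th element; the function names and sizes are hypothetical and only illustrate the idea.

```python
def indices_for_work_item(gid, global_size, total_items):
    """Indices processed by work-item `gid` when only `global_size`
    work-items are launched for `total_items` elements (strided loop)."""
    return range(gid, total_items, global_size)


def max_items_per_work_item(global_size, total_items):
    """Upper bound on the elements one work-item must process (ceiling division)."""
    return -(-total_items // global_size)


# Hypothetical sizes: 10 elements, launch restricted to 4 work-items.
TOTAL_ITEMS = 10
RESTRICTED_GLOBAL_SIZE = 4

for gid in range(RESTRICTED_GLOBAL_SIZE):
    work = list(indices_for_work_item(gid, RESTRICTED_GLOBAL_SIZE, TOTAL_ITEMS))
    print(f"work-item {gid} handles {work}")

print("max elements per work-item:",
      max_items_per_work_item(RESTRICTED_GLOBAL_SIZE, TOTAL_ITEMS))
```

The total amount of work is unchanged; it is only spread over fewer work-items, which is why a single kernel launch takes longer.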

Meteorhead · ‎02-08-2012

The HD5xxx series does not support concurrent execution of different kernels. The HD7xxx series is the first that is able to do this, and even it can only do so with 2 types of kernels. (Someone correct me if I'm wrong.) The thread scheduler of Fermi cards is more advanced: it can handle a lot more types of kernels and fill gaps of idle Compute Units with kernels.

FYI, HDxxxx cards all feature thread schedulers that can queue multiple types of kernels, but only one type can execute at any given time (two types on HD7xxx). It is good to know that if you dispatch kernels in the command queue and concurrently flush OpenGL commands, both the compute and display kernels are dispatched onto the device, but there is no prioritizing of kernels (not even on HD7xxx), and compute kernels are always executed before display kernels.

kcin · ‎02-08-2012

Thank you for the useful answer!

There are two points I didn't understand:

1. When you say that 2 types of kernels can be executed concurrently, what do you mean? OpenGL and OpenCL kernels? Two different OpenCL kernels? Something else?

2. You said "there is no prioritizing of kernels (not even on HD7xxx), and compute kernels are always executed before display kernels". How should this be understood? Are compute kernels moved up in the command queue if there are OpenGL commands ahead of them?

Is there any technical paper where I can find information on my issue?

Meteorhead · ‎02-09-2012

Sorry if I wasn't clear enough.

1: By two types of kernels I mean any two kernels that don't share exactly the same kernel/shader code.

2: Inside the command queue, no. The command queue is a completely host-side mechanism, and groups of enqueued NDRange commands will only overtake each other there if CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is set on the command queue (currently only supported by the Intel SDK). The command queue and the thread scheduler of the device are two different queues, so to speak. You have full control over the former and practically none over the latter. If you look at the profiling timestamps of a kernel's event, SUBMIT - QUEUED is the time the kernel spent waiting in the command queue, START - SUBMIT is the time it spent on the PCI bus AND inside the device thread scheduler, and naturally END - START is the time it spent executing.
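To make the timing breakdown above concrete, here is a small sketch of the arithmetic. It assumes the four timestamps have already been read from a kernel's event with clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END, in nanoseconds) on a queue created with CL_QUEUE_PROFILING_ENABLE. The class, the function and the sample values are made up for illustration and are not measurements from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventTimestamps:
    """Profiling timestamps of one OpenCL event, in nanoseconds."""
    queued: int  # CL_PROFILING_COMMAND_QUEUED: host enqueued the command
    submit: int  # CL_PROFILING_COMMAND_SUBMIT: host submitted it to the device
    start: int   # CL_PROFILING_COMMAND_START: device began executing
    end: int     # CL_PROFILING_COMMAND_END: device finished executing


def split_latency(ts):
    """Split an event's lifetime into the three intervals described above."""
    return {
        "command_queue_wait_ns": ts.submit - ts.queued,  # waiting in the host-side queue
        "device_wait_ns": ts.start - ts.submit,          # bus transfer + device scheduler
        "execution_ns": ts.end - ts.start,               # actual kernel run time
    }


# Made-up sample values, only to show the calculation.
sample = EventTimestamps(queued=1_000, submit=4_000, start=9_500, end=30_000)

for name, ns in split_latency(sample).items():
    print(f"{name}: {ns / 1e6:.4f} ms")
```

If the middle interval grows while the others stay flat, the command is waiting on the device side rather than in the host-side queue, which is the part the application cannot control.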

About technical papers... I really don't know. I gathered all this information from various news portals and technical reviews. The only thing I'm not 100% sure about is that HD7xxx can only handle two sets of kernels at once, since the thread scheduler was redesigned and called ACE (Asynchronous Compute Engine), if I'm not mistaken about the abbreviation. It was said that it will be able to feed idle Compute Units a lot better than earlier generations, but somewhere else it was said that there are two ACEs, and that they operate on two different parts of the GPU (16 CUs for one and 16 CUs for the other). If they are truly asynchronous and can handle multiple types of kernels (grouped by code), then it should be completely transparent (or irrelevant from a programming point of view) that there are two ACEs, and most likely it's only a necessity because one ACE cannot handle more than 16 CUs efficiently. (Which is absolutely no problem if there can be multiple engines that share the same thread queue on the device.)

OpenCL and OpenGL simultaneous execution