With yesterday's announcement by AMD about Mantle and the performance gains from this low-level API, I began wondering about OpenCL.
If you are not familiar with Mantle, AnandTech has a pretty good summary ("Understanding AMD’s Mantle: A Low-Level Graphics API For GCN"). What I got out of it is that performance inherently suffers because of the significant overhead of writing to a generic device in DirectX or OpenGL. By coding directly to the new AMD Hawaii GPU (5 TFLOPS, btw) with the Mantle API, developers can achieve 9x the draw call throughput. My question is whether anyone has an idea what kind of overhead OpenCL introduces and what, if anything, we can do to get around it. If I could get a 9x or even 3x performance improvement by coding to a specific device (e.g. a high-end FirePro), I would be more than happy to do that for my most performance-intensive subroutines.
That 9x draw call gain means the card can run more different shaders and switch between more textures in every frame it draws. As computing power has grown, it has become difficult to feed the GPU with enough work through the current DirectX stack, which needs frequent interaction between CPU and GPU. With this new low-level API that interaction will be 9x faster, which leads to higher GPU utilization.
OpenCL also has significant overhead when you enqueue kernels very frequently: let's say a kernel takes 3 ms per launch; the actual GPU work may take only 1 ms of that, and the rest is host-GPU interaction. But in OpenCL we can organize the work into long kernels with execution times around 1 second, and those can utilize the GPU effectively.
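To put numbers on the example above, here is a minimal Python sketch of how a fixed per-launch cost gets amortized. It assumes the host-GPU interaction cost is constant per launch and does not depend on kernel length; the 2 ms figure is simply the 3 ms total minus the 1 ms of GPU work from the post, and the function name is made up for illustration.

```python
def gpu_utilization(gpu_work_ms, overhead_ms):
    """Fraction of each launch's wall time spent on actual GPU work."""
    return gpu_work_ms / (gpu_work_ms + overhead_ms)

# Example from the post: 3 ms per launch in total, of which 1 ms is GPU work.
short_kernel = gpu_utilization(gpu_work_ms=1.0, overhead_ms=2.0)

# Same assumed 2 ms overhead, amortized over a kernel of about 1 second.
long_kernel = gpu_utilization(gpu_work_ms=1000.0, overhead_ms=2.0)

print(f"short kernel: {short_kernel:.1%}, long kernel: {long_kernel:.1%}")
```

Under these assumptions the short kernel keeps the GPU busy about a third of the time, while the long kernel keeps it busy more than 99% of the time.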
Mantle solves primary issues around:
- CPU overhead for any single API call
- move memory management outside of critical path
- allow scaling of CPU setup performance across N threads (with no hidden lock)
- some GPU perf improvements by grouping of HW states
Compute inherently has much lower overhead, and with HSA the objective is to remove most of the OS/interaction overhead to decrease dispatch time.
I would recommend that you contact the FirePro ISV team to get better information.
Well, this is going to be a cool optimization.
(CPU scaling: I know that feel with the overuse of LOCK prefix. It kicks the whole CPU unconscious, haha. Just realised it with Delphi's - otherwise really fast and efficient - memory manager. That one can't handle much memory managing while using multiple cores.)
That's interesting... I haven't seen anyone link Mantle to the FirePro driver yet. Are they related?
Back to the OpenCL overhead: realhet is correct that there is still significant overhead, at least in the Catalyst/fglrx implementation. That overhead forces the OpenCL programmer either to organize kernels so their execution times are long relative to the enqueue overhead, or to find a way to submit OpenCL kernels concurrently. Even for simple kernels, Catalyst's user-side code can issue a hundred or more ioctl() calls to the fglrx driver on the first enqueue call, which can eat up anywhere from half a millisecond to a couple of milliseconds before actual GPU execution. Repeated calls to the same enqueue are then still lengthy, on the order of a tenth of a millisecond. Thus, the programmer needs to plan for kernel execution times on the order of a millisecond to a full second. If the enqueue overhead were instead an order of magnitude or more lower (say 5 microseconds, the typical Unix interprocess communication latency), it would make the programming easier.
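The "order of a millisecond" planning figure can be checked with a rough Python sketch. It assumes a fixed enqueue overhead per launch and no overlap between launches; the 90% utilization target and the function name are illustrative choices, not something from the posts above.

```python
def min_kernel_ms(overhead_ms, target_utilization):
    """Shortest kernel time that keeps GPU utilization at or above the target.

    Solves work / (work + overhead) >= target for work.
    """
    if not 0.0 < target_utilization < 1.0:
        raise ValueError("target_utilization must be strictly between 0 and 1")
    return overhead_ms * target_utilization / (1.0 - target_utilization)

# Overheads quoted in the post: about 0.1 ms for a repeated enqueue,
# versus a hypothetical 5 microsecond enqueue.
for label, overhead_ms in [("repeated enqueue (~0.1 ms)", 0.1),
                           ("hypothetical 5 us enqueue", 0.005)]:
    needed = min_kernel_ms(overhead_ms, 0.9)
    print(f"{label}: kernel must run >= {needed:.3f} ms for 90% utilization")
```

With a 0.1 ms enqueue the kernel has to run for roughly 0.9 ms to reach 90% utilization; at 5 microseconds the same target is met by a kernel of about 0.045 ms, which is why a cheaper enqueue would allow much finer-grained kernels.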
I have Catalyst 13.12 and an HD 7870, and wrote a fluid solver that can do at most 1500-2000 kernel executions per second on a single command queue with a 512x512 grid. This is with only a single clFinish() at the end of the queue. Without the clFinish() it can do about 3000 executions per second, but then it runs asynchronously and the OpenGL output shows some artifacts. Even so, it computes much faster than my 4-module FX-8150 for both raytracing and fluid solving. I think it would be very cool if it could run 100k kernels per second. That could greatly increase the performance of an iterative raytracer versus a semi-recursive version.
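For scale, the launch rates above can be converted into a time budget per launch. This is plain arithmetic in Python; note that the result is the total time per launch (enqueue overhead plus the GPU work on the 512x512 grid), not the overhead alone, and the labels are only descriptive.

```python
def per_launch_us(launches_per_second):
    """Average wall time available per kernel launch, in microseconds."""
    return 1_000_000.0 / launches_per_second

# Rates quoted in the post, plus the 100k per second wish.
rates = [
    ("with clFinish(), low end", 1500),
    ("with clFinish(), high end", 2000),
    ("without clFinish()", 3000),
    ("wished-for rate", 100_000),
]

for label, rate in rates:
    print(f"{label}: {rate} launches/s -> {per_launch_us(rate):.1f} us per launch")
```

Going from about 3000 to 100k launches per second would shrink the budget from roughly 333 microseconds to 10 microseconds per launch, so both the enqueue cost and the per-kernel GPU work would have to fit in that window.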