2 Replies Latest reply on Aug 11, 2017 7:52 AM by asylum29

    Reasons of glDrawElements(BaseVertex) having huge driver overhead?


      I'm currently refactoring a complex renderer that is likely CPU/driver bound.

      In a nutshell:

      - all rendering is done on a background thread (consumer) which is fed through a blocking queue of "rendering tasks"

      - the main thread (producer) issues a lot of small batched draw calls like "moveto(a, b) - lineto(c, d) - lineto(e, f) - ... - flush()", inserting a task into the queue when necessary

      - I designed the abstract interface so that it resembles modern graphics APIs (Metal/Vulkan), so I have buffers, textures, renderpasses and graphics pipelines

      - the corresponding GL objects are cached, whenever possible (VAOs, programs, framebuffers), so that GL object creations are minimized

      - GL state changes are not managed (with a few exceptions like framebuffer, VAO and program changes)

      - buffer data uploads are optimized with the GL_MAP_UNSYNCHRONIZED_BIT; should the buffer become full, I do a glFenceSync+glWaitSync (doesn't happen too often, though)
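
      For readers unfamiliar with the setup, the producer/consumer hand-off from the first two points could be sketched roughly as below. This is only an illustration under assumptions: `RenderTaskQueue` and `Task` are made-up names, and the real renderer's task type is not shown in the post.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical blocking queue of "rendering tasks"; names are illustrative only.
class RenderTaskQueue {
public:
    using Task = std::function<void()>;

    // Producer side (main thread): enqueue one task and wake the consumer.
    void push(Task task) {
        assert(task && "empty tasks are not allowed");
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        ready_.notify_one();
    }

    // Consumer side (render thread): blocks until a task is available.
    Task pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !tasks_.empty(); });
        Task task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<Task> tasks_;
};
```

      The render thread would loop on `pop()` and run each task with the GL context current, so every GL call stays on that one thread.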


      Despite my efforts, so far the rendering is incredibly slow. VTune shows that the majority of the CPU time is spent in glDrawElementsBaseVertex.



      Same thing on Intel cards. It looks very much like a driver-limited case; the funny thing, however, is that the old DX9 implementation is about 200x faster (hell, even GDI is faster).


      So my question is: what might cause a drawing command to have such overhead? Or in other words: what state changes should I look out for to avoid this?


      UPDATE: I also tried buffer orphaning and the GL_STREAM_DRAW usage hint... no effect...

        • Re: Reasons of glDrawElements(BaseVertex) having huge driver overhead?

          The problem is somewhat solved... It looks like the GPU was flooded with draw commands.
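
          The post does not say how the number of draw commands was reduced. One common way, sketched here purely as an assumption, is to accumulate the small "moveto/lineto" batches on the CPU and submit them with a single indexed draw per flush. `Vertex` and `LineBatcher` are hypothetical names, and the GL call appears only in a comment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not taken from the original renderer.
struct Vertex { float x, y; };

class LineBatcher {
public:
    // Starts a new polyline at (x, y).
    void moveTo(float x, float y) {
        vertices_.push_back({x, y});
        hasCurrent_ = true;
    }

    // Extends the current polyline by one segment (two GL_LINES indices).
    void lineTo(float x, float y) {
        assert(hasCurrent_ && "lineTo requires a preceding moveTo");
        const std::uint32_t prev =
            static_cast<std::uint32_t>(vertices_.size() - 1);
        vertices_.push_back({x, y});
        indices_.push_back(prev);
        indices_.push_back(prev + 1);
    }

    const std::vector<Vertex>& vertices() const { return vertices_; }
    const std::vector<std::uint32_t>& indices() const { return indices_; }

    // Call after the arrays were uploaded and drawn with ONE call, e.g.
    // glDrawElements(GL_LINES, indexCount, GL_UNSIGNED_INT, indexOffset).
    void clear() {
        vertices_.clear();
        indices_.clear();
        hasCurrent_ = false;
    }

private:
    std::vector<Vertex> vertices_;
    std::vector<std::uint32_t> indices_;
    bool hasCurrent_ = false;
};
```

          Because separate polylines become independent GL_LINES segments, any number of them can share one draw call as long as they use the same pipeline state.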

          However, after solving this, another problem arose with glMapBufferRange. Obviously I provide the GL_MAP_UNSYNCHRONIZED_BIT, and when the buffer gets full I map with GL_MAP_INVALIDATE_BUFFER_BIT (a.k.a. orphan it).


          For a few seconds the performance is ok, but then suddenly drops to 4 fps, the reason being glMapBufferRange (like it's waiting for something). Perhaps the multiple buffer/fencesync approach would be preferable... : /

          • Re: Reasons of glDrawElements(BaseVertex) having huge driver overhead?

            I mark this answer as correct, but I didn't really "solve" the problem...


            The GPU being flooded was "my fault" (or more precisely, it came from the difference between the DX and GL implementation logic).

            The other problem I mentioned can be traced back to the driver ignoring certain GL flags (like GL_MAP_UNSYNCHRONIZED_BIT). Orphaning only made things worse.


            So I did the following: I mapped the buffer with the unsynchronized flag, and when it got full I simply used a glFenceSync + glWaitSync. With this approach the performance is still worse than DX9, but at least it doesn't drop...
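
            The multiple buffer/fence idea mentioned earlier in the thread can be sketched as follows. This is a hedged illustration, not the code from the application: `RegionRing`, `FenceOps` and `kNoFence` are hypothetical names, and the fence operations are injected as callbacks so the logic can be followed (and tested) without a GL context. In a real renderer `insert` would wrap glFenceSync and `wait` would wrap a wait on that sync object. Note that glWaitSync only makes the GL server wait; blocking the CPU thread until a region is free takes glClientWaitSync.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

constexpr int kNoFence = -1;  // hypothetical stand-in for "no GLsync pending"

// Injected fence operations (assumed wrappers, not real GL signatures).
struct FenceOps {
    std::function<int()> insert;     // e.g. wraps glFenceSync
    std::function<void(int)> wait;   // e.g. wraps glClientWaitSync + glDeleteSync
};

// Splits one streaming buffer into equal regions, each guarded by a fence.
class RegionRing {
public:
    RegionRing(std::size_t regionSize, std::size_t regionCount, FenceOps ops)
        : regionSize_(regionSize),
          fences_(regionCount, kNoFence),
          ops_(std::move(ops)) {
        assert(regionSize > 0 && regionCount > 0);
    }

    // Returns the byte offset at which `bytes` may be written unsynchronized.
    std::size_t allocate(std::size_t bytes) {
        assert(bytes <= regionSize_);
        if (used_ + bytes > regionSize_) {
            // Region is full: fence the draws already issued from it ...
            fences_[current_] = ops_.insert();
            current_ = (current_ + 1) % fences_.size();
            used_ = 0;
            // ... and wait only if the next region may still be in use.
            if (fences_[current_] != kNoFence) {
                ops_.wait(fences_[current_]);
                fences_[current_] = kNoFence;
            }
        }
        const std::size_t offset = current_ * regionSize_ + used_;
        used_ += bytes;
        return offset;
    }

private:
    std::size_t regionSize_;
    std::vector<int> fences_;
    FenceOps ops_;
    std::size_t current_ = 0;
    std::size_t used_ = 0;
};
```

            The sketch assumes every draw that reads an allocation is issued before the next `allocate` call; otherwise the fence would not cover it. With enough regions the wait should rarely block, since the fence it waits on was inserted several regions ago.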


            Another thought (and I'm going to open another discussion): uniform buffer objects are INCREDIBLY SLOW (and I really mean it).

            What is my reasoning? As I started to refactor the 3D part of the application, I noticed that using uniform buffers with _every_ possible optimization is still about ten times slower than the original GL 2.1 glUniformXXX approach. This is ridiculous. No matter the round-robin approach, no matter the fences, the driver _still_blocks_in_that_<flower>_glMapBufferRange_ ...