I'm currently working on refactoring a complex renderer that is most likely CPU/driver bound.
In a nutshell:
- all rendering is done on a background thread (consumer) which is fed through a blocking queue of "rendering tasks"
- the main thread (producer) issues lots of small batched draw calls like "moveto(a, b) - lineto(c, d) - lineto(e, f) - ... - flush()", inserting a task into the queue when necessary
- I designed the abstract interface to resemble modern graphics APIs (Metal/Vulkan), so I have buffers, textures, render passes and graphics pipelines
- the corresponding GL objects (VAOs, programs, framebuffers) are cached whenever possible, so GL object creation is minimized
- redundant GL state changes are not filtered out (with a few exceptions: framebuffer, VAO and program changes)
- buffer data uploads are optimized with GL_MAP_UNSYNCHRONIZED_BIT; should the buffer become full, I do a glFenceSync + glWaitSync (doesn't happen too often, though)
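For reference, the allocation logic behind the last bullet looks roughly like this. This is a simplified sketch with made-up names (`ring_alloc`, `head`, `must_sync` are mine, not the actual code); the point is just that the CPU only stalls on the fence when the write position wraps past data the GPU may still be reading:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring-buffer allocator for the streaming vertex buffer.
 * Returns the byte offset to write at and sets *must_sync when the
 * write would wrap - that is the rare point where the fence wait
 * happens before the region is reused via glMapBufferRange with
 * GL_MAP_UNSYNCHRONIZED_BIT. */
static size_t ring_alloc(size_t capacity, size_t *head, size_t bytes,
                         bool *must_sync)
{
    *must_sync = false;
    if (*head + bytes > capacity) { /* buffer "full": wrap around */
        *must_sync = true;          /* wait on the fence, then reuse */
        *head = 0;
    }
    size_t offset = *head;
    *head += bytes;
    return offset;
}
```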
Despite my efforts, the rendering is still incredibly slow. VTune shows that the majority of the CPU time is spent in glDrawElementsBaseVertex.
Same thing on Intel cards. It pretty much looks like a driver-limited case; the funny thing, however, is that the old DX9 implementation is about 200x faster (hell, even GDI is faster).
So my question is: what might cause a drawing command to have such overhead? Or in other words: what state changes should I look out for to avoid it?
UPDATE: I also tried buffer orphaning and the GL_STREAM_DRAW usage hint... no effect.
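To be clear about what "orphaning" means here, the upload path was changed to roughly the following (a sketch, not the exact code; `vbo` and `size` are placeholders). Re-specifying the storage with a NULL pointer lets the driver detach the old allocation still in flight on the GPU and hand back fresh memory, so the map should never stall:

```c
glBindBuffer(GL_ARRAY_BUFFER, vbo);
/* orphan: driver keeps the old storage alive for pending draws */
glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
/* ... write vertex data ... */
glUnmapBuffer(GL_ARRAY_BUFFER);
```

Draw-call timing was unchanged either way, which is why I suspect the cost is in glDrawElementsBaseVertex itself rather than in buffer synchronization.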