I have recently implemented a system which allows our game engine to use persistent, coherently mapped buffers in OpenGL. While this gives a massive speed boost when it comes to uploading immediate geometry such as debug shapes, UI and particles, it has some drawbacks. With this API you can buffer several frames ahead, since you write to a CPU-local pointer whose contents are then asynchronously passed to the GPU whenever there is a draw.
In some cases, one might want a uniform buffer which is updated synchronously, for example for per-frame variables like the view and projection matrices. This brings me to the problem at hand. I can either map, for example, the view and projection matrices using coherent-persistent mapping, which causes a lot of synchronization problems since the camera is updated once per frame while per-draw variables are updated just as they are rendered, or update the buffer in place. I chose the latter and implemented a way to update the buffer directly using glBufferSubData.
This is all fine in principle. I use a buffer object which is 1024 times (for testing purposes) as big as it needs to be, and I only call glBufferSubData on one segment per frame in a ring buffer fashion, so as to not stall the GPU. I then use glBindBufferRange to bind only the newly updated segment so that it may be used by the GPU. However, this is the output from PerfStudio 3.0:
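To make the setup concrete, here is a minimal sketch of the ring bookkeeping described above. `UniformRing` and `alignUp` are hypothetical names, not the engine's actual code. It assumes the segment stride is rounded up to the value reported by GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, since glBindBufferRange requires the offset of a uniform buffer range to be a multiple of it. The GL calls are shown only as comments so the snippet compiles without a context.

```cpp
#include <cassert>
#include <cstddef>

// Round value up to the next multiple of alignment.
inline std::size_t alignUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: tracks which aligned segment of one large uniform
// buffer the current frame should write to and bind.
struct UniformRing {
    std::size_t stride;     // payload size rounded up to the offset alignment
    std::size_t segments;   // number of segments in the buffer (1024 in the post)
    std::size_t index = 0;  // segment used by the current frame

    UniformRing(std::size_t payloadSize, std::size_t alignment, std::size_t count)
        : stride(alignUp(payloadSize, alignment)), segments(count) {
        assert(count > 0);
    }

    std::size_t totalSize() const { return stride * segments; }  // size to allocate
    std::size_t offset() const { return index * stride; }        // byte offset of current segment
    void advance() { index = (index + 1) % segments; }           // move on, wrapping at the end
};

// Per frame, with a current GL context (not executed here; ubo, bindingPoint
// and data are placeholders):
//   glBufferSubData(GL_UNIFORM_BUFFER, ring.offset(), payloadSize, &data);
//   glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, ubo, ring.offset(), payloadSize);
//   ring.advance();
```

With two 4x4 float matrices (128 bytes) and an alignment of 256, each segment occupies 256 bytes, so consecutive frames write at offsets 0, 256, 512 and so on before wrapping.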
As you can see, glBufferSubData is very slow (5.34 milliseconds) and always occurs at the beginning of a new frame. I thought this might be because glBufferSubData waits for all prior draw calls to finish, which implicitly causes synchronization each frame. But seeing as I only call glBufferSubData on a segment of memory which should not currently be in use, this result seems unnecessarily slow. I also tried using the same method for per-frame variables as I did for per-object variables, i.e. calling glFenceSync and glClientWaitSync to wait until the GPU has finished with a segment of memory before writing to it again. The only difference was that instead of glBufferSubData taking 5.34 milliseconds, glClientWaitSync takes its place.
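For clarity, the fence-based variant follows roughly the pattern below. This is a sketch under assumptions, not the engine's code: `SegmentFences` is a hypothetical class, and the fence operations are passed in as callbacks so the snippet runs without a GL context. With OpenGL, the insert callback would wrap glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0), and the wait callback would wrap glClientWaitSync followed by glDeleteSync.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch: one fence slot per ring segment. Fence is an opaque,
// default-constructible handle; with OpenGL it would be a GLsync.
template <typename Fence>
class SegmentFences {
public:
    SegmentFences(std::size_t segments,
                  std::function<Fence()> insertFence,
                  std::function<void(Fence)> waitAndDelete)
        : fences_(segments), pending_(segments, false),
          insert_(std::move(insertFence)), wait_(std::move(waitAndDelete)) {}

    // Call before writing to a segment. Waits only if the segment still has a
    // fence pending from the previous trip around the ring.
    void acquire(std::size_t segment) {
        assert(segment < pending_.size());
        if (pending_[segment]) {
            wait_(fences_[segment]);
            pending_[segment] = false;
        }
    }

    // Call after issuing the draws that read from the segment.
    void release(std::size_t segment) {
        assert(segment < pending_.size());
        fences_[segment] = insert_();
        pending_[segment] = true;
    }

private:
    std::vector<Fence> fences_;          // last fence placed for each segment
    std::vector<bool> pending_;          // true if that fence has not been waited on
    std::function<Fence()> insert_;      // e.g. wraps glFenceSync
    std::function<void(Fence)> wait_;    // e.g. wraps glClientWaitSync + glDeleteSync
};
```

The intent is that a wait can only happen once the ring wraps around to a segment that was released earlier, so with 1024 segments the fence being waited on should be many frames old.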