I'm trying to read back some buffer data (~56 KB) that has been generated on the GPU. To avoid stalls I'm cycling through 4 buffers, so that I'm reading data that was generated 4 frames ago.
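For readers following along, the slot bookkeeping for a 4-buffer ring like this can be sketched as below. This is only a sketch, not the code from the original post: `RING_SIZE`, `ring_slot`, `ring_has_data` and `ring_source_frame` are hypothetical names, and the GL calls are mentioned only in comments. It assumes each frame first reads its slot and then reuses the same slot as the new readback target, which is what gives the 4-frame delay.

```c
#include <stdbool.h>

#define RING_SIZE 4u  /* number of buffers in the ring (assumed, per the post) */

/* Slot used by a given frame. Per frame, the assumed order is:
 *   1. read the slot's contents (e.g. glGetBufferSubData or a map),
 *   2. queue this frame's readback into the same slot (e.g. glReadPixels
 *      with the slot's buffer bound to GL_PIXEL_PACK_BUFFER). */
static unsigned ring_slot(unsigned frame)
{
    return frame % RING_SIZE;
}

/* The slot only holds earlier results once the ring has wrapped once. */
static bool ring_has_data(unsigned frame)
{
    return frame >= RING_SIZE;
}

/* Frame whose results sit in this frame's slot before it is reused.
 * Only meaningful when ring_has_data(frame) is true. */
static unsigned ring_source_frame(unsigned frame)
{
    return frame - RING_SIZE;
}
```

With this layout, frame 9 reads slot 1, which was filled during frame 5.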
I've tried MapBufferRange with GL_MAP_READ_BIT, both alone and in combination with GL_MAP_UNSYNCHRONIZED_BIT. I've also tried GetBufferSubData. All combinations generate the same amount of stall as when I tested calling glFlush before the call, so I suspect the driver does a flush internally no matter what. (The stall is about 15-20 ms.)
Is the driver broken, or am I doing something wrong here?
If the PBO is actually allocated in video memory, then there will be a copy on map/GetBufferSubData.
When this happens, you get a stall because you have to wait for that copy to complete. To have no stalls, you need a PBO that is allocated in host memory instead of the local framebuffer.
Can you contact our devrel team for more information? You can put my name in CC.
Thanks for the reply.
I've tried creating the buffer with different usage hints (DYNAMIC_READ, DYNAMIC_DRAW, STATIC_READ, STREAM_READ, etc.) but they all behave the same. The buffer is only used as a ReadPixels destination from an FBO and then in the GetBufferSubData/MapBufferRange operation 4 frames later. I'm not sure what else I can do to make the buffer reside in host memory. Any suggestions?
Out of desperation I also tried copying to a different buffer first, one which is never involved in ReadPixels from the FBO, and then calling GetBufferSubData on that buffer. Same result, unfortunately. (i.e., FBO->Buffer1->Buffer2, wait 4 frames, GetBufferSubData(Buffer2))
Have you tried mapping with GL_MAP_UNSYNCHRONIZED_BIT and never calling unmap? Then you should have no synchronization at all.
Since the driver does not buffer more than 4 frames, that would be the fastest implementation anyway.
PS: Using the unsynchronized bit will force the allocation to be in system remote memory (the only memory type with reasonable performance for both GPU and CPU access that does not involve a copy).
Keeping the buffers perma-mapped works. Somehow I thought buffers couldn't be used in GL operations while they are mapped, so this never crossed my mind.
Do you think it's possible that the driver could pick the memory location from the buffer usage hints in the future (or figure it out dynamically when things stall on video->system copies)? The perma-mapping solution requires that developers do their own syncing (at least if they want performance), and that seems a bit unnecessary when the API is already designed to take care of it.
The driver is not required to actually throw an error when a mapped buffer is used in GL operations, because checking for it would involve way too much CPU validation.
In practice, many cases will work, as long as you understand how you need to synchronize. In the present case, you can use a sync object to make sure that frame N-4 is done; it will be easy and efficient.
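The sync-object suggestion above could be sketched roughly as follows: one fence per ring slot, inserted right after the readback is queued and polled before the CPU touches the data. This is a sketch under stated assumptions, not driver-recommended code. `FenceHooks`, `ReadbackRing`, `ring_mark_written` and `ring_try_consume` are hypothetical names, `void *` stands in for `GLsync`, and the real GL calls (`glFenceSync`, `glClientWaitSync`, `glDeleteSync`) are named only in comments so that the logic can be compiled and exercised without a GL context. The fake backend at the end exists purely for demonstration.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4

/* Hypothetical hooks, not part of OpenGL. Each would wrap the GL call
 * named in its comment; void* stands in for GLsync. */
typedef struct {
    void *(*insert_fence)(void);    /* glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0) */
    bool  (*is_signaled)(void *f);  /* glClientWaitSync(f, 0, 0) returned
                                       GL_ALREADY_SIGNALED or GL_CONDITION_SATISFIED */
    void  (*delete_fence)(void *f); /* glDeleteSync(f) */
} FenceHooks;

typedef struct {
    void *fence[RING_SIZE];         /* NULL: no readback pending in that slot */
} ReadbackRing;

/* Call right after queuing the readback into slot frame % RING_SIZE
 * (for example glReadPixels with that slot's buffer bound). */
static void ring_mark_written(ReadbackRing *r, const FenceHooks *h, unsigned frame)
{
    unsigned slot = frame % RING_SIZE;
    if (r->fence[slot] != NULL)
        h->delete_fence(r->fence[slot]);  /* drop a fence that was never consumed */
    r->fence[slot] = h->insert_fence();
}

/* Returns true if the slot's data is ready, i.e. the CPU can read it
 * without waiting on the GPU. Polls only; never blocks. */
static bool ring_try_consume(ReadbackRing *r, const FenceHooks *h, unsigned slot)
{
    void *f = r->fence[slot];
    if (f == NULL || !h->is_signaled(f))
        return false;
    h->delete_fence(f);
    r->fence[slot] = NULL;
    return true;
}

/* Fake backend, for demonstration only: a fence records its submission
 * number and counts as signaled once fake_gpu_done reaches it. */
static int fake_fences[16];
static int fake_submitted = 0;
static int fake_gpu_done  = 0;

static void *fake_insert(void)
{
    int *f = &fake_fences[fake_submitted % 16];
    *f = ++fake_submitted;
    return f;
}
static bool fake_signaled(void *f) { return *(int *)f <= fake_gpu_done; }
static void fake_delete(void *f)   { (void)f; }
```

In a real renderer the three hooks would be thin wrappers around the GL calls named in the comments. When `ring_try_consume` returns false, the application can skip the read for that frame instead of stalling.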
About the usage hints: we have been forced to ignore them because many applications got them wrong. Our implementation now dynamically moves the object based on usage (GPU/CPU access).