3 Replies Latest reply on Mar 2, 2015 10:26 AM by jtrudeau

    Transform feedback objects not working as intended?




      I've been working on improving character rendering performance by performing the skinning once in a vertex shader using transform feedback, and then rendering the skinned version of the mesh each time it's required (deferred, shadows, picking). With the old GL 3.3 API you would use an ordinary OpenGL buffer bound to GL_TRANSFORM_FEEDBACK_BUFFER, do glBeginTransformFeedback -> draw -> glEndTransformFeedback, query the number of vertices written, then bind the same buffer as GL_ARRAY_BUFFER and render that number of vertices. With the API provided in GL 4.0, we can instead use transform feedback objects, which keep track of how many vertices were written by a transform feedback draw. That is nice because we don't have to stall the driver to read the count back, which was the intended purpose, if I understood it correctly.


      Here's the problem: only the first updated mesh is rendered when using the following code (which runs once for each mesh I want to update):


      glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, fb->GetOGL4TransformFeedback());
      glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ib->GetOGL4IndexBuffer());
      glDrawElementsBaseVertex(primType, this->primitiveGroup.GetNumIndices(), indexType, NULL, this->primitiveGroup.GetBaseVertex());
      glBindTransformFeedback(GL_TRANSFORM_FEEDBACK, 0);


      The transform feedback is unique for every model I intend to render, and it has its own buffer linked to it as well. When rendering using:


      glDrawTransformFeedback(primType, fb->GetOGL4TransformFeedback());


      for all intended models, only the first gets rendered, which I assume is because the transform feedback has not completed when the draw command arrives. I make this assumption because any call to glFlush, or any query object read (even using GL_QUERY_RESULT_NO_WAIT), results in all models rendering. It also obviously has a major performance impact, because every update of a feedback buffer (once per animated character) becomes a synchronization point. I've tried this on an nVidia card too (790), and it doesn't produce the same result; all objects are rendered as they should be. However, the nVidia driver seems to implicitly enforce a synchronization which, much to my disappointment, results in the same sluggish performance as using glFlush. Is this to be expected? Right now it's actually faster to perform the skinning each time I need to do the actual rendering. I can only assume I'm doing something wrong here, but what is it?


      Thanks in advance!

        • Re: Transform feedback objects not working as intended?

          Sorry for the delay. I try to focus the engineers here on problems/issues/questions with our tools and SDKs. So general "programming advice" I try to leave to the community. But, this has been sitting for a while, so I checked with someone internally, asked them to take a look. Here's what I got back.


          My recommendation to this user is to not do this at all. The overhead of writing all the data out to memory in transform feedback, then reading it all back in again on the subsequent draw, plus the possibility of a pipeline stall as a result will dwarf the cost of just running the vertex shader twice unless the shader is extremely complex.

          Hope this helps.

            • Re: Re: Transform feedback objects not working as intended?

              Okay, thanks for the feedback!


              It doesn't quite explain why transform feedback objects have to be manually synced with a glFlush in order to work. But I will refrain from using that API if it doesn't come highly recommended.


              I've been looking into a more efficient way to handle this entire process, and I thought that maybe the rendering was slow because, for each draw of a character (one for each shadow casting light, 4 times for CSM, then picking, color, etc.), the engine was uploading the entire set of joint matrices to the GPU. What I find is that the actual draw call takes almost no time at all. So I implemented a way to update shader storage buffers using persistently mapped buffers, made them triple buffered so that I wouldn't have to stall there either, and fed the joint matrices to the GPU that way. Here comes the interesting part. PerfStudio says that my frame time is somewhere around 10 ms for 215 characters, which is fair since one frame consists of a great deal more than just the characters. My actual frame rate, on the other hand, isn't close to this; it's somewhere around 20 FPS (according to the engine's timing system), and it's clearly noticeable. What the engine shows me is that the last post effect (which renders to the back buffer) takes about 40 ms. I would assume this is because the GPU can't finish the frame in time and has to wait for all commands to be executed before it can present to the screen. Doesn't this indicate that the skinning is a computationally complex shader, or is there something less intuitive lurking about? The profiler in PerfStudio says:

              Application is bound by draw operations, but the GPU is not fully utilized.

              Draw calls = 675
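              A minimal sketch of the ring arithmetic behind a triple-buffered, persistently mapped buffer like the one described above. The names JointRing and AlignUp, the 64 joints per character, and the 256-byte alignment in the usage note are assumptions for illustration and are not taken from the engine. The GL side (glBufferStorage with GL_MAP_PERSISTENT_BIT, glMapBufferRange, glBindBufferRange, and one glFenceSync per region) is deliberately left out so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds `size` up to the next multiple of `alignment` (alignment must be > 0).
constexpr std::size_t AlignUp(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: one buffer split into three regions, one per frame in
// flight, so the CPU never writes into a region the GPU may still be reading.
struct JointRing {
    static constexpr std::size_t kRegionCount = 3;  // triple buffering
    std::size_t regionSize;                         // bytes per region, aligned

    // payloadBytes:    joint data written per frame.
    // offsetAlignment: assumed to be the value queried with
    //                  GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT.
    JointRing(std::size_t payloadBytes, std::size_t offsetAlignment)
        : regionSize(AlignUp(payloadBytes, offsetAlignment)) {
        assert(offsetAlignment > 0);
    }

    // Total size to allocate for the whole buffer.
    std::size_t TotalSize() const { return regionSize * kRegionCount; }

    // Region used by a given frame number; wraps every kRegionCount frames.
    std::size_t RegionIndex(std::uint64_t frame) const {
        return static_cast<std::size_t>(frame % kRegionCount);
    }

    // Byte offset to write at (and to bind as the range start) for a frame.
    std::size_t RegionOffset(std::uint64_t frame) const {
        return RegionIndex(frame) * regionSize;
    }
};
```

              For 215 characters with an assumed 64 joints each and one 64-byte 4x4 float matrix per joint, JointRing ring(215 * 64 * 64, 256) gives a region size of 880640 bytes. Frames 0, 1 and 2 then write at offsets 0, 880640 and 1761280, and frame 3 wraps back to offset 0. A fence per region would still be needed before reusing it, since the ring only reduces the chance of a stall rather than removing the need for synchronization.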

              I also noticed that PerfStudio showed a long stretch of time between the last draw call and the point where SwapBuffers takes place.



              What is happening here? I have checked to make sure that the engine itself doesn't do anything to postpone this operation, and the gap only seems to appear whenever I render using a slightly more complex vertex shader.
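              A quick sanity check on the numbers quoted above, written as a small sketch. It assumes the "around 20" figure is frames per second and that the 10 ms is the GPU frame time PerfStudio reports; the function names are made up for illustration.

```cpp
// Converts a frame rate in frames per second to a frame time in milliseconds.
constexpr double FrameTimeMs(double fps) {
    return 1000.0 / fps;
}

// Portion of each frame that the reported GPU time does not account for.
constexpr double UnaccountedMs(double fps, double gpuMs) {
    return FrameTimeMs(fps) - gpuMs;
}
```

              With 20 FPS and 10 ms of reported GPU time, UnaccountedMs(20.0, 10.0) comes out at 40 ms, which matches the roughly 40 ms the engine attributes to the last post effect. That is consistent with the 40 ms being time spent waiting rather than the cost of the post effect itself, though it does not say what is being waited on.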


              Is it perhaps faster to render to a frame buffer and then blit it to the back buffer? Any suggestions on how to mitigate this problem are greatly appreciated.


              Edit: It just hit me that I might just be rendering too much geometry. I will try with a simpler character model.


              Edit 2: Okay, it turned out that a simpler character rendered much faster. I could easily render 256 without any performance issues. I am sorry for not even considering this simple and obvious issue...