4 Replies Latest reply on Oct 17, 2013 7:14 PM by lsalzman

    Massive CPU stalls changing shaders that access render target 3D textures in OpenGL


      So, I am the developer of the game/engine Tesseract (http://tesseract.gg)


      The basic background info to understand: it uses a tiled deferred rendering setup on the OpenGL Core (3.0+) profile. The tile shader can batch up to 8 lights at once, plus a sunlight with indirect light sampling from a cascaded 3D texture.


      Up to Catalyst 13.4 or so (the last version I had on my laptop with a 7340), everything was working fine. All other platforms, including Nvidia and Intel on both Windows and Linux, were working fine and are still working fine.


      Then a user running Catalyst 13.9 on a 7770 reported massive slow-downs. So I upgraded to 13.9 on my 7340 laptop, and the same thing occurred there as well. I tracked this down and found a significant latency per frame entirely on the CPU (20-30ms!). I've boiled it down to the following minimal necessary conditions to cause it:


      1. Create a 3D texture (the format seems irrelevant, but RGBA8 for the moment; default size 32^3).
      2. Bind it to a framebuffer. No rendering to that texture even seems to be necessary.
      3. Use a fragment shader that samples that texture; it doesn't need to do anything else.
      4. Issue a draw call.


      This causes a CPU stall whose duration depends on the size of the texture: a larger texture means a larger stall. The stall is entirely on the CPU, as GPU timers show no increase regardless of texture size. Just a few shader switches are enough to tank the framerate to unplayable levels.


      This has destroyed the usefulness of my tiled rendering setup on AMD GPUs, and I have found no viable workaround other than not using tiled rendering at all. If I push the access to this 3D texture into a single fullscreen pass, it sort of works, but it means more superfluous accesses to the g-buffer/lighting passes and pushes rendering times up even more than I would like. Upon further inspection, while that mitigates a lot of the cost, there still seems to be significant latency (not quite 30ms, but up to maybe 5-7ms on the CPU) from changing the tile shader only a few tens of times per frame at most. This did not seem to exist on prior Catalysts and is not present on any other GPU vendor I have tried.


      It would be nice if the 3D texture stall could be fixed at least, as a 30ms latency there for switching a shader a few times does not seem right. It would be even nicer if the overall latency of changing shaders that access render-target textures could be fixed in general...

        • Re: Massive CPU stalls changing shaders that access render target 3D textures in OpenGL

          I should clarify, it is the repeated act of:


          Change to shader that uses the 3D texture.

          Issue draw.


          That incurs the stall. You only need to bind the 3D texture to a framebuffer object once, and never again. No need to change it or anything. Ever thereafter, on any subsequent frame, without that texture being modified at all, any shader change + draw call will incur the CPU stall so long as it uses that texture...

          • Re: Massive CPU stalls changing shaders that access render target 3D textures in OpenGL



            Just to be clear, binding the 3D texture to a framebuffer object will cause a stall when that 3D texture is sampled from, even if that FBO is not bound? Is the texture bound for reading and writing at the same time?


            Could you supply us with a test case that shows the issue?





              • Re: Massive CPU stalls changing shaders that access render target 3D textures in OpenGL

                Note that to even attach a 3D texture to an FBO, you have to bind the FBO. But we can assume the following sequence:


                glGenTextures(1, &tex);

                glBindTexture(GL_TEXTURE_3D, tex);

                glTexImage3D(GL_TEXTURE_3D, ...);

                glGenFramebuffers(1, &fbo);

                glBindFramebuffer(GL_FRAMEBUFFER, fbo);

                glFramebufferTexture3D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_3D, tex, 0, 0);

                glBindFramebuffer(GL_FRAMEBUFFER, 0);


                /* assume variants on a fragment shader like the following, where the
                   tex sampler is bound to texture unit 0 for this example:

                varying vec3 texcoord;

                uniform sampler3D tex;

                void main(void) { gl_FragColor = texture3D(tex, texcoord); }

                */


                glBindTexture(GL_TEXTURE_3D, tex);

                for(...) { glUseProgram(program); glDrawElements(...); }


                Note that I tested entirely commenting out the code that did any actual rendering to the FBO; the stall still occurred. When I then skipped attaching this texture to the FBO, the problem went away (shader unmodified). I commented out just about everything in the shader until it was essentially as above, just sampling the 3D texture, and it still happened. I left everything in but got rid of the 3D texture, and it went away. All indications point to this being caused by essentially the above.


                As tested on 13.4 and before, this was fine. On 13.9, it caused the latencies noted above. Just to check, I tried the 13.11 beta, and the issue appears to be fixed there already. Does that mean I can assume it won't be present in 13.11+ going forward, or is there the potential that it might crop up again by the time 13.11 sees release, so that you still need an internal test case to verify? I am uncertain...


                As for a test case, I don't have a minimal one per se, but here is a way to at least quickly see the issue until or unless a simpler test case can be made:


                Get the nightly build of Tesseract here: http://mappinghell.net/tesseract/tesseract-nightly.zip

                Extract, run tesseract.bat.

                Click Load Map, then click on the map "complex".

                Enable timers in the HUD by typing: /timer 1


                You will then see a "deferred shading (cpu)" stat showing the measured CPU time of the deferred shading pass that is at issue here. The shader code that samples the 3D texture can be put into its own prepass by typing:


                /batchsunlight 0


                Doing this, you can observe that the latency is greatly reduced, because the shader that samples the 3D texture is bound only once during the deferred shading pass, which then draws it as a single fullscreen pass.


                To re-enable the default behavior, where the texture is accessed in each tile shader invocation, just type:


                /batchsunlight 2