
aqnuep
Journeyman III

Are GLSL switch-case statements optimized as jump tables?

Hi,

I would like to implement an OpenGL-based deferred renderer that uses multiple materials, where the material index of each fragment comes from one of the textures of the G-buffer.

The classical way to solve this is to use a switch-case construct like the following:

switch (materialID) {

    case 1:
        // do material #1 stuff
        break;

    case 2:
        // do material #2 stuff
        break;

    .........

}

Now, with OpenGL 4.0 we are in a better situation as we can solve the same thing with subroutine uniforms by putting all possible subroutines in an array and then simply index it:

subroutine void RoutineType();   // the subroutine type itself also has to be declared

subroutine uniform RoutineType DoMaterial[100];

.........

DoMaterial[materialID]();

The problem with this solution is that each time you change shaders you have to respecify a potentially large number of subroutine indices (and they are always the same). This is a rather silly API limitation (I don't think it has anything to do with HW support, but maybe I'm wrong).
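Just to illustrate the amount of API work involved, this is roughly what the host side looks like in my code (only a sketch: the Material0..Material99 subroutine names and the GLEW header are placeholders for whatever is actually used, error checking omitted):

#include <stdio.h>
#include <GL/glew.h>

#define MATERIAL_COUNT 100

static GLuint materialIndices[MATERIAL_COUNT]; // cached once after linking

// Query the subroutine index of every material function once per program.
void cacheMaterialSubroutines(GLuint program)
{
    char name[32];
    for (int i = 0; i < MATERIAL_COUNT; ++i) {
        sprintf(name, "Material%d", i); // hypothetical subroutine function names
        materialIndices[i] = glGetSubroutineIndex(program, GL_FRAGMENT_SHADER, name);
    }
}

// Every program switch has to re-upload the whole array, because the subroutine
// uniforms are reset on UseProgram. This assumes DoMaterial is the only
// subroutine uniform of the stage, so its elements occupy locations 0..99.
void useProgramWithMaterials(GLuint program)
{
    glUseProgram(program);
    glUniformSubroutinesuiv(GL_FRAGMENT_SHADER, MATERIAL_COUNT, materialIndices);
}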

Because I'm worried about the overhead of respecifying all the subroutines (possibly several subroutine arrays: one for materials, one for lights and so on), I was wondering whether the switch-case construct might also run in constant time, at least on HD5000 series cards, if the driver optimized it into a jump table the same way that CPU compilers like GCC do.

So my questions are the following:

Are GLSL switch-case statements optimized as jump tables on Evergreen GPUs in order to be executed in constant time? If yes, are they also optimized on earlier GPU generations?

Are glUniformSubroutines calls expensive operations if I have to pass hundreds of subroutine indices?

Is there any other possibility to solve the same issue in constant time?

pboudier
Staff

you can look at GPU ShaderAnalyzer to verify the generated code, but the switch-statement to jump-table optimization is not done by our driver.

if you use subroutines, could you set them all up at once, instead of switching them for each shader?

 

another approach is to use texture arrays if this can represent some of the work you need to perform per light/material.
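for example (just a sketch; size and format are arbitrary), one layer per material, selected in the shader with materialID:

GLuint createMaterialTextureArray(void)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
    // 256x256 texels, 100 layers (one per material); per-layer data uploaded later
    glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, 256, 256, 100, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    // in GLSL: sampler2DArray materialTex; texture(materialTex, vec3(uv, float(materialID)))
    return tex;
}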

Pierre B.

aqnuep
Journeyman III

you can look at GPU ShaderAnalyzer to verify the generated code, but the switch-statement to jump-table optimization is not done by our driver.


Thanks for the idea, I'll check it that way next time. However, if it is possible (which I'm pretty sure it is on HD5000+ GPUs), I would recommend doing such an optimization in the driver's GLSL compiler, as it could speed things up a lot (and if NVIDIA does nothing like that, then you can get an advantage).

if you use subroutines, could you set them all up at once, instead of switching them for each shader?


I don't quite understand this. Actually, configuring subroutine indices is an all-or-nothing decision for a particular shader stage.

If you mean whether I cache the array of indices between shader switches, then yes, I do; the only API call I issue when changing shaders is glUniformSubroutines. However, I don't know anything about its performance characteristics. Besides, you have to call it several times, once for each shader stage.

another approach is to use texture arrays if this can represent some of the work you need to perform per light/material.


I think you mean here e.g. putting the BRDFs into a cube map texture array and then simply fetching from it based on the material/light. While this would be quite fast, it would limit the shading code to what can be stored in textures and would not allow complete freedom in the lighting equation or in the number and type of its input attributes. So in my case this wouldn't be an acceptable trade-off.

pboudier
Staff

"the issue about relying on the driver GLSL compiler to optimize this switch, is that we would steal resource from the subroutine HW, which you could not use anymore.

 

if your issue with subroutines is this statement in the spec:

"When UseProgram is called, the subroutine uniforms for all shader stages are reset to arbitrarily chosen default functions with compatible subroutine types. When UseShaderProgramEXT is called, the subroutine uniforms for the shader stage specified by <type> are reset to arbitrarily chosen default functions with compatible subroutine types."
?
this API basically will upload those values to the GPU, but like any OGL API call, the devil is in how many times you call it.
Pierre B.

 

aqnuep
Journeyman III

the issue with relying on the driver GLSL compiler to optimize this switch is that we would steal resources from the subroutine HW, which you could not use anymore.


I understand that, but I assumed that subroutines are implemented in hardware as real function pointers and as such should not consume any hardware resources.

From this statement I gather that they are not implemented that way, but rather with some hardware magic. Why are they not implemented as function pointers? Are there performance issues with implementing such functionality on GPUs?

if your issue with subroutines is this statement in the spec:

........

this API basically will upload those values to the GPU, but like any OGL API call, the devil is in how many times you call it.



Yes, I meant that statement. It means that each time I use BindProgramPipeline, UseProgram or UseShaderProgramEXT, I have to call glUniformSubroutines with all the indices, once for every shader stage I have. This seems quite impractical to me.

Can you tell whether calling glUniformSubroutines e.g. 5 times (once for each shader stage) with the maximum number of subroutine indices (e.g. 256) is a heavyweight or a lightweight operation compared to BindProgramPipeline or UseProgram?

If, let's say, the 5 calls take several times less than switching the shaders, then I think it should not be a problem. But if the time the driver needs to update the subroutine uniform information is comparable to the time needed to switch shaders, then that completely defeats the purpose of subroutines, at least in my use case.
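In case it helps, this is roughly how I would measure it on my side (CPU time only, just a sketch: programs, vsIndices and fsIndices are assumed to be set up elsewhere, 256 assumes the stages really have that many active subroutine uniform locations, and clock_gettime is simply the timer I happen to use):

#include <time.h>

static double nowMs(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

// Time 1000 program switches including the per-stage subroutine re-uploads.
double measureSwitchCost(GLuint programs[2], const GLuint *vsIndices, const GLuint *fsIndices)
{
    double t0 = nowMs();
    for (int i = 0; i < 1000; ++i) {
        glUseProgram(programs[i & 1]); // alternate between two programs
        glUniformSubroutinesuiv(GL_VERTEX_SHADER, 256, vsIndices);
        glUniformSubroutinesuiv(GL_FRAGMENT_SHADER, 256, fsIndices);
        // ...same for the remaining stages...
    }
    glFinish(); // make sure the driver has processed the queued commands
    return nowMs() - t0;
}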

pboudier
Staff

 

yes, there can be magic hardware to make function calls go faster, which is why there is a limit on how many you can use.

 

in terms of relative cost:

- useprogram will trigger validation of the input streams, all textures (completeness checks), all uniforms, and eventually the FBO outputs, so it is quite heavy.

- uniformsubroutines will mostly be a copy of all the indices; 5x 256 dwords is about 5 KB, and at least that much data gets copied several times (update the hardware, update the cpu states, ...)

it is hard to predict which one will be faster though.

maybe the answer would be to avoid the "reset to default" behavior between shader switches.

Pierre B.

 

aqnuep
Journeyman III

maybe the answer would be to avoid the "reset to default" behavior between shader switches.


How can you avoid it?

The spec says that each time you switch shaders, everything is reset. So one cannot assume that the values are kept between shader switches.

What is even worse is that even if I change only the fragment shader, for example, I have to reload the subroutines of the other stages as well, at least in the case of UseProgram or BindProgramPipeline.

I was thinking about whether, rather than binding separate program pipelines, I should just update the program pipeline itself by attaching another shader program to it using UseProgramStages; however, I don't know how performance is affected by this. BTW, this raises another question: which of the following is recommended?

1. Having separate program pipeline objects for all shader program combinations and just bind them using BindProgramPipeline at runtime to replace the whole set, or

2. Having just one program pipeline object that is bound and just use UseProgramStages to update the currently bound program pipeline object with the proper shader programs.

From an API call point of view, BindProgramPipeline seems favorable, but it may require validation of all the shader stages.

On the other hand, UseProgramStages updates only the specified shader stages, so validating them should be faster; however, you have to update a bound object, which may incur some performance hit.
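To make the two options concrete, they would look roughly like this in code (sketch only; the pipeline and program objects are created elsewhere):

// Option 1: one pre-built pipeline object per shader program combination.
void bindCombination(GLuint pipeline)
{
    glBindProgramPipeline(pipeline);
}

// Option 2: a single pipeline object stays bound and is patched per batch.
void patchPipeline(GLuint pipeline, GLuint vertexProg, GLuint fragmentProg)
{
    glUseProgramStages(pipeline, GL_VERTEX_SHADER_BIT, vertexProg);
    glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, fragmentProg);
}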

pboudier
Staff

it would be valuable if you could provide some data showing what is the scale of the issue in your case. to avoid rebinding the subroutines, we could make it persistent state.

 

I don't have a generic answer to your other question; it really depends on how much over-validation would happen. (let's say that you use all 5 stages, and only ever change the PS, then just changing that one is better. if you always change everything, then UseProgram should be better)

aqnuep
Journeyman III

it would be valuable if you could provide some data showing what is the scale of the issue in your case.


Actually, I'm thinking of the following:

Every object has a configuration set which consists mainly of integers that are used to index into subroutine uniform arrays.

Each shader has a set of subroutines, each assigned to a particular index of a subroutine uniform array (the assignments are always the same, and every subroutine is reached at run time using indices into the subroutine uniform arrays).

In the vertex shader, every object can select its own subroutine from a set that executes the vertex transform (e.g. one subroutine may use just a simple modelview transformation, another may perform skeletal animation, etc.). Besides that, there may be other subroutine sets for other purposes.

In the fragment shader, every object can select its own subroutine from a set that performs the material-related calculations (e.g. one may use Phong illumination, another Ward illumination, a third some Fresnel-based material, etc.). Also, every light can select its own subroutine from a set that calculates the incoming light amount (e.g. point light, spot light, directional light, area light, etc.). Maybe other subroutine sets could come in handy as well.

Besides these, maybe other stages (e.g. geometry shader) can have some unique set of subroutines if needed.

As any shader invocation can use any of the subroutines (this is decided at run time using a lookup into the texture buffer holding the object-related or light-related subroutine indices), during a shader switch I would have to update all the arrays holding all the possible subroutine indices for all shader stages. These are all static data, meaning they do not change between frames, for example.
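For reference, the texture buffer with the per-object configuration would be set up roughly like this (sketch; objectConfig, objectCount and the four-integer layout are just examples):

GLuint createObjectConfigTBO(const GLuint *objectConfig, int objectCount)
{
    GLuint buf, tex;

    glGenBuffers(1, &buf);
    glBindBuffer(GL_TEXTURE_BUFFER, buf);
    // e.g. 4 integers per object: material index, transform index, ... (static data)
    glBufferData(GL_TEXTURE_BUFFER, objectCount * 4 * sizeof(GLuint),
                 objectConfig, GL_STATIC_DRAW);

    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_BUFFER, tex);
    glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32UI, buf);

    // in GLSL (roughly):
    //   uniform usamplerBuffer objectConfigs;
    //   uvec4 cfg = texelFetch(objectConfigs, objectID);
    //   DoMaterial[cfg.x]();  // and so on for the other subroutine sets
    return tex;
}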

This would allow incredible flexibility in how rarely state changes are needed, and huge batches could be grouped together, even objects of various types with various materials, etc.

Actually, this also means that I have to switch shaders only a few times, but in those cases I have to reload a large number of subroutine indices.

to avoid rebinding the subroutines, we could make it persistent state


How do you mean that? By changing the spec? That would actually be very nice. In fact, there wouldn't even be a need to change the rule that the subroutine states are reset; it would be enough to be able to initialize subroutine uniforms inside the shader code so that they always point to the same function. So if they could be initialized in shader code (even if they could not be changed afterwards), that would solve my problem.

let's say that you use all 5 stages, and only ever change the PS, then just changing that one is better. if you always change everything, then UseProgram should be better


So maybe I should just go with a single program pipeline object (GL_ARB_separate_shader_objects) and switch the attached shader programs using UseProgramStages. I was only worried that it might be inefficient to modify a currently bound program pipeline. Actually, if that is true, I wonder why the ARB decided to introduce program pipeline objects, as they are then just pure binding state objects.

pboudier
Staff

Actually, this also means that I have to switch shaders only a few times, but in those cases I have to reload a large number of subroutine indices.


this was the rationale for not having a better API for subroutines (we had raised that issue too when discussing this extension)
Pierre B.

 



aqnuep
Journeyman III

this was the rationale for not having a better API for subroutines (we had raised that issue too when discussing this extension)


I understand that, but I'm still not convinced that it was the right decision.

I suppose it was finally done this way because D3D11 does it similarly: there you actually have to pass the subroutine indices when you set the shader.
