mede

Horrible OpenGL performance on new macbook (Nvidia to Radeon)

Discussion created by mede on Jan 26, 2018
Latest reply on Feb 5, 2018 by mede

I recently got a new MacBook Pro to replace my 4-year-old model. I was looking forward to getting more performance for our virtual reality volume rendering project. Sadly, almost all of our shaders run at around half the speed they reach on the older Nvidia card in the 2014 MacBook.

Going by the specs alone, the Radeon card should be much faster: Radeon Pro 560 with 1024 shader cores @ 907 MHz vs. GeForce GT 750M with 384 shader cores @ 967 MHz.
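To put a number on "much faster", here is a rough back-of-the-envelope sketch. It assumes the usual peak estimate of shader cores x clock x 2 FLOPs per clock (one fused multiply-add); that factor and the helper name `peak_gflops` are assumptions for illustration, not vendor figures from this post. Under that assumption the Radeon comes out at roughly 2.5 times the theoretical throughput of the GeForce.

```python
# Rough peak FP32 estimate: shader cores x clock (MHz) x FLOPs per clock.
# The factor 2 (one fused multiply-add per core per clock) is an assumption.
def peak_gflops(shader_cores: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    return shader_cores * clock_mhz * flops_per_clock / 1000.0

radeon_pro_560 = peak_gflops(1024, 907)   # numbers from the spec line above
geforce_gt_750m = peak_gflops(384, 967)
ratio = radeon_pro_560 / geforce_gt_750m

print(f"Radeon Pro 560:  {radeon_pro_560:.0f} GFLOPS")
print(f"GeForce GT 750M: {geforce_gt_750m:.0f} GFLOPS")
print(f"Theoretical ratio: {ratio:.2f}x")
```

Peak numbers like these say nothing about driver or compiler quality, which is exactly where the measurements below disagree with the specs.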

 

I tried to find some of the largest performance gaps:

1. Volume Rotation

This simple fragment shader, which rotates a 3D texture, takes 30 ms on the Nvidia card and 1500 ms on the Radeon. That is 50 times slower!

The shader is used together with a geometry shader and the OpenGL call glDrawArraysInstanced to rotate all slices of the volume in a single draw call.

 

Fragment

uniform sampler3D uCube;
uniform mat4 uTransform;
in vec3     gTexCoord;
out vec4    FragColor0;
void main()
{
        FragColor0 = texture(uCube, (uTransform*vec4(gTexCoord, 1.0)).xyz).rgba;
}

 

Geometry

layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;
flat in int vInstanceID[3];
in vec2 vTexCoord[3];
out vec3 gTexCoord;
uniform int uInstanceScale;
void main(void)
{
    for (int i = 0; i < 3; ++i) {
        gl_Position = gl_in[i].gl_Position;
        gl_Layer = vInstanceID[i];
        gTexCoord = vec3(vTexCoord[i], (float(vInstanceID[i]) + 0.5)/uInstanceScale);
        EmitVertex();
    }
}

 

Vertex

in vec3 Position;
in vec2 TexCoord0;
flat out int vInstanceID;
out vec2 vTexCoord;
void main()
{
        vTexCoord = TexCoord0;
        vInstanceID = gl_InstanceID;
        gl_Position = vec4(Position, 1.0);
}
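As a small illustration of the third texture coordinate computed in the geometry shader above, the following sketch reproduces the expression (instance + 0.5) / uInstanceScale on the CPU. It assumes that uInstanceScale equals the number of slices (instances) drawn, which the post does not state explicitly; the function name `slice_center` is made up for this example.

```python
# Mirrors gTexCoord.z = (float(vInstanceID) + 0.5) / uInstanceScale from the geometry shader.
# Assumption: instance_scale is the number of slices rendered by glDrawArraysInstanced.
def slice_center(instance_id: int, instance_scale: int) -> float:
    return (float(instance_id) + 0.5) / instance_scale

# For a volume with 4 slices, each instance samples the centre of its own slice.
centers = [slice_center(i, 4) for i in range(4)]
print(centers)  # [0.125, 0.375, 0.625, 0.875]
```

The + 0.5 offset places each lookup at the centre of a slice rather than on the boundary between two slices.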

 

2. 3D Texture Arrays

Another large drawback for our project is the missing support for 3D texture arrays on ATI GPUs, i.e. indexing an array of sampler3D with a variable index.

The direct lookup is much faster than the switch workaround we need for ATI.

uniform sampler3D uTransmittance[NumSamplers];
vec4 getTransmittance(int index, vec3 pos) {
#ifdef VENDOR_ATI
    switch(index) {
        case 0: return texture(uTransmittance[0], pos);
        case 1: return texture(uTransmittance[1], pos);
        case 2: return texture(uTransmittance[2], pos);
        case 3: return texture(uTransmittance[3], pos);
        case 4: return texture(uTransmittance[4], pos);
        case 5: return texture(uTransmittance[5], pos);
        case 6: return texture(uTransmittance[6], pos);
        case 7: return texture(uTransmittance[7], pos);
        default: return vec4(0.0); // fallback so that every path returns a value
    }
#else
    return texture(uTransmittance[index], pos);
#endif
}

 

3. Dynamic Branching

The "normal" lighting shaders used for our 3D scene (a simple room) are also running much slower (half the speed). One large difference I found is in a switch statement we use to calculate diffuse portal lighting.

 

On the Nvidia hardware it makes no difference (speed-wise) whether I use the switch or just replace it with, e.g., the source block of case 1. On the Radeon the performance drop when using the switch is very large (4 times slower)! Is there some optimisation needed for the Radeon?

void ltcClipQuadToHorizon(inout vec3 L[5], out int n) {
    int config = 0;
    if (L[0].z > 0.0) config += 1;
    if (L[1].z > 0.0) config += 2;
    if (L[2].z > 0.0) config += 4;
    if (L[3].z > 0.0) config += 8;
    switch (config) {
    case 0:
        n = 0;
        break;
    case 1:
        n = 3;
        L[1] = -L[1].z * L[0] + L[0].z * L[1];
        L[2] = -L[3].z * L[0] + L[0].z * L[3];
        L[3] = L[0];
        break;
    case 2:
        n = 3;
        L[0] = -L[0].z * L[1] + L[1].z * L[0];
        L[2] = -L[2].z * L[1] + L[1].z * L[2];
        L[3] = L[0];
        break;
    case 3:
        n = 4;
        L[2] = -L[2].z * L[1] + L[1].z * L[2];
        L[3] = -L[3].z * L[0] + L[0].z * L[3];
        L[4] = L[0];
        break;

 

    // ... the remaining cases, up to 15, follow ...
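For readers who want to see which branch a given quad takes, here is a minimal CPU-side sketch of how the switch selector is built. It only mirrors the four if statements at the top of ltcClipQuadToHorizon; the function name `horizon_config` is made up for this example, and the elided cases are not reproduced.

```python
# Builds the same 4-bit selector as the shader: bit i is set when L[i].z > 0.
def horizon_config(z_values):
    config = 0
    for i, z in enumerate(z_values[:4]):
        if z > 0.0:
            config += 1 << i  # adds 1, 2, 4, 8 for L[0]..L[3]
    return config

print(horizon_config([1.0, -1.0, -1.0, -1.0]))  # 1: only L[0] is above the horizon
print(horizon_config([1.0, 1.0, -1.0, -1.0]))   # 3: L[0] and L[1] are above the horizon
print(horizon_config([1.0, 1.0, 1.0, 1.0]))     # 15: the whole quad is above the horizon
```

The selector takes one of 16 values (0 to 15), and it depends on per-fragment data, so neighbouring fragments can end up in different cases.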

 

Conclusion

There were already early concerns about Apple placing an AMD/ATI GPU into their pro models. As there is no DX on macOS, the GPU will only be used for OpenGL. Having now analysed the performance of this Radeon model, I am very disappointed. It is not only special shaders, perhaps optimised for the GeForce, that are lagging. All shaders, including simple ones doing standard texture rotation or gradient calculation, are slower ;(

 

Is there anything we are doing completely wrong, or is AMD just that bad with OpenGL?
