
    Horrible OpenGL performance on new MacBook (Nvidia to Radeon)

    mede

      I recently got a new MacBook Pro to replace my four-year-old model. I was looking forward to getting more performance for our virtual-reality volume-rendering project. But sadly, almost all of our shaders run at roughly half the speed they did on the older Nvidia card in the 2014 MacBook.

      Judging by the specs alone, the Radeon should be much faster: Radeon Pro 560 (1024 shader cores @ 907 MHz) vs. GeForce GT 750M (384 cores @ 967 MHz).

       

      I tried to isolate some of the largest performance gaps:

      1. Volume Rotation

      This simple fragment shader, which rotates a 3D texture, takes 30 ms on the Nvidia card but 1500 ms (!) on the Radeon: 50 times longer!
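
      For reference, one way to measure such per-pass GPU times is an OpenGL timer query. A minimal sketch in C, assuming an active GL 3.3+ context and a GL loader; drawVolumeRotationPass is a hypothetical stand-in for the instanced draw shown further below:

      #include <stdio.h>
      /* GL types and entry points come from a loader (e.g. glad or GLEW), assumed here. */

      GLuint   query;
      GLuint64 elapsedNs = 0;

      glGenQueries(1, &query);
      glBeginQuery(GL_TIME_ELAPSED, query);

      drawVolumeRotationPass();                 /* hypothetical: issues the instanced draw */

      glEndQuery(GL_TIME_ELAPSED);
      glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs); /* waits for the GPU */
      printf("pass took %.3f ms\n", elapsedNs / 1.0e6);
      glDeleteQueries(1, &query);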

      The fragment shader is used together with a geometry shader and the OpenGL call glDrawArraysInstanced to perform the rotation slice-wise in a single draw call (a host-side sketch follows the shader listings below).

       

      Fragment

      uniform sampler3D uCube;      // source volume
      uniform mat4 uTransform;      // rotation applied in texture space
      in vec3     gTexCoord;
      out vec4    FragColor0;
      void main()
      {
              // Sample the source volume at the rotated texture coordinate.
              FragColor0 = texture(uCube, (uTransform*vec4(gTexCoord, 1.0)).xyz).rgba;
      }

       

      Geometry

      layout(triangles) in;
      layout(triangle_strip, max_vertices = 3) out;
      flat in int vInstanceID[3];
      in vec2 vTexCoord[3];
      out vec3 gTexCoord;
      uniform int uInstanceScale;   // number of slices in the target volume
      void main(void)
      {
          for (int i = 0; i < 3; ++i) {
              gl_Position = gl_in[i].gl_Position;
              // Route this instance's triangle to its slice of the layered target.
              gl_Layer = vInstanceID[i];
              gTexCoord = vec3(vTexCoord[i], (float(vInstanceID[i]) + 0.5)/uInstanceScale);
              EmitVertex();
          }
      }

       

      Vertex

      in vec3 Position;
      in vec2 TexCoord0;
      flat out int vInstanceID;
      out vec2 vTexCoord;
      void main()
      {
              // Pass-through; the slice index comes from the instance ID.
              vTexCoord = TexCoord0;
              vInstanceID = gl_InstanceID;
              gl_Position = vec4(Position, 1.0);
      }
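
      For context, here is a minimal host-side sketch of the pass, assuming the destination 3D texture is attached as a layered framebuffer attachment; fbo, dstTex, rotateProgram, quadVao, transform, width, height and depth are illustrative names:

      /* Attach the whole 3D texture as a layered color target; the geometry
         shader's gl_Layer then selects the slice each instance writes to. */
      glBindFramebuffer(GL_FRAMEBUFFER, fbo);
      glFramebufferTexture(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, dstTex, 0);
      glViewport(0, 0, width, height);

      glUseProgram(rotateProgram);
      glUniformMatrix4fv(glGetUniformLocation(rotateProgram, "uTransform"), 1, GL_FALSE, transform);
      glUniform1i(glGetUniformLocation(rotateProgram, "uInstanceScale"), depth);

      glBindVertexArray(quadVao);                        /* full-slice quad, two triangles */
      glDrawArraysInstanced(GL_TRIANGLES, 0, 6, depth);  /* one instance per slice */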

       

      2. 3D Texture Arrays

      Another large drawback for our project is that ATI GPUs lack support for dynamically indexing an array of 3D textures.

      The direct lookup is far faster than the switch workaround needed for ATI:

      uniform sampler3D uTransmittance[NumSamplers];  // NumSamplers defined at compile time (8 here)
      vec4 getTransmittance(int index, vec3 pos) {
      #ifdef VENDOR_ATI
          // ATI/AMD path: the sampler array must be indexed with a constant,
          // so the lookup is unrolled by hand.
          switch(index) {
              case 0: return texture(uTransmittance[0], pos);
              case 1: return texture(uTransmittance[1], pos);
              case 2: return texture(uTransmittance[2], pos);
              case 3: return texture(uTransmittance[3], pos);
              case 4: return texture(uTransmittance[4], pos);
              case 5: return texture(uTransmittance[5], pos);
              case 6: return texture(uTransmittance[6], pos);
              case 7: return texture(uTransmittance[7], pos);
          }
          return vec4(0.0);  // out-of-range index; keeps the non-void function well defined
      #else
          return texture(uTransmittance[index], pos);
      #endif
      }
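
      For completeness, a minimal sketch of how the sampler array is fed from the host side, assuming NumSamplers is 8; program and transmittanceTex are illustrative names:

      /* Bind the eight volumes to texture units 0..7 and point the
         uTransmittance sampler array at those units. */
      GLint units[8];
      for (int i = 0; i < 8; ++i) {
          units[i] = i;
          glActiveTexture(GL_TEXTURE0 + i);
          glBindTexture(GL_TEXTURE_3D, transmittanceTex[i]);
      }
      glUseProgram(program);
      glUniform1iv(glGetUniformLocation(program, "uTransmittance"), 8, units);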

       

      3. Dynamic Branching

      The "normal" lighting shaders used for our 3D scene (a simple room) also run way slower (half the speed). I traced a large part of the difference to a switch statement we use to calculate diffuse portal lighting.

       

      On the Nvidia hardware it makes no speed difference whether I use the switch or simply replace it with, e.g., the body of case 1... On the Radeon the performance drop from the switch is very large (4 times slower)! Is some optimisation needed for the Radeon?

      void ltcClipQuadToHorizon(inout vec3 L[5], out int n) {
          // config encodes which of the quad's four corners lie above the horizon (z > 0)
          int config = 0;
          if (L[0].z > 0.0) config += 1;
          if (L[1].z > 0.0) config += 2;
          if (L[2].z > 0.0) config += 4;
          if (L[3].z > 0.0) config += 8;
          switch (config) {
          case 0:
              n = 0;
              break;
          case 1:
              n = 3;
              L[1] = -L[1].z * L[0] + L[0].z * L[1];
              L[2] = -L[3].z * L[0] + L[0].z * L[3];
              L[3] = L[0];
              break;
          case 2:
              n = 3;
              L[0] = -L[0].z * L[1] + L[1].z * L[0];
              L[2] = -L[2].z * L[1] + L[1].z * L[2];
              L[3] = L[0];
              break;
          case 3:
              n = 4;
              L[2] = -L[2].z * L[1] + L[1].z * L[2];
              L[3] = -L[3].z * L[0] + L[0].z * L[3];
              L[4] = L[0];
              break;

       

       // ... until case 15 following ...
          }  // end switch
      }  // end ltcClipQuadToHorizon
       

      Conclusion

      There were already early concerns when Apple put an AMD/ATI GPU into its pro models. As there is no DirectX on macOS, the GPU will only ever be used through OpenGL. Analysing the performance of this Radeon model now, I am very disappointed. It is not only special shaders, perhaps optimised for the GeForce, that are lagging; all shaders, including simple ones such as standard texture rotation or gradient calculation, are slower ;(

       

      Is there anything we are doing completely wrong, or is AMD just that bad with OpenGL???