2 Replies Latest reply on Jul 24, 2009 5:15 AM by dar1243

    How reliable are ShaderAnalyzer timings ?


      Quick question:

      How reliable are ShaderAnalyzer timings?

      I have an identical shader in GLSL and HLSL and I've run a performance analysis on it. Here are the results:


      Radeon HD 4870

      DX9 -> 3.50 cycles  |  4.57 pixels/cycle  | 68 ALU
      DX10-> 3.60 cycles  |  4.44 pixels/cycle  | 79 ALU
      GLSL-> 4.00 cycles  |  4.00 pixels/cycle  | 82 ALU

      Radeon HD 4670

      DX9 -> 4.37 cycles  |  1.83 pixels/cycle  | 68 ALU
      DX10-> 4.62 cycles  |  1.73 pixels/cycle  | 81 ALU
      GLSL-> 4.59 cycles  |  1.74 pixels/cycle  | 82 ALU


      It seems that DX9 is fastest, then DX10, with GLSL last, all for the same shader. (Is that possible at all? Will the shader really run much slower under GLSL in a real-life situation?)

      The shader uses 5 texture fetches and some ALU calculations. Input is 5 UV coords (float2), output is a single float4. There are no branches and no integer (or integer-like) operations; only mul/add/sqrt/pow/saturate/clamp are used.

      So why does the analysis differ so much between APIs?

      (For DX9/DX10 the optimization level is 3; for GLSL there is no such choice.)

      P.S. In NVIDIA's shader analysis tools, GLSL/DX9/DX10 performance is almost equal (GLSL is slightly better than the DX9/DX10 profiles).

        • How reliable are ShaderAnalyzer timings ?

          They use different front-end compilers (DX9, DX10 and GLSL).  Also, the optimizations performed by our shader compiler can differ for these shaders.


            • How reliable are ShaderAnalyzer timings ?

              So the GLSL front end should be far, far more efficient than it is now, because on average identical shaders in GLSL are about 25% slower than those in DX9 HLSL / DX10 HLSL (which IMHO is a HUGE difference).

              The min/max vs. clamp thing should help, but that is not the only issue. Another example:

              HLSL DX9 (PS_3_0)

              float4 main(float4 UV : TEXCOORD1) : COLOR0
              {
               float4 Result;
               Result.w = 1.0;
               Result.xyz = UV.xyz / sqrt(UV.w);

               return Result;
              }


              ; --------  Disassembly --------------------
              00 ALU: ADDR(32) CNT(5)
                    0  w: MOV         R1.w,  1.0f     
                       t: RSQ_e       ____,  |R0.w|     
                    1  x: MUL         R1.x,  R0.x,  PS0     
                       y: MUL         R1.y,  R0.y,  PS0     
                       z: MUL         R1.z,  R0.z,  PS0     
              01 EXP_DONE: PIX0, R1


              GLSL

              void main()
              {
               gl_FragColor.w   = 1.0;
               gl_FragColor.xyz = gl_TexCoord[1].xyz / sqrt(gl_TexCoord[1].w);
              }

              ; --------  Disassembly --------------------
              00 ALU: ADDR(32) CNT(6)
                    0  w: MOV         R1.w,  1.0f     
                       t: SQRT_e      ____,  R0.w     
                    1  t: RCP_e       ____,  PS0     
                    2  x: MUL_e       R1.x,  R0.x,  PS1     
                       y: MUL_e       R1.y,  R0.y,  PS1     
                       z: MUL_e       R1.z,  R0.z,  PS1     
              01 EXP_DONE: PIX0, R1


              Why the hell does the GLSL version not use the faster inverse-square-root opcode, and instead use two slower opcodes, sqrt and rcp?!

              P.S. All of the above is RV770 assembly.
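
One thing that may be worth trying on the GLSL side, as an untested sketch: GLSL has a built-in inversesqrt(), so the division by sqrt() can be written as a multiply by the inverse square root. Whether the GLSL front end then emits a single RSQ_e instead of SQRT_e + RCP_e is an assumption that would need to be checked in ShaderAnalyzer; the variable name invSqrtW is only illustrative.

```glsl
// Legacy-style fragment shader, same inputs/outputs as the GLSL example above.
// Assumption: writing the reciprocal square root explicitly may let the
// compiler pick the RSQ opcode; this has not been verified on RV770.
void main()
{
    // inversesqrt(x) is the GLSL built-in for 1.0 / sqrt(x).
    float invSqrtW = inversesqrt(gl_TexCoord[1].w);

    gl_FragColor.w   = 1.0;
    gl_FragColor.xyz = gl_TexCoord[1].xyz * invSqrtW;
}
```

If the disassembly for this variant shows RSQ_e followed by three MULs, the difference above comes from the front end not fusing the "x / sqrt(y)" pattern rather than from a missing opcode.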