
dar1243 · ‎07-16-2009

Quick question:

How reliable are ShaderAnalyzer timings?

I have an identical shader in GLSL and HLSL, and I've done a performance analysis on it. Here are the results:

It seems that DX9 is fastest, then DX10, and GLSL is slowest for the same shader. Is that possible at all? Will the shader really run much slower as GLSL in a real-life situation?

The shader uses 5 texture fetches and some ALU calculations. Input is 5 UV coordinates (float2), output is a single float4. There are no branches and no integer (or integer-like) operations; only mul/add/sqrt/pow/saturate/clamp are used,

so why does the analysis differ so much between APIs?

(For DX9/DX10 the optimization level is 3; for GLSL there is no such option.)

PS. In NVIDIA's shader analysis tools, GLSL/DX9/DX10 performance is almost equal (GLSL is slightly better than the DX9/DX10 profiles).

bpurnomo · ‎07-23-2009

They use different front-end compilers (DX9, DX10 and GLSL). Also, the optimizations performed by our shader compiler can differ between these shaders.

dar1243 · ‎07-24-2009

So the GLSL front end should be far, far more efficient than it is now, because on average identical shaders in GLSL are about 25% slower than those in DX9 HLSL / DX10 HLSL (which IMHO is a HUGE difference).

The min/max vs. clamp thing should help, but that is not the only issue. Another example:

HLSL DX9 (PS_3_0)

float4 main(float4 UV : TEXCOORD1) : COLOR0
{
    float4 Result;
    Result.w = 1.0;
    Result.xyz = UV.xyz / sqrt(UV.w);

    return Result;
}

microcode:

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(5)
      0 w: MOV         R1.w, 1.0f
         t: RSQ_e       ____, |R0.w|
      1 x: MUL         R1.x, R0.x, PS0
         y: MUL         R1.y, R0.y, PS0
         z: MUL         R1.z, R0.z, PS0
01 EXP_DONE: PIX0, R1
END_OF_PROGRAM

GLSL

void main()
{
    gl_FragColor.w = 1.0;
    gl_FragColor.xyz = gl_TexCoord[1].xyz / sqrt(gl_TexCoord[1].w);
}

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(6)
      0 w: MOV         R1.w, 1.0f
         t: SQRT_e      ____, R0.w
      1 t: RCP_e       ____, PS0
      2 x: MUL_e       R1.x, R0.x, PS1
         y: MUL_e       R1.y, R0.y, PS1
         z: MUL_e       R1.z, R0.z, PS1
01 EXP_DONE: PIX0, R1
END_OF_PROGRAM

Why the hell does the GLSL version not use the faster inverse-square-root opcode (RSQ), and instead use two slower opcodes, SQRT and RCP?!

PS. All of the above is RV770 assembly.
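
A possible workaround for the GLSL case, as an untested sketch: GLSL has a built-in inversesqrt(), so the division by sqrt() can be written as a multiply by the reciprocal square root. Whether the GLSL front end then emits a single RSQ_e on RV770 is an assumption that has not been checked in ShaderAnalyzer here; the variable name invRoot is only illustrative.

```glsl
// Hypothetical rewrite of the GLSL fragment shader above.
// inversesqrt() is a standard GLSL built-in; that it compiles to one
// RSQ_e instead of SQRT_e + RCP_e is an assumption, not a measured result.
void main()
{
    // 1.0 / sqrt(w), computed once and shared by all three components
    float invRoot = inversesqrt(gl_TexCoord[1].w);

    gl_FragColor.w   = 1.0;
    gl_FragColor.xyz = gl_TexCoord[1].xyz * invRoot;
}
```

Mathematically this matches the original expression for w > 0. Comparing the disassembly of the two versions would show whether the compiler treats them differently.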