Quick question:
How reliable are ShaderAnalyzer timings?
I have an identical shader in GLSL and HLSL and I've done a performance analysis on it. Here are the results:
Radeon HD 4870
DX9  -> 3.50 cycles | 4.57 pixels/cycle | 68 ALU
DX10 -> 3.60 cycles | 4.44 pixels/cycle | 79 ALU
GLSL -> 4.00 cycles | 4.00 pixels/cycle | 82 ALU
Radeon HD 4670
DX9  -> 4.37 cycles | 1.83 pixels/cycle | 68 ALU
DX10 -> 4.62 cycles | 1.73 pixels/cycle | 81 ALU
GLSL -> 4.59 cycles | 1.74 pixels/cycle | 82 ALU
It seems that DX9 is fastest, then DX10, and GLSL comes last on the same shader. Is that possible at all? Will the shader really run much slower in GLSL in a real-life situation?
The shader uses 5 texture fetches and some ALU calculations. The input is 5 UV coords (float2) and the output is a single float4. There are no branches and no integer (or integer-like) operations; only mul/add/sqrt/pow/saturate/clamp are used.
So why does the analysis differ so much between APIs?
(For DX9/DX10 the optimization level is 3; for GLSL there is no such choice.)
PS. In NVIDIA's shader analysis tools, GLSL/DX9/DX10 performance is almost equal (GLSL is slightly better than the DX9/DX10 profiles).
The three paths use different front-end compilers (DX9, DX10 and GLSL). Also, the optimizations performed by our shader compiler can be different for these shaders.
So the GLSL front end should be far, far more efficient than it is now, because on average identical shaders in GLSL are about 25% slower than those in DX9 HLSL / DX10 HLSL (which IMHO is a HUGE difference).
The min/max vs. clamp thing should help, but that is not the only issue. Another example:
HLSL DX9 (PS_3_0)
float4 main(float4 UV : TEXCOORD1) : COLOR0
{
    float4 Result;
    Result.w = 1.0;
    Result.xyz = UV.xyz / sqrt(UV.w);
    return Result;
}
microcode:
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(5)
0 w: MOV R1.w, 1.0f
t: RSQ_e ____, |R0.w|
1 x: MUL R1.x, R0.x, PS0
y: MUL R1.y, R0.y, PS0
z: MUL R1.z, R0.z, PS0
01 EXP_DONE: PIX0, R1
END_OF_PROGRAM
GLSL
void main()
{
    gl_FragColor.w = 1.0;
    gl_FragColor.xyz = gl_TexCoord[1].xyz / sqrt(gl_TexCoord[1].w);
}
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(6)
0 w: MOV R1.w, 1.0f
t: SQRT_e ____, R0.w
1 t: RCP_e ____, PS0
2 x: MUL_e R1.x, R0.x, PS1
y: MUL_e R1.y, R0.y, PS1
z: MUL_e R1.z, R0.z, PS1
01 EXP_DONE: PIX0, R1
END_OF_PROGRAM
Why the hell does the GLSL version not use the faster inverse square root opcode (RSQ), and instead use two slower opcodes, SQRT and RCP?!
PS. All of the above is RV770 assembly.
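A possible workaround, offered only as a sketch: GLSL has a built-in inversesqrt(), so the GLSL shader above can ask for the reciprocal square root directly instead of dividing by sqrt(). Whether the GLSL front end then emits a single RSQ opcode is an assumption that has to be checked in the ShaderAnalyzer disassembly; the variable name invRoot is just illustrative.

```glsl
// Variant of the GLSL shader above that requests 1/sqrt(w) explicitly.
// Assumption (not verified here): the compiler maps inversesqrt() to one
// RSQ-style opcode instead of the SQRT + RCP pair shown in the disassembly.
void main()
{
    // invRoot is a hypothetical helper name, not part of the original shader.
    float invRoot = inversesqrt(gl_TexCoord[1].w);

    gl_FragColor.w = 1.0;
    // Multiplying by 1/sqrt(w) is mathematically the same as dividing by sqrt(w).
    gl_FragColor.xyz = gl_TexCoord[1].xyz * invRoot;
}
```

Note that the DX9 disassembly takes RSQ of |R0.w|, so results for a negative w may differ between the two versions; inversesqrt() is undefined for non-positive inputs in GLSL.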