dar1243
Journeyman III

How reliable are ShaderAnalyzer timings?

Quick question:

How reliable are ShaderAnalyzer timings?

I have an identical shader in GLSL and HLSL, and I've run a performance analysis on it. Here are the results:

 

Radeon HD 4870

API   | Cycles | Pixels/cycle | ALU
DX9   | 3.50   | 4.57         | 68
DX10  | 3.60   | 4.44         | 79
GLSL  | 4.00   | 4.00         | 82

Radeon HD 4670

API   | Cycles | Pixels/cycle | ALU
DX9   | 4.37   | 1.83         | 68
DX10  | 4.62   | 1.73         | 81
GLSL  | 4.59   | 1.74         | 82

 

It seems that DX9 is the fastest on the same shader. On the 4870, DX10 comes next and GLSL is last; on the 4670, DX10 and GLSL are roughly tied. Is that possible at all? Will the shader really run much slower with GLSL in a real-life situation?
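For a sense of scale, the tables can be turned into relative numbers. A minimal Python sketch using only the values above: on the HD 4870 the GLSL version needs about 14% more cycles than DX9 (4.00 vs 3.50), and on the HD 4670 about 5% more (4.59 vs 4.37). The sketch also assumes, from the data alone and not from any ShaderAnalyzer documentation, that the pixels/cycle column is just a per-GPU constant divided by the cycle count. The product comes out at roughly 16 for the 4870 and 8 for the 4670, which would mean that column carries no independent information.

```python
# Timings copied from the ShaderAnalyzer tables above: (cycles, pixels/cycle).
TIMINGS = {
    "Radeon HD 4870": {"DX9": (3.50, 4.57), "DX10": (3.60, 4.44), "GLSL": (4.00, 4.00)},
    "Radeon HD 4670": {"DX9": (4.37, 1.83), "DX10": (4.62, 1.73), "GLSL": (4.59, 1.74)},
}


def extra_cycles_pct(cycles, baseline):
    """Extra cycles relative to a baseline, in percent."""
    return (cycles / baseline - 1.0) * 100.0


def implied_constant(cycles, pixels_per_cycle):
    """Assumption inferred from the tables, not from ShaderAnalyzer docs:
    pixels/cycle == k / cycles, so k == cycles * pixels/cycle."""
    return cycles * pixels_per_cycle


if __name__ == "__main__":
    for gpu, rows in TIMINGS.items():
        base_cycles, _ = rows["DX9"]
        for api, (cycles, ppc) in rows.items():
            print(f"{gpu} {api:4}: {extra_cycles_pct(cycles, base_cycles):+5.1f}% cycles vs DX9, "
                  f"k = {implied_constant(cycles, ppc):.2f}")
```

Static cycle estimates like these say nothing about texture cache behavior or memory bandwidth, so with 5 texture fetches per pixel the real-world gap may be smaller than the ALU numbers suggest.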

The shader uses 5 texture fetches and some ALU calculations. The input is 5 UV coordinates (float2) and the output is a single float4. There are no branches and no integer (or integer-like) operations; only mul/add/sqrt/pow/saturate/clamp are used.

So why does the analysis differ so much between the APIs?

(For DX9/DX10 the optimization level is 3; for GLSL there is no such option.)

PS. In NVIDIA's shader analysis tools, GLSL, DX9, and DX10 performance is almost equal (GLSL is slightly better than the DX9/DX10 profiles).
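On the saturate/clamp point: the GLSL specification defines clamp(x, minVal, maxVal) as min(max(x, minVal), maxVal), and HLSL's saturate(x) is clamp(x, 0, 1). A small Python sketch of those two definitions follows; the names glsl_clamp and saturate are illustrative only, not a real API. Because the spellings give the same result for ordinary inputs (minVal <= maxVal), any cycle difference between writing clamp and writing min/max would come from how each front end compiles the expression, not from the math itself.

```python
def glsl_clamp(x, min_val, max_val):
    """clamp as the GLSL spec defines it: min(max(x, minVal), maxVal).
    The result is undefined in GLSL if min_val > max_val."""
    return min(max(x, min_val), max_val)


def saturate(x):
    """HLSL saturate(x): clamp x to the range [0, 1]."""
    return glsl_clamp(x, 0.0, 1.0)


if __name__ == "__main__":
    for x in (-0.5, 0.25, 1.5):
        print(x, "->", saturate(x))
```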

bpurnomo
Staff

They use different front-end compilers (DX9, DX10, and GLSL). Also, the optimizations performed by our shader compiler can differ for each of these shaders.

 


So the GLSL front end should be far, far, far more efficient than it is now, because on average identical shaders are about 25% slower in GLSL than in DX9 HLSL / DX10 HLSL (which IMHO is a HUGE difference).

The min/max vs. clamp issue should help, but that is not the only problem. Another example:

HLSL DX9 (PS_3_0)

float4 main(float4 UV : TEXCOORD1) : COLOR0
{
 float4 Result;
 Result.w = 1.0;
 Result.xyz = UV.xyz / sqrt(UV.w);

 return Result;
}

microcode:

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(5)
      0  w: MOV         R1.w,  1.0f     
         t: RSQ_e       ____,  |R0.w|     
      1  x: MUL         R1.x,  R0.x,  PS0     
         y: MUL         R1.y,  R0.y,  PS0     
         z: MUL         R1.z,  R0.z,  PS0     
01 EXP_DONE: PIX0, R1
END_OF_PROGRAM

GLSL

void main()
{
 gl_FragColor.w   = 1.0;
 gl_FragColor.xyz = gl_TexCoord[1].xyz / sqrt(gl_TexCoord[1].w);
}

; --------  Disassembly --------------------
00 ALU: ADDR(32) CNT(6)
      0  w: MOV         R1.w,  1.0f     
         t: SQRT_e      ____,  R0.w     
      1  t: RCP_e       ____,  PS0     
      2  x: MUL_e       R1.x,  R0.x,  PS1     
         y: MUL_e       R1.y,  R0.y,  PS1     
         z: MUL_e       R1.z,  R0.z,  PS1     
01 EXP_DONE: PIX0, R1
END_OF_PROGRAM

 

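Why the hell does the GLSL version not use the faster inverse square root opcode (RSQ), and instead use two slower opcodes, SQRT and RCP?! Since RCP needs the result of SQRT, the two cannot share an instruction group, so the GLSL version takes 6 ALU slots in three instruction groups instead of 5 slots in two.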

PS. All of the above is RV770 assembly.
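The difference between the two listings comes down to a dependency between transcendental ops. The sketch below is a deliberately simplified model, an assumption rather than AMD's actual scheduler: each RV770 ALU instruction group gets four vector slots (x, y, z, w) and one transcendental slot (t), and an op can only be placed in a group after every op whose result it reads. Feeding it the ops from the two disassemblies reproduces their layout: the HLSL code fits in 2 groups, while in the GLSL code RCP has to wait for SQRT, pushing the MULs into a third group.

```python
from dataclasses import dataclass, field

# Simplified model (an assumption, not AMD's real scheduler): each ALU
# instruction group has 4 vector slots (x, y, z, w) and 1 transcendental slot (t).
SLOTS_PER_GROUP = {"vec": 4, "trans": 1}


@dataclass
class Op:
    name: str
    unit: str  # "vec" or "trans"
    deps: list = field(default_factory=list)  # names of ops whose results this op reads


def schedule(ops):
    """Greedily place ops, in program order, into the earliest instruction
    group that follows all of their dependencies and still has a free slot.
    Returns (number_of_groups, {op name: group index})."""
    group_of = {}
    used = []  # per-group slot usage
    for op in ops:
        g = max((group_of[d] + 1 for d in op.deps), default=0)
        while True:
            if g == len(used):
                used.append({"vec": 0, "trans": 0})
            if used[g][op.unit] < SLOTS_PER_GROUP[op.unit]:
                break
            g += 1
        used[g][op.unit] += 1
        group_of[op.name] = g
    return len(used), group_of


# HLSL listing: one RSQ feeding three MULs.
hlsl = [Op("MOV.w", "vec"), Op("RSQ", "trans"),
        Op("MUL.x", "vec", ["RSQ"]), Op("MUL.y", "vec", ["RSQ"]), Op("MUL.z", "vec", ["RSQ"])]

# GLSL listing: SQRT, then RCP of its result, then three MULs.
glsl = [Op("MOV.w", "vec"), Op("SQRT", "trans"), Op("RCP", "trans", ["SQRT"]),
        Op("MUL.x", "vec", ["RCP"]), Op("MUL.y", "vec", ["RCP"]), Op("MUL.z", "vec", ["RCP"])]

if __name__ == "__main__":
    for label, ops in (("HLSL", hlsl), ("GLSL", glsl)):
        groups, _ = schedule(ops)
        print(f"{label}: {len(ops)} ALU ops in {groups} instruction groups")
```

Running it reports 5 ops in 2 groups for HLSL and 6 ops in 3 groups for GLSL, matching CNT(5)/CNT(6) and the group numbering in the two listings.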

 
