Hi, AMD fellows,
I understand that kernels generated by OpenCL compiler may be less efficient than hand-written IL, since OpenCL C is a high-level language. I also believe that the low-level shader compiler (IL->ISA) does intensive optimization on the code. However, something unexpected is happening.
I found the OpenCL llvm compiler uses x[] to store private variables, instead of directly using registers. The shader compiler will put x[] in the register if there is enough space, however, the optimization is not thorough. We can see quite a lot of ISA code like this:
66 x: MOVA_INT ____, (0x00000008, 1.121038771e-44f).x
67 w: MOV R2.w, R10[A0.x].w
which is definitely equal to :
MOV R2.w, R18.w
and this MOV may be further removed if only R2.w is used in the following code.
The shader compiler frequently overlook the static addressing to x[], and this results in a chain reaction of under-optimization. For example, loops are not unrolled, unnecessary code are not removed, the 5-way processors are seriously under utilized.
I sincerely hope AMD fellows, especially the shader compiler team, look into this issue. I attached a simple matrix multiplication kernel to demonstrate this.
__kernel void main(__global float* mat4_0, __global float* mat0_0, __global float* mat3_0) { size_t vaTid; uint2 opos; float acc_10; float mat1_data_10; float mat2_data_10; float tmp_prod_14; vaTid = get_global_id(0); opos.xy = (uint2)(vaTid % 1024, vaTid / 1024); acc_10=0.0f; for (uint i_10=0; i_10<1024; i_10++) { mat1_data_10 = mat0_0[opos.y*1024+i_10]; mat2_data_10 = mat3_0[i_10*1024+opos.x]; tmp_prod_14=mat1_data_10*mat2_data_10; acc_10=acc_10+tmp_prod_14; } mat4_0[vaTid] = acc_10; }
Thanks. Just clarify the problem. The optimizations do produce correct kernels, but not very efficient. Here is another example attached.
Please look at the first line of the second whileloop:
ilt r2.x___, r0.x000, x0[1].x000
if you change this line to:
ilt r2.x___, r0.x000, l16.x
which is exactly equivalent, the resulting ISA will be largely optimized.
il_ps_2_0 dcl_literal l0,0,0,0,0 dcl_literal l1,1,1,1,1 dcl_literal l2,2,2,2,2 dcl_literal l3,3,3,3,3 dcl_literal l4,4,4,4,4 dcl_literal l16,16,0,0,0 dcl_indexed_temp_array x0[16] dcl_input_position_interp(linear_noperspective) v0.xy__ ftoi r20.xy__, v0.xy00 ;dcl_input_generic vObjIndex0 mov r0.x___, l0 whileloop ilt r2.x___, r0.x000, l16.x000 break_logicalz r2 mov x0[r0.x], l0.xxxx iadd r0.x___, r0.x000, l1.x000 endloop mov r0.x___, l0 mov x0[1].x, l16.x whileloop ilt r2.x___, r0.x000, x0[1].x000 break_logicalz r2.x mov x0[0], g[r0.x] iadd r0.x___, r0.x000, l1.x000 endloop mov r0.x___, l0 mov r23.z, l0 mov g[r23.z+0], x0[r23.z+0] mov g[r23.z+1], x0[r23.z+1] mov g[r23.z+2], x0[r23.z+2] mov g[r23.z+3], x0[r23.z+3] mov g[r23.z+4], x0[r23.z+4] mov g[r23.z+5], x0[r23.z+5] mov g[r23.z+6], x0[r23.z+6] mov g[r23.z+7], x0[r23.z+7] mov g[r23.z+8], x0[r23.z+8] mov g[r23.z+9], x0[r23.z+9] mov g[r23.z+10], x0[r23.z+10] mov g[r23.z+11], x0[r23.z+11] mov g[r23.z+12], x0[r23.z+12] mov g[r23.z+13], x0[r23.z+13] mov g[r23.z+14], x0[r23.z+14] mov g[r23.z+15], x0[r23.z+15] endmain end
; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(11) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: MOV R5.x, 0.0f t: MULLO_INT ____, R1.x, KC0[1].x 1 x: MOV R2.x, 0.0f w: ADD_INT ____, R0.x, PS0 2 x: ADD_INT R3.x, PV1.w, KC0[6].x 3 x: AND_INT R4.x, PV2.x, (0x000003FF, 1.433528329e-42f).x z: AND_INT ____, PV2.x, (0x3FFFFC00, 1.99987793f).y 4 y: LSHL ____, PV3.z, (0x00000002, 2.802596929e-45f).x 5 x: ADD_INT R6.x, KC1[1].x, PV4.y 01 LOOP_NO_AL i1 FAIL_JUMP_ADDR(6) 02 ALU: ADDR(43) CNT(11) KCACHE0(CB1:0-15) 6 x: ADD_INT R6.x, R6.x, (0x00000004, 5.605193857e-45f).x y: LSHR R0.y, R6.x, (0x00000002, 2.802596929e-45f).y z: OR_INT ____, R4.x, R2.x VEC_120 t: ADD_INT R2.x, R2.x, (0x00000400, 1.434929627e-42f).z 7 y: LSHL ____, PV6.z, (0x00000002, 2.802596929e-45f).x 8 w: ADD_INT ____, KC0[2].x, PV7.y 9 z: LSHR R0.z, PV8.w, (0x00000002, 2.802596929e-45f).x 03 TEX: ADDR(80) CNT(2) 10 VFETCH R0.x___, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 11 VFETCH R1.x___, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 04 ALU: ADDR(54) CNT(1) 12 x: MULADD R5.x, R0.x, R1.x, R5.x 05 ENDLOOP i1 PASS_JUMP_ADDR(2) 06 ALU: ADDR(55) CNT(12) KCACHE0(CB1:0-15) 13 x: OR_INT ____, R4.x, R2.x z: LSHL ____, R3.x, (0x00000002, 2.802596929e-45f).x VEC_201 14 x: LSHR R0.x, R6.x, (0x00000002, 2.802596929e-45f).x y: ADD_INT ____, KC0[0].x, PV13.z w: LSHL ____, PV13.x, (0x00000002, 2.802596929e-45f).x 15 x: LSHR R2.x, PV14.y, (0x00000002, 2.802596929e-45f).x z: ADD_INT ____, KC0[2].x, PV14.w 16 y: LSHR R0.y, PV15.z, (0x00000002, 2.802596929e-45f).x 07 TEX: ADDR(84) CNT(2) 17 VFETCH R1.x___, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 18 VFETCH R0.x___, R0.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 08 ALU: ADDR(67) CNT(1) 19 x: MULADD R0.x, R1.x, R0.x, R5.x 09 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R2].x___, R0, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM
So I should expect some performance improvement with 10.7 or 10.8 driver.
Thank you.