Archives Discussions

the729 · ‎07-01-2010

some insights about improving OCL kernel efficiency

Hi, AMD fellows,

I understand that kernels generated by OpenCL compiler may be less efficient than hand-written IL, since OpenCL C is a high-level language. I also believe that the low-level shader compiler (IL->ISA) does intensive optimization on the code. However, something unexpected is happening.

I found the OpenCL llvm compiler uses x[] to store private variables, instead of directly using registers. The shader compiler will put x[] in the register if there is enough space, however, the optimization is not thorough. We can see quite a lot of ISA code like this:
    66 x: MOVA_INT    ____, (0x00000008, 1.121038771e-44f).x
    67 w: MOV         R2.w, R10[A0.x].w
which is definitely equal to :
    MOV         R2.w, R18.w
and this MOV may be further removed if only R2.w is used in the following code.

The shader compiler frequently overlook the static addressing to x[], and this results in a chain reaction of under-optimization. For example, loops are not unrolled, unnecessary code are not removed, the 5-way processors are seriously under utilized.

I sincerely hope AMD fellows, especially the shader compiler team, look into this issue. I attached a simple matrix multiplication kernel to demonstrate this.

__kernel void main(__global float* mat4_0, __global float* mat0_0, __global float* mat3_0) { size_t vaTid; uint2 opos; float acc_10; float mat1_data_10; float mat2_data_10; float tmp_prod_14; vaTid = get_global_id(0); opos.xy = (uint2)(vaTid % 1024, vaTid / 1024); acc_10=0.0f; for (uint i_10=0; i_10<1024; i_10++) { mat1_data_10 = mat0_0[opos.y*1024+i_10]; mat2_data_10 = mat3_0[i_10*1024+opos.x]; tmp_prod_14=mat1_data_10*mat2_data_10; acc_10=acc_10+tmp_prod_14; } mat4_0[vaTid] = acc_10; }

MicahVillmow · ‎07-01-2010

the729,
Thanks for reporting this. If you have any other examples where optimizations are not producing what is expected, please let us know and we will look into it.

the729 · ‎07-02-2010

Thanks. Just clarify the problem. The optimizations do produce correct kernels, but not very efficient. Here is another example attached.

Please look at the first line of the second whileloop:
ilt r2.x___, r0.x000, x0[1].x000
if you change this line to:
ilt r2.x___, r0.x000, l16.x
which is exactly equivalent, the resulting ISA will be largely optimized.

il_ps_2_0 dcl_literal l0,0,0,0,0 dcl_literal l1,1,1,1,1 dcl_literal l2,2,2,2,2 dcl_literal l3,3,3,3,3 dcl_literal l4,4,4,4,4 dcl_literal l16,16,0,0,0 dcl_indexed_temp_array x0[16] dcl_input_position_interp(linear_noperspective) v0.xy__ ftoi r20.xy__, v0.xy00 ;dcl_input_generic vObjIndex0 mov r0.x___, l0 whileloop ilt r2.x___, r0.x000, l16.x000 break_logicalz r2 mov x0[r0.x], l0.xxxx iadd r0.x___, r0.x000, l1.x000 endloop mov r0.x___, l0 mov x0[1].x, l16.x whileloop ilt r2.x___, r0.x000, x0[1].x000 break_logicalz r2.x mov x0[0], g[r0.x] iadd r0.x___, r0.x000, l1.x000 endloop mov r0.x___, l0 mov r23.z, l0 mov g[r23.z+0], x0[r23.z+0] mov g[r23.z+1], x0[r23.z+1] mov g[r23.z+2], x0[r23.z+2] mov g[r23.z+3], x0[r23.z+3] mov g[r23.z+4], x0[r23.z+4] mov g[r23.z+5], x0[r23.z+5] mov g[r23.z+6], x0[r23.z+6] mov g[r23.z+7], x0[r23.z+7] mov g[r23.z+8], x0[r23.z+8] mov g[r23.z+9], x0[r23.z+9] mov g[r23.z+10], x0[r23.z+10] mov g[r23.z+11], x0[r23.z+11] mov g[r23.z+12], x0[r23.z+12] mov g[r23.z+13], x0[r23.z+13] mov g[r23.z+14], x0[r23.z+14] mov g[r23.z+15], x0[r23.z+15] endmain end

MicahVillmow · ‎07-06-2010

the729,
In our upcoming release your first example produces the following ISA:

; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(11) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: MOV R5.x, 0.0f t: MULLO_INT ____, R1.x, KC0[1].x 1 x: MOV R2.x, 0.0f w: ADD_INT ____, R0.x, PS0 2 x: ADD_INT R3.x, PV1.w, KC0[6].x 3 x: AND_INT R4.x, PV2.x, (0x000003FF, 1.433528329e-42f).x z: AND_INT ____, PV2.x, (0x3FFFFC00, 1.99987793f).y 4 y: LSHL ____, PV3.z, (0x00000002, 2.802596929e-45f).x 5 x: ADD_INT R6.x, KC1[1].x, PV4.y 01 LOOP_NO_AL i1 FAIL_JUMP_ADDR(6) 02 ALU: ADDR(43) CNT(11) KCACHE0(CB1:0-15) 6 x: ADD_INT R6.x, R6.x, (0x00000004, 5.605193857e-45f).x y: LSHR R0.y, R6.x, (0x00000002, 2.802596929e-45f).y z: OR_INT ____, R4.x, R2.x VEC_120 t: ADD_INT R2.x, R2.x, (0x00000400, 1.434929627e-42f).z 7 y: LSHL ____, PV6.z, (0x00000002, 2.802596929e-45f).x 8 w: ADD_INT ____, KC0[2].x, PV7.y 9 z: LSHR R0.z, PV8.w, (0x00000002, 2.802596929e-45f).x 03 TEX: ADDR(80) CNT(2) 10 VFETCH R0.x___, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 11 VFETCH R1.x___, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 04 ALU: ADDR(54) CNT(1) 12 x: MULADD R5.x, R0.x, R1.x, R5.x 05 ENDLOOP i1 PASS_JUMP_ADDR(2) 06 ALU: ADDR(55) CNT(12) KCACHE0(CB1:0-15) 13 x: OR_INT ____, R4.x, R2.x z: LSHL ____, R3.x, (0x00000002, 2.802596929e-45f).x VEC_201 14 x: LSHR R0.x, R6.x, (0x00000002, 2.802596929e-45f).x y: ADD_INT ____, KC0[0].x, PV13.z w: LSHL ____, PV13.x, (0x00000002, 2.802596929e-45f).x 15 x: LSHR R2.x, PV14.y, (0x00000002, 2.802596929e-45f).x z: ADD_INT ____, KC0[2].x, PV14.w 16 y: LSHR R0.y, PV15.z, (0x00000002, 2.802596929e-45f).x 07 TEX: ADDR(84) CNT(2) 17 VFETCH R1.x___, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 18 VFETCH R0.x___, R0.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 08 ALU: ADDR(67) CNT(1) 19 x: MULADD R0.x, R1.x, R0.x, R5.x 09 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R2].x___, R0, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

the729 · ‎07-12-2010

So I should expect some performance improvement with 10.7 or 10.8 driver.

Thank you.

Archives Discussions

Low Efficiency of OpenCL kernels