Low Efficiency of OpenCL kernels

Discussion created by the729 on Jul 1, 2010
Latest reply on Jul 12, 2010 by the729
some insights about improving OCL kernel efficiency

Hi, AMD fellows,

I understand that kernels generated by OpenCL compiler may be less efficient than hand-written IL, since OpenCL C is a high-level language.  I also believe that the low-level shader compiler (IL->ISA) does intensive optimization on the code. However, something unexpected is happening.

I found the OpenCL llvm compiler uses x[] to store private variables, instead of directly using registers. The shader compiler will put x[] in the register if there is enough space, however, the optimization is not thorough. We can see quite a lot of ISA code like this:
    66  x: MOVA_INT    ____,  (0x00000008, 1.121038771e-44f).x     
    67  w: MOV         R2.w,  R10[A0.x].w     
which is definitely equal to :
    MOV         R2.w,  R18.w     
and this MOV may be further removed if only R2.w is used in the following code.

The shader compiler frequently overlook the static addressing to x[], and this results in a chain reaction of under-optimization. For example, loops are not unrolled, unnecessary code are not removed, the 5-way processors are seriously under utilized. 

I sincerely hope AMD fellows, especially the shader compiler team, look into this issue. I attached a simple matrix multiplication kernel to demonstrate this.

__kernel void main(__global float* mat4_0, __global float* mat0_0, __global float* mat3_0) { size_t vaTid; uint2 opos; float acc_10; float mat1_data_10; float mat2_data_10; float tmp_prod_14; vaTid = get_global_id(0); opos.xy = (uint2)(vaTid % 1024, vaTid / 1024); acc_10=0.0f; for (uint i_10=0; i_10<1024; i_10++) { mat1_data_10 = mat0_0[opos.y*1024+i_10]; mat2_data_10 = mat3_0[i_10*1024+opos.x]; tmp_prod_14=mat1_data_10*mat2_data_10; acc_10=acc_10+tmp_prod_14; } mat4_0[vaTid] = acc_10; }