A simple example kernel:

__kernel void k_priv(__global int *out, __global int *in) {

__private int4 lookup[] = {

(int4)(1,2,3,4),

(int4)(5,6,7,8),

(int4)(9,10,11,12),

(int4)(13,14,15,16)

};

out[get_global_id(0)] = lookup[in[get_global_id(0)]].x;

}

The generated Cypress assembly is attached.

It appears that the compiler does 4 things:

1. Move -1 to a temporary register (R2)

2. Move the correct constants to temporary registers (R1,R3,R4,R5)

3. Move R2 to the final registers (R8,R9,R10,R11)

4. Move the correct constants from R1,R3,R4,R5 to R8,R9,R10,R11

It seems to me that simply moving the correct constants directly from immediates to the final registers would save at least 4 instructions (maybe up to 8) out of 19.

Is there some reason the compiler does this roundabout initialization, or is it a bug?

; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(38) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: MOV R2.x, -1 y: MOV R2.y, -1 z: MOV R2.z, -1 w: MOV R2.w, -1 t: MOV R3.x, (0x0000000D, 1.821688004e-44f).x 1 x: MOV R1.x, (0x00000009, 1.261168618e-44f).x y: MOV R3.y, (0x0000000E, 1.961817850e-44f).y z: MOV R3.z, (0x0000000F, 2.101947696e-44f).z w: MOV R3.w, (0x00000010, 2.242077543e-44f).w t: MULLO_INT ____, R1.x, KC0[1].x 2 x: MOV R4.x, (0x00000005, 7.006492322e-45f).x y: MOV R1.y, (0x0000000A, 1.401298464e-44f).y z: MOV R1.z, (0x0000000B, 1.541428311e-44f).z w: ADD_INT ____, R0.x, PS1 t: MOV R1.w, (0x0000000C, 1.681558157e-44f).w 3 x: ADD_INT ____, PV2.w, KC0[6].x y: MOV R4.y, (0x00000006, 8.407790786e-45f).x z: MOV R4.z, (0x00000007, 9.809089250e-45f).y w: MOV R4.w, (0x00000008, 1.121038771e-44f).z t: MOV R5.x, (0x00000001, 1.401298464e-45f).w 4 y: MOV R5.y, (0x00000002, 2.802596929e-45f).x z: LSHL ____, PV3.x, (0x00000002, 2.802596929e-45f).x w: MOV R5.w, (0x00000004, 5.605193857e-45f).y t: MOV R5.z, (0x00000003, 4.203895393e-45f).z 5 y: ADD_INT ____, KC1[0].x, PV4.z w: ADD_INT ____, KC1[1].x, PV4.z 6 x: LSHR R0.x, PV5.y, (0x00000002, 2.802596929e-45f).x z: LSHR R0.z, PV5.w, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(112) CNT(1) 7 VFETCH R0.___x, R0.z, fc173 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(70) CNT(38) 8 x: MOV R8.x, R2.x y: MOV R8.y, R2.y z: MOV R8.z, R2.z w: MOV R8.w, R2.w 9 x: MOV R9.x, R2.x y: MOV R9.y, R2.y z: MOV R9.z, R2.z w: MOV R9.w, R2.w 10 x: MOV R10.x, R2.x y: MOV R10.y, R2.y z: MOV R10.z, R2.z w: MOV R10.w, R2.w 11 x: MOV R11.x, R2.x y: MOV R11.y, R2.y z: MOV R11.z, R2.z w: MOV R11.w, R2.w 12 x: MOV R11.x, R3.x y: MOV R11.y, R3.y z: MOV R11.z, R3.z w: MOV R11.w, R3.w 13 x: MOV R10.x, R1.x y: MOV R10.y, R1.y z: MOV R10.z, R1.z w: MOV R10.w, R1.w 14 x: MOV R9.x, R4.x y: MOV R9.y, R4.y z: MOV R9.z, R4.z w: MOV R9.w, R4.w 15 x: MOV R8.x, R5.x y: MOV R8.y, R5.y z: MOV R8.z, R5.z w: MOV R8.w, R5.w 16 y: LSHL ____, R0.w, (0x00000004, 5.605193857e-45f).x 17 x: ASHR R5.x, PV16.y, (0x00000004, 5.605193857e-45f).x 18 x: MOVA_INT ____, R5.x 19 x: MOV R5.x, R8[A0.x].x 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R0].x___, R5, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

Superfluous MOVs are a very common problem. It's been like this for years.