Archives Discussions

omion · ‎05-08-2011

I'm doing a test to see how fast private arrays are, and there seem to be a lot of inefficiencies in the generated assembly code.

A simple example kernel:

__kernel void k_priv(__global int *out, __global int *in) {
__private int4 lookup[] = {
(int4)(1,2,3,4),
(int4)(5,6,7,8),
(int4)(9,10,11,12),
(int4)(13,14,15,16)
};
out[get_global_id(0)] = lookup[in[get_global_id(0)]].x;
}

The generated Cypress assembly is attached.

It appears that the compiler does 4 things:
1. Move -1 to a temporary register (R2)
2. Move the correct constants to temporary registers (R1,R3,R4,R5)
3. Move R2 to the final registers (R8,R9,R10,R11)
4. Move the correct constants from R1,R3,R4,R5 to R8,R9,R10,R11

It seems to me that simply moving the correct constants directly from immediates to the final registers would save at least 4 instructions (maybe up to 😎 out of 19.

Is there some reason the compiler does this roundabout initialization, or is it a bug?

; -------- Disassembly -------------------- 00 ALU: ADDR(32) CNT(38) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15) 0 x: MOV R2.x, -1 y: MOV R2.y, -1 z: MOV R2.z, -1 w: MOV R2.w, -1 t: MOV R3.x, (0x0000000D, 1.821688004e-44f).x 1 x: MOV R1.x, (0x00000009, 1.261168618e-44f).x y: MOV R3.y, (0x0000000E, 1.961817850e-44f).y z: MOV R3.z, (0x0000000F, 2.101947696e-44f).z w: MOV R3.w, (0x00000010, 2.242077543e-44f).w t: MULLO_INT ____, R1.x, KC0[1].x 2 x: MOV R4.x, (0x00000005, 7.006492322e-45f).x y: MOV R1.y, (0x0000000A, 1.401298464e-44f).y z: MOV R1.z, (0x0000000B, 1.541428311e-44f).z w: ADD_INT ____, R0.x, PS1 t: MOV R1.w, (0x0000000C, 1.681558157e-44f).w 3 x: ADD_INT ____, PV2.w, KC0[6].x y: MOV R4.y, (0x00000006, 8.407790786e-45f).x z: MOV R4.z, (0x00000007, 9.809089250e-45f).y w: MOV R4.w, (0x00000008, 1.121038771e-44f).z t: MOV R5.x, (0x00000001, 1.401298464e-45f).w 4 y: MOV R5.y, (0x00000002, 2.802596929e-45f).x z: LSHL ____, PV3.x, (0x00000002, 2.802596929e-45f).x w: MOV R5.w, (0x00000004, 5.605193857e-45f).y t: MOV R5.z, (0x00000003, 4.203895393e-45f).z 5 y: ADD_INT ____, KC1[0].x, PV4.z w: ADD_INT ____, KC1[1].x, PV4.z 6 x: LSHR R0.x, PV5.y, (0x00000002, 2.802596929e-45f).x z: LSHR R0.z, PV5.w, (0x00000002, 2.802596929e-45f).x 01 TEX: ADDR(112) CNT(1) 7 VFETCH R0.___x, R0.z, fc173 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 02 ALU: ADDR(70) CNT(38) 8 x: MOV R8.x, R2.x y: MOV R8.y, R2.y z: MOV R8.z, R2.z w: MOV R8.w, R2.w 9 x: MOV R9.x, R2.x y: MOV R9.y, R2.y z: MOV R9.z, R2.z w: MOV R9.w, R2.w 10 x: MOV R10.x, R2.x y: MOV R10.y, R2.y z: MOV R10.z, R2.z w: MOV R10.w, R2.w 11 x: MOV R11.x, R2.x y: MOV R11.y, R2.y z: MOV R11.z, R2.z w: MOV R11.w, R2.w 12 x: MOV R11.x, R3.x y: MOV R11.y, R3.y z: MOV R11.z, R3.z w: MOV R11.w, R3.w 13 x: MOV R10.x, R1.x y: MOV R10.y, R1.y z: MOV R10.z, R1.z w: MOV R10.w, R1.w 14 x: MOV R9.x, R4.x y: MOV R9.y, R4.y z: MOV R9.z, R4.z w: MOV R9.w, R4.w 15 x: MOV R8.x, R5.x y: MOV R8.y, R5.y z: MOV R8.z, R5.z w: MOV R8.w, R5.w 16 y: LSHL ____, R0.w, (0x00000004, 5.605193857e-45f).x 17 x: ASHR R5.x, PV16.y, (0x00000004, 5.605193857e-45f).x 18 x: MOVA_INT ____, R5.x 19 x: MOV R5.x, R8[A0.x].x 03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R0].x___, R5, ARRAY_SIZE(4) MARK VPM END_OF_PROGRAM

Jawed · ‎05-09-2011

Superfluous MOVs are a very common problem. It's been like this for years.

MicahVillmow · ‎05-09-2011

omion,
Private arrays accesses that are not moved into registers are equivalent to uncached global accesses.

omion · ‎05-09-2011

@Micah,

I know that private arrays are slow if they overflow, but I'm talking about if they fit entirely into registers.

My actual code ended up being pretty slow (which I was half expecting - it uses about 80 GPRs). I looked at the assembly and discovered that the vast majority of the instructions were moving my constant data between registers. So I'm now wondering how much of the time is wasted on the superfluous MOVs, and how much of it is because I can only run 2 wavefronts / CU.

adm271828 · ‎05-09-2011

Originally posted by: MicahVillmow omion, Private arrays accesses that are not moved into registers are equivalent to uncached global accesses.

Micah,

do you mean this is a normal expected semantic as specified in the standard? Or just the way you implemented it, not being able to optimize it? I still don't understand why R8..11 is not directly initialized with the values.

Regards,

Antoine

MicahVillmow · ‎05-09-2011

only running 2 wavefronts/CU is causing a lot of stalls by not having enough wavefronts in flight to hide the clause latency.

MicahVillmow · ‎05-09-2011

This is a side effect of having the shader compiler being a JIT compiler and not running multiple passes multiple times. Using private arrays is never recommended because except for memory promotion to register, not many optimizations are run on code that originates from private arrays.

The first writes of the value -1 are initialization of the private array to a set value, the second set of copies are the read from private memory. SDK 2.5 will get rid of the initial set, but the second set are required.

Archives Discussions

Strange initialization of private arrays