So I know curiosity killed the cat, but I can't help it any more...

Right now I'm working with a very, very serial algorithm. The good news is that we just run it on a lot of different data items, all roughly the same size and not large at all, which we receive over a network line (the algorithm is complicated enough that the network isn't a bottleneck). Parallelizing this is simple: run multiple concurrent instances of the algorithm. So that's what I did. It's a 32-bit algorithm, so pushing 4 instances through a SIMD unit seemed the logical choice. But hold on: at least the VLIW4 and VLIW5 parts aren't really SIMD; they look more like MIMD when you examine what the IL compiles to.
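To make the plan concrete, here is a minimal sketch of the idea (the real algorithm isn't shown in the post, so the mixing function and the names `mix32`/`mix32_x4` are made up for illustration): one serial 32-bit instance, and four instances run in lockstep so that every step maps onto one 4-wide SIMD operation.

```python
MASK = 0xFFFFFFFF  # keep every intermediate value in 32 bits

def mix32(a, b):
    """One 'instance' of a toy serial 32-bit algorithm (hypothetical)."""
    x = (a & b) & MASK
    x = (x ^ b) & MASK
    x = (x + a) & MASK
    return x

def mix32_x4(a4, b4):
    """Four instances in lockstep: each step applies the same operation
    to all four lanes, which is what a 4-wide SIMD unit would do."""
    x4 = [(a & b) & MASK for a, b in zip(a4, b4)]
    x4 = [(x ^ b) & MASK for x, b in zip(x4, b4)]
    x4 = [(x + a) & MASK for x, a in zip(x4, a4)]
    return x4
```

The lockstep version produces exactly the per-lane results of the serial version; the only change is that the "instruction stream" is shared across the four data items.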

So my question is twofold. First, is it mislabeled as SIMD when in fact each element of the SIMD processor can execute a different instruction, making it MIMD with optimizations (PV) for SIMD-style code, or is there more to it? Secondly, look at the attached code. I thought I would have a hard time getting the IL compiler to do this to my nonsense code, but it turned out to be very easy. Is there something I can do to get it to keep my blocks of 4?

///////////////////////IL Code....

il_cs_2_0
dcl_num_thread_per_group 64 //We'll stick with the default sample value for this. 64 seems like a good number...
dcl_raw_uav_id(11)
dcl_raw_uav_id(8)
dcl_cb cb0[1]
dcl_literal l0, 0x00000010, 0, 0, 0, 0
imul r1000.x, cb0[0].x, vAbsTidFlat.x
imul r1001.x, cb0[0].y, vAbsTidFlat.x
uav_raw_load_id(11) r0, r1000.x
iadd r1000.x, r1000.x, l0.x
uav_raw_load_id(11) r1, r1000.x
iadd r1000.x, r1000.x, l0.x
uav_raw_load_id(11) r2, r1000.x
iadd r1000.x, r1000.x, l0.x
uav_raw_load_id(11) r3, r1000.x
iadd r1000.x, r1000.x, l0.x
uav_raw_load_id(11) r4, r1000.x
iadd r1000.x, r1000.x, l0.x
uav_raw_load_id(11) r5, r1000.x
iadd r1000.x, r1000.x, l0.x
iand r0, r1, r2
ixor r1, r0, r2
iadd r2, r1, r3
ior r3, r2, r4
ixor r4, r3, r5
iand r5, r4, r0
ixor r0, r5, r1
iand r1, r0, r1
ior r2, r1, r2
ixor r3, r2, r4
uav_raw_store_id(8) mem, r1001.x, r0
iadd r1001.x, r1001.x, l0.x
uav_raw_store_id(8) mem, r1001.x, r1
iadd r1001.x, r1001.x, l0.x
uav_raw_store_id(8) mem, r1001.x, r2
iadd r1001.x, r1001.x, l0.x
uav_raw_store_id(8) mem, r1001.x, r3
iadd r1001.x, r1001.x, l0.x
uav_raw_store_id(8) mem, r1001.x, r4
iadd r1001.x, r1001.x, l0.x
uav_raw_store_id(8) mem, r1001.x, r5
iadd r1001.x, r1001.x, l0.x
ret
end

////////////////////////////////////ISA Code

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(4)
    0  x: LSHL        R4.x, R0.z, 6
       z: MOV         R2.z, 0.0f
       w: LSHL        R0.w, R0.y, 6
01 TEX: ADDR(144) CNT(1)
    1  VFETCH R2.xy__, R2.z, fc147  FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(36) CNT(39) KCACHE0(CB0:0-15)
    2  x: MULLO_UINT  ____, R1.z, R2.x
       y: MULLO_UINT  ____, R1.z, R2.x
       z: MULLO_UINT  ____, R1.z, R2.x
       w: MULLO_UINT  ____, R1.z, R2.x
    3  x: MULLO_UINT  R3.x, PV2.y, R2.y
       y: MULLO_UINT  ____, PV2.y, R2.y
       z: MULLO_UINT  ____, PV2.y, R2.y
       w: MULLO_UINT  ____, PV2.y, R2.y
    4  x: MULLO_UINT  ____, R1.y, R2.x
       y: MULLO_UINT  ____, R1.y, R2.x
       z: MULLO_UINT  ____, R1.y, R2.x
       w: MULLO_UINT  ____, R1.y, R2.x
    5  x: ADD_INT     ____, R4.x, R0.w
       z: ADD_INT     ____, R3.x, PV4.w  VEC_120
    6  y: ADD_INT     ____, R1.x, PV5.z
       w: ADD_INT     R0.w, R0.x, PV5.x  VEC_120
    7  x: LSHL        ____, PV6.y, 6
    8  w: ADD_INT     R5.w, R0.w, PV7.x
    9  x: MULLO_INT   ____, KC0[0].x, PV8.w
       y: MULLO_INT   ____, KC0[0].x, PV8.w
       z: MULLO_INT   ____, KC0[0].x, PV8.w
       w: MULLO_INT   ____, KC0[0].x, PV8.w
   10  x: ADD_INT     ____, PV9.z, 16
   11  z: ADD_INT     ____, PV10.x, 16
       w: LSHR        R0.w, PV10.x, 2
   12  x: ADD_INT     ____, PV11.z, 16
       y: LSHR        R0.y, PV11.z, 2
   13  z: ADD_INT     ____, PV12.x, 16
       w: LSHR        R1.w, PV12.x, 2
   14  x: ADD_INT     ____, PV13.z, 16
       y: LSHR        R1.y, PV13.z, 2
   15  w: LSHR        R2.w, PV14.x, 2
03 TEX: ADDR(146) CNT(5)
   16  VFETCH R3, R0.w, fc173  FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
   17  VFETCH R0, R0.y, fc173  FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
   18  VFETCH R4, R1.w, fc173  FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
   19  VFETCH R1, R1.y, fc173  FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
   20  VFETCH R2, R2.w, fc173  FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
04 ALU: ADDR(75) CNT(37) KCACHE0(CB0:0-15)
   21  x: AND_INT     R3.x, R3.z, R0.z
       y: AND_INT     R3.y, R3.y, R0.y
       z: AND_INT     R3.z, R3.x, R0.x
       w: AND_INT     R3.w, R3.w, R0.w
   22  x: XOR_INT     R0.x, R0.z, PV21.x
       y: XOR_INT     R0.y, R0.y, PV21.y
       z: XOR_INT     R0.z, R0.x, PV21.z
       w: XOR_INT     R0.w, R0.w, PV21.w
   23  x: ADD_INT     R4.x, R4.w, PV22.w
       y: ADD_INT     R4.y, R4.y, PV22.y
       z: ADD_INT     R4.z, R4.x, PV22.z
       w: ADD_INT     R4.w, R4.z, PV22.x
   24  x: OR_INT      ____, R1.w, PV23.x
       y: OR_INT      ____, R1.x, PV23.z
       z: OR_INT      ____, R1.z, PV23.w
       w: OR_INT      ____, R1.y, PV23.y
   25  x: XOR_INT     R1.x, R2.x, PV24.y
       y: XOR_INT     R1.y, R2.y, PV24.w
       z: XOR_INT     R1.z, R2.z, PV24.z
       w: XOR_INT     R1.w, R2.w, PV24.x
   26  x: MULLO_INT   ____, KC0[0].y, R5.w
       y: MULLO_INT   R2.y, KC0[0].y, R5.w
       z: MULLO_INT   ____, KC0[0].y, R5.w
       w: MULLO_INT   ____, KC0[0].y, R5.w
   27  x: AND_INT     R7.x, R3.z, R1.x
       y: AND_INT     R7.y, R3.y, R1.y
       z: AND_INT     R7.z, R3.x, R1.z
       w: AND_INT     R7.w, R3.w, R1.w
   28  x: LSHR        R6.x, R2.y, 2       //<----------WHY?!
       y: XOR_INT     R2.y, R0.y, PV27.y  VEC_120
       z: XOR_INT     R2.z, R0.x, PV27.z
       w: XOR_INT     R2.w, R0.w, PV27.w
   29  x: XOR_INT     R2.x, R0.z, R7.x    //<-----------This should be using PV! WHY?!
       y: AND_INT     R3.y, R0.y, PV28.y
       z: AND_INT     R3.z, R0.x, PV28.z
       w: AND_INT     R3.w, R0.w, PV28.w
05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R6], R2, ARRAY_SIZE(4)  MARK  VPM
06 ALU: ADDR(112) CNT(9)
   30  x: AND_INT     R3.x, R0.z, R2.x
       y: OR_INT      R0.y, R4.y, R3.y
       z: OR_INT      R0.z, R4.w, R3.z
       w: OR_INT      R0.w, R4.x, R3.w
   31  x: ADD_INT     R2.x, R6.x, 4
       y: XOR_INT     R5.y, R1.y, PV30.y
       z: XOR_INT     R5.z, R1.z, PV30.z
       w: XOR_INT     R5.w, R1.w, PV30.w
07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R2], R3, ARRAY_SIZE(4)  MARK  VPM
08 ALU: ADDR(121) CNT(3)
   32  x: OR_INT      R0.x, R4.z, R3.x
   33  x: ADD_INT     R3.x, R6.x, 8
09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R3], R0, ARRAY_SIZE(4)  MARK  VPM
10 ALU: ADDR(124) CNT(3)
   34  x: XOR_INT     R5.x, R1.x, R0.x
   35  x: ADD_INT     R0.x, R6.x, 12
11 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R0], R5, ARRAY_SIZE(4)  MARK  VPM
12 ALU: ADDR(127) CNT(2)
   36  x: ADD_INT     R0.x, R6.x, 16
13 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R0], R1, ARRAY_SIZE(4)  MARK  VPM
14 ALU: ADDR(129) CNT(2)
   37  x: ADD_INT     R6.x, R6.x, 20
15 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R6], R7, ARRAY_SIZE(4)  MARK  VPM
16 END

END_OF_PROGRAM

Hah. This confused me at first, until I realised it doesn't really matter...

The hardware 'stream core' (SC) is VLIW, which is MIMD, and that is what the ISA exposes.

But the 'compute unit' is implemented as a 16-way SIMD processor. This SIMD-ness is not exposed to the programmer directly: instead, each lane of the SIMD corresponds to a single work-item 'thread'.

i.e. each of the 16 SCs within the compute unit executes the same instruction at the same time, but each of those instructions is a VLIW bundle.

And just to make it more interesting, the SIMD processors are then used to implement the SIMT (single instruction, multiple thread) model that OpenCL GPU devices expose (in hardware, software, or both).

So it's all of them at once: VLIW/MIMD, but only local to the current work-item, with SIMD/SIMT used to implement the workgroup. Both labels are correct, since they describe different levels of the hardware.
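The two levels above can be sketched in a few lines. This is only a toy model (all the names and the 4-slot bundle shape are illustrative, not the real hardware): each "bundle" holds up to four independent slot operations, the MIMD-ish part local to one work-item, while the compute unit broadcasts the *same* bundle to all 16 lanes per step, the SIMD/SIMT part across work-items.

```python
NUM_LANES = 16  # one 'stream core' lane per work-item in this toy model

def run_bundles(bundles, lanes):
    """Execute a list of VLIW bundles on per-lane register files.

    bundles: list of bundles; each bundle is a list of (dst, fn, srcs) slots.
    lanes:   list of dicts mapping register name -> 32-bit value, one per lane.
    """
    for bundle in bundles:
        for regs in lanes:                 # same bundle on every lane: SIMD
            # read all sources before writing, so slots within a bundle
            # are independent of each other, as in a real VLIW issue
            results = [(dst, fn(*[regs[s] for s in srcs]))
                       for dst, fn, srcs in bundle]
            for dst, val in results:       # different ops per slot: VLIW/MIMD
                regs[dst] = val & 0xFFFFFFFF

# Each lane starts with its own data (one work-item per lane).
lanes = [{"r0": i, "r1": i + 100} for i in range(NUM_LANES)]
program = [
    # one bundle: four independent slot ops, like the x/y/z/w slots above
    [("r2", lambda a, b: a & b, ("r0", "r1")),
     ("r3", lambda a, b: a ^ b, ("r0", "r1")),
     ("r4", lambda a, b: a + b, ("r0", "r1")),
     ("r5", lambda a, b: a | b, ("r0", "r1"))],
]
run_bundles(program, lanes)
```

Every lane ends up having run the same bundle on its own registers, which is why the programmer only ever sees "one work-item, many of them", not the SIMD underneath.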

About the only real mislabel is 'thread', since a CPU thread is a well-defined, long-established concept and a GPU 'thread' is nothing like it. Actually, knowing that a GPU thread may be implemented as a SIMD lane makes those easier to understand too.