14 Replies Latest reply on Oct 18, 2011 1:28 PM by corry

    IL compiler optimization curiosity

    corry

      So I know curiosity killed the cat, but I can't help it any more...

      Right now, I'm working with a very, very serial algorithm. The good news is that we just run it on a lot of different data, all roughly the same size and not large at all, that we get from a network line (the algorithm is complicated enough that the network isn't a bottleneck). Parallelizing this is simple: run multiple concurrent instances of the algorithm. So that's what I did. It's a 32-bit algorithm, so pushing 4 instances through a SIMD unit seems the logical choice. But hold on: it seems at least the VLIW4 and VLIW5 parts aren't really SIMD; it's really more like MIMD when you look at what the IL compiler generates.

      So my question is this. First, is it really mislabeled as SIMD when in fact each element of the SIMD processor can execute a different instruction, making it MIMD with optimizations (PV) for running SIMD, or is there more to it? Secondly, look at the attached code. I thought I was going to have a hard time getting the IL compiler to do this on nonsense code, but it turned out to be very easy. Is there something I can do to get it to keep my blocks of 4?

      ///////////////////////IL Code....
      il_cs_2_0
      dcl_num_thread_per_group 64 //We'll stick with the default sample value for this. 64 seems like a good number...
      dcl_raw_uav_id(11)
      dcl_raw_uav_id(8)
      dcl_cb cb0[1]
      dcl_literal l0, 0x00000010, 0, 0, 0, 0
      imul r1000.x, cb0[0].x, vAbsTidFlat.x
      imul r1001.x, cb0[0].y, vAbsTidFlat.x
      uav_raw_load_id(11) r0, r1000.x
      iadd r1000.x, r1000.x, l0.x
      uav_raw_load_id(11) r1, r1000.x
      iadd r1000.x, r1000.x, l0.x
      uav_raw_load_id(11) r2, r1000.x
      iadd r1000.x, r1000.x, l0.x
      uav_raw_load_id(11) r3, r1000.x
      iadd r1000.x, r1000.x, l0.x
      uav_raw_load_id(11) r4, r1000.x
      iadd r1000.x, r1000.x, l0.x
      uav_raw_load_id(11) r5, r1000.x
      iadd r1000.x, r1000.x, l0.x
      iand r0, r1, r2
      ixor r1, r0, r2
      iadd r2, r1, r3
      ior r3, r2, r4
      ixor r4, r3, r5
      iand r5, r4, r0
      ixor r0, r5, r1
      iand r1, r0, r1
      ior r2, r1, r2
      ixor r3, r2, r4
      uav_raw_store_id(8) mem, r1001.x, r0
      iadd r1001.x, r1001.x, l0.x
      uav_raw_store_id(8) mem, r1001.x, r1
      iadd r1001.x, r1001.x, l0.x
      uav_raw_store_id(8) mem, r1001.x, r2
      iadd r1001.x, r1001.x, l0.x
      uav_raw_store_id(8) mem, r1001.x, r3
      iadd r1001.x, r1001.x, l0.x
      uav_raw_store_id(8) mem, r1001.x, r4
      iadd r1001.x, r1001.x, l0.x
      uav_raw_store_id(8) mem, r1001.x, r5
      iadd r1001.x, r1001.x, l0.x
      ret
      end

      ////////////////////////////////////ISA Code
      ; -------- Disassembly --------------------
      00 ALU: ADDR(32) CNT(4)
            0  x: LSHL R4.x, R0.z, 6
               z: MOV R2.z, 0.0f
               w: LSHL R0.w, R0.y, 6
      01 TEX: ADDR(144) CNT(1)
            1  VFETCH R2.xy__, R2.z, fc147 FETCH_TYPE(NO_INDEX_OFFSET)
      02 ALU: ADDR(36) CNT(39) KCACHE0(CB0:0-15)
            2  x: MULLO_UINT ____, R1.z, R2.x
               y: MULLO_UINT ____, R1.z, R2.x
               z: MULLO_UINT ____, R1.z, R2.x
               w: MULLO_UINT ____, R1.z, R2.x
            3  x: MULLO_UINT R3.x, PV2.y, R2.y
               y: MULLO_UINT ____, PV2.y, R2.y
               z: MULLO_UINT ____, PV2.y, R2.y
               w: MULLO_UINT ____, PV2.y, R2.y
            4  x: MULLO_UINT ____, R1.y, R2.x
               y: MULLO_UINT ____, R1.y, R2.x
               z: MULLO_UINT ____, R1.y, R2.x
               w: MULLO_UINT ____, R1.y, R2.x
            5  x: ADD_INT ____, R4.x, R0.w
               z: ADD_INT ____, R3.x, PV4.w VEC_120
            6  y: ADD_INT ____, R1.x, PV5.z
               w: ADD_INT R0.w, R0.x, PV5.x VEC_120
            7  x: LSHL ____, PV6.y, 6
            8  w: ADD_INT R5.w, R0.w, PV7.x
            9  x: MULLO_INT ____, KC0[0].x, PV8.w
               y: MULLO_INT ____, KC0[0].x, PV8.w
               z: MULLO_INT ____, KC0[0].x, PV8.w
               w: MULLO_INT ____, KC0[0].x, PV8.w
           10  x: ADD_INT ____, PV9.z, 16
           11  z: ADD_INT ____, PV10.x, 16
               w: LSHR R0.w, PV10.x, 2
           12  x: ADD_INT ____, PV11.z, 16
               y: LSHR R0.y, PV11.z, 2
           13  z: ADD_INT ____, PV12.x, 16
               w: LSHR R1.w, PV12.x, 2
           14  x: ADD_INT ____, PV13.z, 16
               y: LSHR R1.y, PV13.z, 2
           15  w: LSHR R2.w, PV14.x, 2
      03 TEX: ADDR(146) CNT(5)
           16  VFETCH R3, R0.w, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
           17  VFETCH R0, R0.y, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
           18  VFETCH R4, R1.w, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
           19  VFETCH R1, R1.y, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
           20  VFETCH R2, R2.w, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
      04 ALU: ADDR(75) CNT(37) KCACHE0(CB0:0-15)
           21  x: AND_INT R3.x, R3.z, R0.z
               y: AND_INT R3.y, R3.y, R0.y
               z: AND_INT R3.z, R3.x, R0.x
               w: AND_INT R3.w, R3.w, R0.w
           22  x: XOR_INT R0.x, R0.z, PV21.x
               y: XOR_INT R0.y, R0.y, PV21.y
               z: XOR_INT R0.z, R0.x, PV21.z
               w: XOR_INT R0.w, R0.w, PV21.w
           23  x: ADD_INT R4.x, R4.w, PV22.w
               y: ADD_INT R4.y, R4.y, PV22.y
               z: ADD_INT R4.z, R4.x, PV22.z
               w: ADD_INT R4.w, R4.z, PV22.x
           24  x: OR_INT ____, R1.w, PV23.x
               y: OR_INT ____, R1.x, PV23.z
               z: OR_INT ____, R1.z, PV23.w
               w: OR_INT ____, R1.y, PV23.y
           25  x: XOR_INT R1.x, R2.x, PV24.y
               y: XOR_INT R1.y, R2.y, PV24.w
               z: XOR_INT R1.z, R2.z, PV24.z
               w: XOR_INT R1.w, R2.w, PV24.x
           26  x: MULLO_INT ____, KC0[0].y, R5.w
               y: MULLO_INT R2.y, KC0[0].y, R5.w
               z: MULLO_INT ____, KC0[0].y, R5.w
               w: MULLO_INT ____, KC0[0].y, R5.w
           27  x: AND_INT R7.x, R3.z, R1.x
               y: AND_INT R7.y, R3.y, R1.y
               z: AND_INT R7.z, R3.x, R1.z
               w: AND_INT R7.w, R3.w, R1.w
           28  x: LSHR R6.x, R2.y, 2 //<----------WHY?!
               y: XOR_INT R2.y, R0.y, PV27.y VEC_120
               z: XOR_INT R2.z, R0.x, PV27.z
               w: XOR_INT R2.w, R0.w, PV27.w
           29  x: XOR_INT R2.x, R0.z, R7.x <-----------This should be using PV! WHY?!
               y: AND_INT R3.y, R0.y, PV28.y
               z: AND_INT R3.z, R0.x, PV28.z
               w: AND_INT R3.w, R0.w, PV28.w
      05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R6], R2, ARRAY_SIZE(4) MARK VPM
      06 ALU: ADDR(112) CNT(9)
           30  x: AND_INT R3.x, R0.z, R2.x
               y: OR_INT R0.y, R4.y, R3.y
               z: OR_INT R0.z, R4.w, R3.z
               w: OR_INT R0.w, R4.x, R3.w
           31  x: ADD_INT R2.x, R6.x, 4
               y: XOR_INT R5.y, R1.y, PV30.y
               z: XOR_INT R5.z, R1.z, PV30.z
               w: XOR_INT R5.w, R1.w, PV30.w
      07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R2], R3, ARRAY_SIZE(4) MARK VPM
      08 ALU: ADDR(121) CNT(3)
           32  x: OR_INT R0.x, R4.z, R3.x
           33  x: ADD_INT R3.x, R6.x, 8
      09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R3], R0, ARRAY_SIZE(4) MARK VPM
      10 ALU: ADDR(124) CNT(3)
           34  x: XOR_INT R5.x, R1.x, R0.x
           35  x: ADD_INT R0.x, R6.x, 12
      11 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R0], R5, ARRAY_SIZE(4) MARK VPM
      12 ALU: ADDR(127) CNT(2)
           36  x: ADD_INT R0.x, R6.x, 16
      13 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R0], R1, ARRAY_SIZE(4) MARK VPM
      14 ALU: ADDR(129) CNT(2)
           37  x: ADD_INT R6.x, R6.x, 20
      15 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R6], R7, ARRAY_SIZE(4) MARK VPM
      16 END
      END_OF_PROGRAM

        • IL compiler optimization curiosity
          notzed

          Hah. This confused me to start with until I realised it doesn't really matter ...

          The hardware 'stream core' (SC) is VLIW - which is MIMD, and exposed in the ISA.

          But the 'compute unit' is implemented as a 16-way SIMD processor.  This SIMD-ness is not exposed to the programmer directly: instead, each lane of the SIMD is a single work-item 'thread'.

          i.e. each of the 16 SCs within the CU executes the same instruction at the same time.  But each instruction is a VLIW instruction.

          And just to make it more interesting, the SIMD processors are then used to implement a SIMT (single instruction multiple thread) model as exposed by OpenCL GPU devices (either in hardware, or software, or both).

          So it's all of them at once, VLIW/MIMD but only local to the current work-item, SIMD/SIMT is used to implement the workgroup.  They're both correct since they're talking about different levels of hardware.

          About the only real mislabel is 'thread', since a CPU thread is such a well-defined and existing label, and a GPU thread is nothing like it.  Actually, knowing it might be implemented as a SIMD lane makes those easier to understand too.

           

            • IL compiler optimization curiosity
              corry

              Interesting. In college I never looked into massively parallel systems, so when I saw ATI talking about VLIW5, VLIW4, etc., I assumed they were marketing or proprietary terms.  Thank goodness for the internet :)  I'm now somewhat more educated, so yes, I can see somewhat why it's doing what it's doing.

              At least I know that all it's doing is screwing up the PV slots... My latest optimizations dropped my register count to where I thought I could get another wavefront, but it seems I can't, and the register count isn't going to get much lower, so I guess I currently care much less. Still, I really do think there should be an option to at least back off the optimization attempts... When the IL is already parallel, that should be enough...

            • IL compiler optimization curiosity
              LeeHowes

              VLIW is not SIMD. It's VLIW.

              It's not MIMD either, really, except in the loosest sense. The reason is that people's interpretation of MIMD would be that there are different instruction streams, but VLIW is very fixed. The reality is that VLIW is issuing a single instruction; it just happens to pack multiple operations in that instruction.

              Incidentally, a GPU thread is a horrible mislabeling (I tried to make this very clear in "Heterogeneous Computing with OpenCL", btw; my major goal with my parts of that book was to accurately describe the hardware tradeoffs) and the reality is that a wavefront/warp is a GPU thread. However, you can see where the people who use the term are coming from because:

              a) The programmer-visible control flow sort of makes a work item a thread, but it's a *very* dangerous way of thinking if you try to do anything complicated, because you can get yourself into nasty deadlock situations if you don't keep in mind that something that looks like a thread is not really a thread.

              b) The hardware automates some of the mask management and replay tricks such that you never need see the mask stack management code that is how the SIMD operation emulates very limited MIMD-like behaviour.
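To make point (b) a little more concrete, here is a rough Python sketch (a toy model, not real hardware or driver behaviour) of how a SIMD unit can fake per-lane branching with an execution mask: each side of the branch is issued to the whole vector, and the mask decides which lanes may commit a result. The name `simd_if_else` and the lane values are invented for the example.

```python
def simd_if_else(values, cond, then_op, else_op):
    """Run an if/else over all lanes the way a masked SIMD unit might.

    Hypothetical model: a branch side is issued to the whole vector
    whenever at least one lane needs it; the mask gates the write-back.
    Returns the per-lane results and how many branch sides were issued.
    """
    mask = [cond(v) for v in values]      # per-lane execution mask
    out = list(values)
    issued = 0
    if any(mask):                         # "then" side, mask as computed
        issued += 1
        out = [then_op(v) if m else o for v, m, o in zip(values, mask, out)]
    if not all(mask):                     # "else" side, mask inverted
        issued += 1
        out = [else_op(v) if not m else o for v, m, o in zip(values, mask, out)]
    return out, issued


# Divergent lanes pay for both sides of the branch; uniform lanes pay for one.
divergent = simd_if_else([1, 2, 3, 4], lambda v: v % 2 == 0,
                         lambda v: v * 10, lambda v: v + 1)
uniform = simd_if_else([2, 4], lambda v: v % 2 == 0,
                       lambda v: v * 10, lambda v: v + 1)
```

The point of the sketch is only that divergence costs issue slots for the whole vector, which is why a work item that "looks like a thread" does not behave like one.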

              You can't really back off all optimisation because the IL is only SIMD parallel, the shader compiler has to do all the VLIW and mini-vector-register packing for you. It's a bit of a pain but it was done to try to make IL reasonably future proof and scalable.
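As a very loose illustration of the packing job described above, here is a minimal Python sketch of a greedy in-order packer. It is an assumption-laden toy, not the shader compiler's algorithm: ops are `(dest, src1, src2)` tuples with made-up register names, the bundle width is 4, and an op starts a new bundle whenever it touches a register already written in the current one. A real compiler also reorders instructions and handles multi-slot ops, which this does not.

```python
def pack_vliw(ops, width=4):
    """Greedily pack (dest, src1, src2) ops into bundles of up to `width`.

    Toy rule: an op may join the current bundle only if it neither reads
    nor writes a register written earlier in that same bundle.
    Program order is preserved; no reordering is attempted.
    """
    bundles = []
    current, written = [], set()
    for dest, *srcs in ops:
        depends = dest in written or any(s in written for s in srcs)
        if depends or len(current) == width:
            bundles.append(current)       # close the bundle and start a new one
            current, written = [], set()
        current.append((dest, *srcs))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


# Four independent ops fit in one bundle...
independent = [("r0", "a", "b"), ("r1", "c", "d"), ("r2", "e", "f"), ("r3", "g", "h")]
# ...while a dependency chain needs one bundle per op.
chain = [("r0", "a", "b"), ("r1", "r0", "c"), ("r2", "r1", "d")]
packed_independent = pack_vliw(independent)
packed_chain = pack_vliw(chain)
```

Even this simplified rule shows why scalar input can pack well: what matters is independence between ops, not whether the source was written with vector registers.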

               

              ETA: VLIW isn't really a massively parallel thing. It's a power and transistor efficiency trick. A lot of DSPs are designed that way because you can pack instructions in without the complicated dependence analysis logic needed for full out-of-order superscalar as in the major high-performance CPUs. Itanium is the classic recent case of trying to scale it up to big CPUs (I love the Itanium 2... such a big pile of architectural fun).

                • IL compiler optimization curiosity
                  corry

                  You're right, it doesn't *need* to be massively parallel, though given that the initial unit built in the 80's was, what, a 27- or 29-way system (I could go look it back up...), it seems parallel computing was what they had in mind for it :) Of course in graphics, with 4-element color and vertex data, it makes some sense... though SIMD systems would usually work just fine for the same thing... However, in massively parallel general-purpose compute systems, it really starts to make sense: SIMD when you want it, scalar when you don't, and the compiler reorders to find parallelism and save on transistor count... yup, makes sense to me.

                  However, I'm unconvinced by the IL->ISA argument, and here is why. You stated IL is SIMD. As I said before, that makes sense from a graphics perspective (or 3D simulations :) ), while underneath, the hardware is VLIW4 in my case. Given that VLIW4 can be said to be a superset of 4-element SIMD (call it SIMD4?), it seems there should be a 1-1 mapping of SIMD4->VLIW4, which has no dependencies and has everything packed into vectors already, negating the need for the compiler to do any such looking. Indeed, even in the code I posted with this message, it seems the compiler is actually breaking the built-in parallelism, probably because of a greedy algorithm packing the instructions. In that case, because the code other than the address register is SIMD4, placing the scalar code at the top of one of my already coded SIMD blocks breaks the use of the PV register. If I understand the docs right, that means we have to use a GPR read, which can take some time, whereas the PV register is instantaneous (the data is already there in the ALU; it doesn't have to be fetched from anywhere, not even another register). I'm not entirely clear on that though, so I suppose you could say that was half the reason for this thread :)
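A small Python sketch of the PV behaviour being described, under the assumption (my reading of the ISA listings in this thread, not a statement from the docs) that PV only exposes the results of the instruction group issued immediately before the current one. The function name is made up. In the first listing, group 27 writes R7.x, group 28's x slot is taken by the LSHR, and the XOR that wanted that value lands in group 29, where it reads R7.x from the register file instead of PV.

```python
def operand_source(producer_group, consumer_group):
    """Say where a consumer can read a result from, in this toy model.

    Assumption: PV holds only the results of the group issued right
    before the current one; anything older must be read from a GPR.
    """
    if consumer_group <= producer_group:
        raise ValueError("consumer must issue after producer")
    return "PV" if consumer_group == producer_group + 1 else "GPR"


# y/z/w XORs in group 28 consume group 27 directly, so they can use PV27.*
adjacent = operand_source(27, 28)
# the x XOR was pushed to group 29, so it has to read R7.x
delayed = operand_source(27, 29)
```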

                  I know how things go with developer time... there's a lot I'd like to play with on these cards myself, and things I think I could do... but, alas, there's always more work, and more important work... Given my more recent understanding of the hardware though, I think it should be relatively easy to write a SIMD4 IL->VLIW4 code "translator" (rather than compiler). I never got to take the compilers course in college... the school (a very large, well-known school for that matter) couldn't find a prof to hire that they felt was qualified to teach it! So I've never managed to play with lexical parsers, compiler compilers, etc., so I'd have a learning curve... I think I may be heading into a stretch where personal programming fun may be the only fun I can have for a week or so, so perhaps then I'll look into it... still, it sure would be nice if the built-in compiler just had the option for it!

                    • IL compiler optimization curiosity
                      gat3way

                      IL like PTX is not a real assembly. I guess the backend compiler might have done  that in order to reduce the number of clauses emitted because clause switches are relatively expensive. 

                        • IL compiler optimization curiosity
                          corry

                           

                          Originally posted by: gat3way IL like PTX is not a real assembly. I guess the backend compiler might have done that in order to reduce the number of clauses emitted because clause switches are relatively expensive.

                          Yes, of course it's only a pseudo assembly, but most instructions do map 1-1, in a different format, to the ISA, thus my statement.

                          Initially, I wanted to discount that theory, but then I thought about a construct I have seen the compiler generate, with multiple registers for addressing multiple reads/writes. So I decided to try to force that form at the IL level, that is, to use the ALU to calculate all the addresses prior to clause switching, and then use those values. I think the code will say things more clearly than I can... At 28, you see that for some reason it decides to do an lshr. I guess I'd have to open the docs up to see what the max number of instructions in a clause is (I'm already still at work an hour later than planned), but 30 would seem a bit odd... I would expect a power of 2 if anything... Anyhow, as you can see at the end of the ISA code, there are needlessly half-empty (or half-full if you're the optimistic type :) ) clauses at the bottom. The addresses could have been precalculated, and I could add literals in there so they aren't dependent on one another (I'll try that in a few minutes), so they can be computed SIMD fashion, so there's no reason to be inserting the adds in there... let me try the SIMD add though...

                           

                          This is good, it's helping me figure out what the IL compiler is doing... let me say thanks before I forget! :)

                          il_cs_2_0
                          dcl_num_thread_per_group 64 //We'll stick with the default sample value for this. 64 seems like a good number...
                          dcl_raw_uav_id(11)
                          dcl_raw_uav_id(8)
                          dcl_cb cb0[1]
                          dcl_literal l0, 0x00000010, 0, 0, 0, 0
                          imul r1000.x, cb0[0].x, vAbsTidFlat.x
                          imul r1002.x, cb0[0].y, vAbsTidFlat.x
                          iadd r1000.y, r1000.x, l0.x
                          iadd r1000.z, r1000.y, l0.x
                          iadd r1000.w, r1000.z, l0.x
                          iadd r1001.x, r1000.w, l0.x
                          iadd r1001.y, r1001.x, l0.x
                          uav_raw_load_id(11) r0, r1000.x
                          uav_raw_load_id(11) r1, r1000.y
                          uav_raw_load_id(11) r2, r1000.z
                          uav_raw_load_id(11) r3, r1000.w
                          uav_raw_load_id(11) r4, r1001.x
                          uav_raw_load_id(11) r5, r1001.y
                          iadd r1000.x, r1000.x, l0.x
                          iand r0, r1, r2
                          ixor r1, r0, r2
                          iadd r2, r1, r3
                          ior r3, r2, r4
                          ixor r4, r3, r5
                          iand r5, r4, r0
                          ixor r0, r5, r1
                          iand r1, r0, r1
                          ior r2, r1, r2
                          ixor r3, r2, r4
                          iadd r1002.y, r1002.x, l0.x
                          iadd r1002.z, r1002.y, l0.x
                          iadd r1002.w, r1002.z, l0.x
                          iadd r1003.x, r1003.w, l0.x
                          iadd r1003.y, r1003.x, l0.x
                          uav_raw_store_id(8) mem, r1002.x, r0
                          uav_raw_store_id(8) mem, r1002.y, r1
                          uav_raw_store_id(8) mem, r1002.z, r2
                          uav_raw_store_id(8) mem, r1002.w, r3
                          uav_raw_store_id(8) mem, r1003.x, r4
                          uav_raw_store_id(8) mem, r1003.y, r5
                          ret
                          end

                          ; -------- Disassembly --------------------
                          00 ALU: ADDR(32) CNT(5)
                                0  x: LSHL R4.x, R0.z, 6
                                   y: LSHR R3.y, 0.0f, 2
                                   z: MOV R2.z, 0.0f
                                   w: LSHL R0.w, R0.y, 6
                          01 TEX: ADDR(144) CNT(1)
                                1  VFETCH R2.xy__, R2.z, fc147 FETCH_TYPE(NO_INDEX_OFFSET)
                          02 ALU: ADDR(37) CNT(43) KCACHE0(CB0:0-15)
                                2  x: MULLO_UINT ____, R1.z, R2.x
                                   y: MULLO_UINT ____, R1.z, R2.x
                                   z: MULLO_UINT ____, R1.z, R2.x
                                   w: MULLO_UINT ____, R1.z, R2.x
                                3  x: MULLO_UINT R3.x, PV2.y, R2.y
                                   y: MULLO_UINT ____, PV2.y, R2.y
                                   z: MULLO_UINT ____, PV2.y, R2.y
                                   w: MULLO_UINT ____, PV2.y, R2.y
                                4  x: MULLO_UINT ____, R1.y, R2.x
                                   y: MULLO_UINT ____, R1.y, R2.x
                                   z: MULLO_UINT ____, R1.y, R2.x
                                   w: MULLO_UINT ____, R1.y, R2.x
                                5  x: ADD_INT ____, R4.x, R0.w
                                   z: ADD_INT ____, R3.x, PV4.w VEC_120
                                6  x: ADD_INT R7.x, R3.y, 4
                                   y: ADD_INT ____, R1.x, PV5.z
                                   w: ADD_INT R0.w, R0.x, PV5.x VEC_120
                                7  x: LSHL ____, PV6.y, 6
                                8  x: ADD_INT R9.x, R3.y, 8
                                   w: ADD_INT R5.w, R0.w, PV7.x
                                9  x: MULLO_INT ____, KC0[0].x, PV8.w
                                   y: MULLO_INT ____, KC0[0].x, PV8.w
                                   z: MULLO_INT ____, KC0[0].x, PV8.w
                                   w: MULLO_INT ____, KC0[0].x, PV8.w
                               10  x: ADD_INT ____, PV9.z, 16
                               11  x: LSHR R0.x, PV10.x, 2
                                   w: ADD_INT ____, PV10.x, 16
                               12  x: ADD_INT ____, PV11.w, 16
                                   z: LSHR R0.z, PV11.w, 2
                               13  x: LSHR R1.x, PV12.x, 2
                                   y: ADD_INT ____, PV12.x, 16
                               14  x: ADD_INT ____, PV13.y, 16
                                   y: LSHR R3.y, PV13.y, 2
                               15  w: LSHR R0.w, PV14.x, 2
                          03 TEX: ADDR(146) CNT(5)
                               16  VFETCH R4, R0.x, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                               17  VFETCH R2, R0.z, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                               18  VFETCH R1, R1.x, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                               19  VFETCH R3, R3.y, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                               20  VFETCH R0, R0.w, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                          04 ALU: ADDR(80) CNT(37) KCACHE0(CB0:0-15)
                               21  x: AND_INT R4.x, R4.z, R2.z
                                   y: AND_INT R4.y, R4.y, R2.y
                                   z: AND_INT R4.z, R4.x, R2.x
                                   w: AND_INT R4.w, R4.w, R2.w
                               22  x: XOR_INT R2.x, R2.z, PV21.x
                                   y: XOR_INT R2.y, R2.y, PV21.y
                                   z: XOR_INT R2.z, R2.x, PV21.z
                                   w: XOR_INT R2.w, R2.w, PV21.w
                               23  x: ADD_INT R1.x, R1.w, PV22.w
                                   y: ADD_INT R1.y, R1.y, PV22.y
                                   z: ADD_INT R1.z, R1.x, PV22.z
                                   w: ADD_INT R1.w, R1.z, PV22.x
                               24  x: OR_INT ____, R3.w, PV23.x
                                   y: OR_INT ____, R3.x, PV23.z
                                   z: OR_INT ____, R3.z, PV23.w
                                   w: OR_INT ____, R3.y, PV23.y
                               25  x: XOR_INT R6.x, R0.x, PV24.y
                                   y: XOR_INT R6.y, R0.y, PV24.w
                                   z: XOR_INT R6.z, R0.z, PV24.z
                                   w: XOR_INT R6.w, R0.w, PV24.x
                               26  x: MULLO_INT ____, KC0[0].y, R5.w
                                   y: MULLO_INT R0.y, KC0[0].y, R5.w
                                   z: MULLO_INT ____, KC0[0].y, R5.w
                                   w: MULLO_INT ____, KC0[0].y, R5.w
                               27  x: AND_INT R8.x, R4.z, R6.x
                                   y: AND_INT R8.y, R4.y, R6.y
                                   z: AND_INT R8.z, R4.x, R6.z
                                   w: AND_INT R8.w, R4.w, R6.w
                               28  x: LSHR R3.x, R0.y, 2
                                   y: XOR_INT R0.y, R2.y, PV27.y VEC_120
                                   z: XOR_INT R0.z, R2.x, PV27.z
                                   w: XOR_INT R0.w, R2.w, PV27.w
                               29  x: XOR_INT R0.x, R2.z, R8.x
                                   y: AND_INT R4.y, R2.y, PV28.y
                                   z: AND_INT R4.z, R2.x, PV28.z
                                   w: AND_INT R4.w, R2.w, PV28.w
                          05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R3], R0, ARRAY_SIZE(4) MARK VPM
                          06 ALU: ADDR(117) CNT(9)
                               30  x: AND_INT R4.x, R2.z, R0.x
                                   y: OR_INT R2.y, R1.y, R4.y
                                   z: OR_INT R2.z, R1.w, R4.z
                                   w: OR_INT R2.w, R1.x, R4.w
                               31  x: ADD_INT R0.x, R3.x, 4
                                   y: XOR_INT R5.y, R6.y, PV30.y
                                   z: XOR_INT R5.z, R6.z, PV30.z
                                   w: XOR_INT R5.w, R6.w, PV30.w
                          07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R0], R4, ARRAY_SIZE(4) MARK VPM
                          08 ALU: ADDR(126) CNT(3)
                               32  x: OR_INT R2.x, R1.z, R4.x
                               33  x: ADD_INT R4.x, R3.x, 8
                          09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R4], R2, ARRAY_SIZE(4) MARK VPM
                          10 ALU: ADDR(129) CNT(3)
                               34  x: XOR_INT R5.x, R6.x, R2.x
                               35  x: ADD_INT R3.x, R3.x, 12
                          11 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R3], R5, ARRAY_SIZE(4) MARK VPM
                          12 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R7], R6, ARRAY_SIZE(4) MARK VPM
                          13 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R9], R8, ARRAY_SIZE(4) MARK VPM
                          14 END
                          END_OF_PROGRAM

                            • IL compiler optimization curiosity
                              corry

                              Argh, to no avail... that lshr 2 is going to kill me every time... I am guessing this is because IL UAV addresses don't match up to RAT addresses, so it has to divide by 4?  Still, I think having this in a clause by itself is better than having the final logic operations in clauses by themselves.  Anyhow, I fixed the typo in the previous code too... still, that lshr is going to bite me!
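The divide-by-4 guess is at least consistent with the listings: the IL steps addresses by 0x10, while the ISA steps the store addresses by 4 after the LSHR. A minimal Python sketch of that conversion, assuming the IL side uses byte offsets and the RAT side uses 32-bit (dword) indices; the function name is invented for the example.

```python
def byte_offset_to_dword_index(byte_offset):
    """Convert a byte offset into a 32-bit (dword) element index.

    Assumes dword-aligned offsets, which is what a shift right by 2
    silently relies on.
    """
    if byte_offset % 4:
        raise ValueError("offset is not dword-aligned")
    return byte_offset >> 2   # same as dividing by 4


# 0x10, 0x20, 0x30 bytes become 4, 8, 12 dwords, matching the ADD_INT
# immediates that follow the LSHR in the store address calculations.
steps = [byte_offset_to_dword_index(b) for b in (0x10, 0x20, 0x30)]
```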

                              Here's some 100% simd code, except for the fact that uav code doesn't take a vector for its address (since of course that wouldn't really make sense...)

                              Edit: Managed to hit tab and enter or space at nearly the same time, and it somehow hit reply!  Yes, it's most definitely time to go home!!!

                              il_cs_2_0
                              dcl_num_thread_per_group 64 //We'll stick with the default sample value for this. 64 seems like a good number...
                              dcl_raw_uav_id(11)
                              dcl_raw_uav_id(8)
                              dcl_cb cb0[1]
                              dcl_literal l0, 0x00000000, 0x00000010, 0x00000020, 0x00000030
                              dcl_literal l1, 0x00000040, 0x00000050, 0x00000060, 0x00000040
                              imul r1000, cb0[0].xxxx, vAbsTidFlat.xxxx
                              imul r1002, cb0[0].yyyy, vAbsTidFlat.xxxx
                              mov r1001, r1000
                              mov r1003, r1002
                              iadd r1000, r1000, l0
                              iadd r1001, r1001, l1
                              iadd r1002, r1002, l0
                              iadd r1003, r1002, l1
                              uav_raw_load_id(11) r0, r1000.x
                              uav_raw_load_id(11) r1, r1000.y
                              uav_raw_load_id(11) r2, r1000.z
                              uav_raw_load_id(11) r3, r1000.w
                              uav_raw_load_id(11) r4, r1001.x
                              uav_raw_load_id(11) r5, r1001.y
                              iand r0, r1, r2
                              ixor r1, r0, r2
                              iadd r2, r1, r3
                              ior r3, r2, r4
                              ixor r4, r3, r5
                              iand r5, r4, r0
                              ixor r0, r5, r1
                              iand r1, r0, r1
                              ior r2, r1, r2
                              ixor r3, r2, r4
                              uav_raw_store_id(8) mem, r1002.x, r0
                              uav_raw_store_id(8) mem, r1002.y, r1
                              uav_raw_store_id(8) mem, r1002.z, r2
                              uav_raw_store_id(8) mem, r1002.w, r3
                              uav_raw_store_id(8) mem, r1003.x, r4
                              uav_raw_store_id(8) mem, r1003.y, r5
                              ret
                              end

                              ; -------- Disassembly --------------------
                              00 ALU: ADDR(32) CNT(4)
                                    0  x: LSHL R4.x, R0.z, 6
                                       z: MOV R2.z, 0.0f
                                       w: LSHL R0.w, R0.y, 6
                              01 TEX: ADDR(144) CNT(1)
                                    1  VFETCH R2.xy__, R2.z, fc147 FETCH_TYPE(NO_INDEX_OFFSET)
                              02 ALU: ADDR(36) CNT(37) KCACHE0(CB0:0-15)
                                    2  x: MULLO_UINT ____, R1.z, R2.x
                                       y: MULLO_UINT ____, R1.z, R2.x
                                       z: MULLO_UINT ____, R1.z, R2.x
                                       w: MULLO_UINT ____, R1.z, R2.x
                                    3  x: MULLO_UINT R3.x, PV2.y, R2.y
                                       y: MULLO_UINT ____, PV2.y, R2.y
                                       z: MULLO_UINT ____, PV2.y, R2.y
                                       w: MULLO_UINT ____, PV2.y, R2.y
                                    4  x: MULLO_UINT ____, R1.y, R2.x
                                       y: MULLO_UINT ____, R1.y, R2.x
                                       z: MULLO_UINT ____, R1.y, R2.x
                                       w: MULLO_UINT ____, R1.y, R2.x
                                    5  x: ADD_INT ____, R4.x, R0.w
                                       z: ADD_INT ____, R3.x, PV4.w VEC_120
                                    6  y: ADD_INT ____, R1.x, PV5.z
                                       w: ADD_INT R0.w, R0.x, PV5.x VEC_120
                                    7  x: LSHL ____, PV6.y, 6
                                    8  w: ADD_INT R5.w, R0.w, PV7.x
                                    9  x: MULLO_INT ____, KC0[0].x, PV8.w
                                       y: MULLO_INT ____, KC0[0].x, PV8.w
                                       z: MULLO_INT R0.z, KC0[0].x, PV8.w
                                       w: MULLO_INT ____, KC0[0].x, PV8.w
                                   10  x: ADD_INT ____, PV9.z, 16
                                       y: ADD_INT R0.y, PV9.z, 64
                                       z: ADD_INT ____, PV9.z, 48
                                       w: ADD_INT ____, PV9.z, 32
                                   11  x: ADD_INT ____, R0.z, 80
                                       y: LSHR R1.y, PV10.z, 2
                                       z: LSHR R0.z, PV10.w, 2
                                       w: LSHR R0.w, PV10.x, 2
                                   12  x: LSHR R0.x, R0.y, 2
                                       w: LSHR R1.w, PV11.x, 2
                              03 TEX: ADDR(146) CNT(5)
                                   13  VFETCH R3, R0.w, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                                   14  VFETCH R4, R0.z, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                                   15  VFETCH R2, R1.y, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                                   16  VFETCH R0, R0.x, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                                   17  VFETCH R1, R1.w, fc173 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
                              04 ALU: ADDR(73) CNT(37) KCACHE0(CB0:0-15)
                                   18  x: AND_INT R3.x, R3.z, R4.z
                                       y: AND_INT R3.y, R3.y, R4.y
                                       z: AND_INT R3.z, R3.x, R4.x
                                       w: AND_INT R3.w, R3.w, R4.w
                                   19  x: XOR_INT R4.x, R4.z, PV18.x
                                       y: XOR_INT R4.y, R4.y, PV18.y
                                       z: XOR_INT R4.z, R4.x, PV18.z
                                       w: XOR_INT R4.w, R4.w, PV18.w
                                   20  x: ADD_INT R2.x, R2.w, PV19.w
                                       y: ADD_INT R2.y, R2.y, PV19.y
                                       z: ADD_INT R2.z, R2.x, PV19.z
                                       w: ADD_INT R2.w, R2.z, PV19.x
                                   21  x: OR_INT ____, R0.w, PV20.x
                                       y: OR_INT ____, R0.x, PV20.z
                                       z: OR_INT ____, R0.z, PV20.w
                                       w: OR_INT ____, R0.y, PV20.y
                                   22  x: XOR_INT R0.x, R1.x, PV21.y
                                       y: XOR_INT R0.y, R1.y, PV21.w
                                       z: XOR_INT R0.z, R1.z, PV21.z
                                       w: XOR_INT R0.w, R1.w, PV21.x
                                   23  x: MULLO_INT ____, KC0[0].y, R5.w
                                       y: MULLO_INT R1.y, KC0[0].y, R5.w
                                       z: MULLO_INT ____, KC0[0].y, R5.w
                                       w: MULLO_INT ____, KC0[0].y, R5.w
                                   24  x: AND_INT R7.x, R3.z, R0.x
                                       y: AND_INT R7.y, R3.y, R0.y
                                       z: AND_INT R7.z, R3.x, R0.z
                                       w: AND_INT R7.w, R3.w, R0.w
                                   25  x: LSHR R6.x, R1.y, 2
                                       y: XOR_INT R1.y, R4.y, PV24.y VEC_120
                                       z: XOR_INT R1.z, R4.x, PV24.z
                                       w: XOR_INT R1.w, R4.w, PV24.w
                                   26  x: XOR_INT R1.x, R4.z, R7.x
                                       y: AND_INT R3.y, R4.y, PV25.y
                                       z: AND_INT R3.z, R4.x, PV25.z
                                       w: AND_INT R3.w, R4.w, PV25.w
                              05 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R6], R1, ARRAY_SIZE(4) MARK VPM
                              06 ALU: ADDR(110) CNT(9)
                                   27  x: AND_INT R3.x, R4.z, R1.x
                                       y: OR_INT R4.y, R2.y, R3.y
                                       z: OR_INT R4.z, R2.w, R3.z
                                       w: OR_INT R4.w, R2.x, R3.w
                                   28  x: ADD_INT R1.x, R6.x, 4
                                       y: XOR_INT R5.y, R0.y, PV27.y
                                       z: XOR_INT R5.z, R0.z, PV27.z
                                       w: XOR_INT R5.w, R0.w, PV27.w
                              07 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R1], R3, ARRAY_SIZE(4) MARK VPM
                              08 ALU: ADDR(119) CNT(3)
                                   29  x: OR_INT R4.x, R2.z, R3.x
                                   30  x: ADD_INT R3.x, R6.x, 8
                              09 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R3], R4, ARRAY_SIZE(4) MARK VPM
                              10 ALU: ADDR(122) CNT(3)
                                   31  x: XOR_INT R5.x, R0.x, R4.x
                                   32  x: ADD_INT R4.x, R6.x, 12
                              11 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R4], R5, ARRAY_SIZE(4) MARK VPM
                              12 ALU: ADDR(125) CNT(2)
                                   33  x: ADD_INT R4.x, R6.x, 16
                              13 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R4], R0, ARRAY_SIZE(4) MARK VPM
                              14 ALU: ADDR(127) CNT(2)
                                   34  x: ADD_INT R6.x, R6.x, 24
                              15 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(8)[R6], R7, ARRAY_SIZE(4) MARK VPM
                              16 END
                              END_OF_PROGRAM

                                • IL compiler optimization curiosity
                                  LeeHowes

                                  That last clause you're seeing is a bit of a pain. I think you'll struggle to avoid it, unfortunately. There's probably no way to shift more work before the very first memory access.

                                    • IL compiler optimization curiosity
                                      corry

                                      Lee, you've made my day.  It's nice to know exactly what I'm working with!  I know things will change in the future, and that's ok, FSA is going to seriously rock, so I'm already prepared for some change :) 

                                      Anyhow, I'm already late for getting on the road to Philly (from DC) b/c I stayed about 2 hours late at work!  I seriously think of myself as going to work and playing with toys though, so when I start feeling like I'm getting somewhere, it's hard to tear myself away.  As it is, I'll be thinking about things all weekend... can you believe, I almost can't wait to get back to work hah! (almost....:) )

                                      Thanks again for the explanation.  I can't express how valuable it is to my ability to write for this!  I think the guy on here who was writing an assembler for the 69xx series cards will certainly appreciate it as well.  I'm sure there will be others who will silently appreciate it as well!

                                      Have a good weekend everyone!

                              • IL compiler optimization curiosity
                                LeeHowes

                                The hardware is SIMD and VLIW. The way I picture it is that the Cayman design launches a VLIW packet of 4 16-wide SIMD instructions on each cycle (quarter wavefront pipelining). It's a VLIW packet of vector instructions. The way we represent that programmatically is that each lane of the SIMD unit, each visible IL instance or each work item, depending on how you look at it, is issuing a vector instruction on each cycle.

                                The IL program you see represents one lane of the SIMD vector. It's "scalar" in a sense, though calling it scalar is a bit of an abuse of the term (Fermi is not a scalar architecture but we can call it that if we're flexible with naming). In Cayman each "scalar" IL program is packed into a VLIW ISA program - that VLIW ISA program still represents the same single lane of the SIMD execution. There is NO SIMD4 anywhere in the system. There used to be back in 5xx days (or earlier?) but not now. Now there is a 64-wide vector program where each lane is a 4-wide VLIW execution.
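One way to picture the numbers above is a tiny Python sketch, assuming a 64-wide wavefront issued over a 16-lane SIMD in four quarter-wavefront steps. The mapping of work-item ids to steps is an assumption made for illustration, as is the function name; the only things taken from the description are the 64, the 16, and the fact that every lane runs the same VLIW bundle.

```python
def issue_wavefront(wavefront_size=64, simd_lanes=16):
    """Return, per issue step, the work-item ids that run one VLIW bundle.

    Hypothetical layout: consecutive work-items share a step. Each id
    listed in a step executes the same 4-wide bundle on its own data.
    """
    if wavefront_size % simd_lanes:
        raise ValueError("wavefront size must be a multiple of the lane count")
    return [list(range(start, start + simd_lanes))
            for start in range(0, wavefront_size, simd_lanes)]


steps = issue_wavefront()
# 4 steps of 16 lanes cover the 64-wide wavefront; the x/y/z/w slots of the
# bundle are a separate axis that lives inside each individual lane.
```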

                                IL allows a certain amount of vectorness to come in, it is true, but only in the case of loads and stores is an actual 4-wide operation happening - and that's not really a SIMD op, it's just a 128-bit load from memory. You can get perfectly good VLIW packing from a completely scalar OpenCL input, for example.

                                  • IL compiler optimization curiosity
                                    Raistmer
                                    Originally posted by: LeeHowes


                                    IL allows a certain amount of vectorness come in, it is true, but only in the case of loads and stores is an actual 4-wide operation happening - and that's not really a SIMD op it's just a 128-bit load from memory. You can get perfectly good VLIW packing from a completely scalar OpenCL input, for example.



                                    If one adds (for example) 2 float4 registers like
                                    float4* in;
                                    float4 A=in[0];
                                    float4 B=in[1];
                                    float4 C=A+B;

                                    Why can't the last operator be treated as a SIMD instruction? Probably it will be translated into a single VLIW instruction, yes? So a single instruction operates on 4*2 data inputs and stores 4 results; why isn't that SIMD?
                                    VLIW can perhaps be more complex than SIMD (if it can contain different operations in different VLIW slots), but certainly it can mimic SIMD too, why not?
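One way to look at this question is a short Python sketch, assuming (as stated earlier in the thread) that the compiler sees a float4 add as four per-component scalar ops and then packs whatever independent ops it has into the four slots. All names here are hypothetical, and the bundle check is deliberately conservative: it rejects any op that reads a component written by a different op in the same bundle.

```python
def lower_float4_add(dst, a, b):
    """Split a 4-component add into four scalar ops, one per component."""
    return [("ADD", f"{dst}.{c}", f"{a}.{c}", f"{b}.{c}") for c in "xyzw"]


def fits_one_bundle(ops, width=4):
    """Conservative check that `ops` could share one VLIW bundle."""
    if len(ops) > width:
        return False
    for i, op in enumerate(ops):
        other_dests = {o[1] for j, o in enumerate(ops) if j != i}
        if any(src in other_dests for src in op[2:]):
            return False          # reads something a sibling op writes
    return True


vector_add = lower_float4_add("C", "A", "B")
# Four unrelated scalar ops occupy the slots just as well as a float4 add.
mixed = [("XOR", "r0.x", "r1.x", "r2.x"), ("ADD", "r3.x", "r4.x", "r5.x"),
         ("AND", "r6.x", "r7.x", "r8.x"), ("OR", "r9.x", "r10.x", "r11.x")]
```

In this model the float4 add does end up in a single bundle, but so does the unrelated mix, which is the sense in which the bundle is packing rather than SIMD.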

                                      • IL compiler optimization curiosity
                                        corry

                                         

                                        Originally posted by: Raistmer
                                        Originally posted by: LeeHowes IL allows a certain amount of vectorness come in, it is true, but only in the case of loads and stores is an actual 4-wide operation happening - and that's not really a SIMD op it's just a 128-bit load from memory. You can get perfectly good VLIW packing from a completely scalar OpenCL input, for example.

                                         

                                        If one makes addition (for example) of 2 float4 registers like float4* in; float4 A=in[0]; float4 B=in[1]; float4 C=A+B; Why last operator can't be treated as SIMD instruction? Probably, it will be translated into single VLIW instruction, yes? So, single instruction operates on 4*2 data inputs and stores 4 results, why not SIMD ? VLIW perhaps can be more complex than SIMD (if it can containt different operations in different VLIW slots), but certainly it can mimic SIMD too, why not ?


                                        If I understand Lee correctly, there literally is no SIMD. VLIW4 means they have 4 instruction slots (plus the literals). What I'm still unclear on is how their dependencies work. So if my understanding is correct, the four IL instructions below will compile to the single packed ISA group that follows them...

                                        ixor r32.x, r32.x, r33.x
                                        iadd r23.x, r23.x, r22.x
                                        imul r12.x, r12.x, r11.x
                                        imul_hi r12.x, r12.x, r0.x

                                        ALU: ADDR(42) CNT(42)
                                        1 x: XOR_INT R32.x, R32.z, R33.x
                                        y: ADD_INT R23.x, R23.x, R22.x
                                        z: MULLO_INT R12.x, R12.x, R11.x
                                        w: MULHI_INT R12.x, R12.x, R0.x

                                        and not to....

                                        ALU: ADDR(42) CNT(42)
                                        1 x: XOR_INT R32.x, R32.z, R33.x
                                        2 x: ADD_INT R23.x, R23.x, R22.x
                                        3 x: MULLO_INT R12.x, R12.x, R11.x
                                        4 x: MULHI_INT R12.x, R12.x, R0.x

                                        However, if I changed it to

                                        ixor r32.x, r32.x, r33.x
                                        iadd r23.x, r23.x, r32.x
                                        imul r12.x, r23.x, r23.x
                                        imul_hi r10.x, r23.x, r12.x

                                        I would get something more like...

                                        ALU: ADDR(42) CNT(42)
                                        42 x: XOR_INT R32.x, R32.z, R33.x
                                        43 x: ADD_INT R23.x, R23.x, PV42.x
                                        44 x: MULLO_INT R12.x, R23.x, PV43.x
                                        45 x: MULHI_INT R10.x, R12.x, PV44.x

                                        Yes, I know my numbers are all way off. I just really wanted to work 42 in there somehow, makes the problem self solving! :)

                                        Do I have that right? Basically the question is: can serial instructions with dependencies be generated inside a single VLIW4 word? If I understand correctly, I'd say no. What the VLIW encoding does is replace the out-of-order execution engine found in x86 processors: the compiler finds non-dependent code that can be executed in parallel. SIMD code fits that description, and since graphics (or any 3-D math) works on 4-element vectors, it makes sense that IL would use them and that they would map to the processor well. However, it is not limited to that.
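The packing rule described above can be sketched as a toy in-order packer. This is only a model under a simplifying assumption: an instruction may not join a bundle if it reads or overwrites a register written earlier in that same bundle. The function name `pack_vliw` and the tuple format are hypothetical, and the fourth instruction of the independent example uses different registers than in the post so that it has no dependency on the third. Real hardware and the real compiler have more rules than this.

```python
def pack_vliw(instrs, width=4):
    """Greedily pack (dest, src1, src2) tuples into bundles of at most `width`.

    Assumption: an instruction starts a new bundle if it reads or rewrites
    a register already written by the current bundle.
    """
    bundles = []
    current = []
    written = set()  # registers written by the bundle being built
    for dest, *srcs in instrs:
        depends = dest in written or any(s in written for s in srcs)
        if depends or len(current) == width:
            bundles.append(current)
            current, written = [], set()
        current.append((dest, *srcs))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


# Four instructions with no dependencies between them (hypothetical registers).
independent = [
    ("r32", "r32", "r33"),  # ixor
    ("r23", "r23", "r22"),  # iadd
    ("r12", "r12", "r11"),  # imul
    ("r10", "r9", "r0"),    # imul_hi
]

# The dependent chain from the second example in the post.
chain = [
    ("r32", "r32", "r33"),  # ixor
    ("r23", "r23", "r32"),  # iadd reads the ixor result
    ("r12", "r23", "r23"),  # imul reads the iadd result
    ("r10", "r23", "r12"),  # imul_hi reads the imul result
]

print(len(pack_vliw(independent)))  # 1 bundle
print(len(pack_vliw(chain)))        # 4 bundles
```

In this model the chain can never share a bundle, which matches the one-instruction-per-group ISA listing above; the independent set fills one group.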

                                        One last question remains unanswered, though. In my previous examples, you mentioned the last blocks not fitting groups of 4. OK, but the question still remains: is there a penalty when the compiler breaks up use of the PV register? This stems from the HD69xx ISA manual, which talks somewhat unclearly about the GPR read ports. It almost seems to say that only 3 of the 4 elements of a GPR can be read at once, but this seems illogical. If that were the case, using the PV would, I'd say, be of the utmost importance! My guess is that it's saying 3 GPRs can be read per instruction slot. However, again, this seems illogical to mention, since most instructions, if not all, have fewer than 3 source registers. It shows some diagrams, which I am guessing make a lot more sense to hardware engineers, and perhaps embedded guys. However, from even my lower-than-normal software level, I'm still lost as to the meaning. I know I don't want to mess it up, but I'm not sure how it gets messed up!

                                        Any insights?

                                        Thanks again for the info so far. It really does make a difference, and though I said you made my day, it was more like you made my week. Add this stuff, and I think you'll have made my year! :)

                                          • IL compiler optimization curiosity
                                            LeeHowes

                                             

                                            That C = A + B line could map to SIMD of course, if there were a SIMD unit to map it to. Unfortunately, because of the way the instruction stream works, what you actually have is roughly:

                                            float256 C = A + B

                                            i.e. 64 instances of float4 C = A + B

                                            Each instance will compile down from IL to a VLIW packet such that you have:

                                            c.x64 = a.x64 + b.x64

                                            c.y64 = a.y64 + b.y64

                                            c.z64 = a.z64 + b.z64

                                            c.w64 = a.w64 + b.w64

                                            i.e. four 64-wide SIMD instructions, one in each VLIW slot.
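To make that layout concrete, here is a small Python model of it. It is only an illustration: the constant `WAVEFRONT`, the function `wavefront_add` and the dict-of-lists representation are assumptions made for this sketch, not anything the hardware or compiler exposes. Each of the four slots performs the same add across 64 lanes, one lane per work-item.

```python
WAVEFRONT = 64  # work-items running in lockstep (the "64 instances")


def wavefront_add(a, b):
    """a, b: dicts mapping 'x', 'y', 'z', 'w' to a list of 64 floats,
    one entry per work-item. Returns c laid out the same way."""
    c = {}
    for slot in "xyzw":  # one VLIW slot per float4 component
        # Each slot is one 64-wide operation: same op, 64 lanes of data.
        c[slot] = [a[slot][lane] + b[slot][lane] for lane in range(WAVEFRONT)]
    return c


# Every work-item holds its own float4 A and B.
a = {s: [float(lane) for lane in range(WAVEFRONT)] for s in "xyzw"}
b = {s: [1.0] * WAVEFRONT for s in "xyzw"}
c = wavefront_add(a, b)

# 4 slots * 64 lanes = 256 scalar adds, hence the "float256" shorthand.
total_adds = sum(len(lanes) for lanes in c.values())
```

Seen this way, the SIMD width is the 64 lanes inside each slot, while the four slots are free to hold four unrelated operations.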

                                             

                                            Corry: Yes, some dependencies can be generated in VLIW issues. For example, you can issue a full dot product in a single issue slot. Or you can do two add pairs with multiplications attached, I think. There's a set of possible combinations you can apply. Four elements of a GPR should be readable at once, but not in completely arbitrary access patterns, I think. I'm a bit out of touch with this because I'm working on future architecture support rather than current architecture support :)



                                              • IL compiler optimization curiosity
                                                corry

                                                 

                                                Originally posted by: LeeHowes Corry: Yes, some dependencies can be generated in VLIW issues. For example, you can issue a full dot product in a single issue slot. Or you can do two add pairs with multiplications attached, I think. There's a set of possible combinations you can apply. Four elements of a GPR should be readable at once, but not in completely arbitrary access patterns, I think. I'm a bit out of touch with this because I'm working on future architecture support rather than current architecture support :)


                                                I guess it was the diagram really throwing me off. The text that says, "In hardware, the X, Y, Z, and W elements are stored in separate memories. Each element memory has three read ports per instruction. As a result, an instruction can refer to at most three distinct GPR addresses (after relative addressing is applied) per element" is pretty clear, but trying to match that up to the diagram....ugh...:)

                                                So based on all this, I'm guessing that even though it's not keeping the groups of 4 together, not using the PV "reg" when available, etc., I'm not losing anything there...still "feels" wrong, and I'd love a practical test for proof...maybe some day :)
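One way to read the quoted sentence is as a per-element counting rule, sketched below. This is an interpretation, not something confirmed by the manual's diagram: the names `MAX_PORTS`, `read_port_usage` and `fits_read_ports` are hypothetical, and the model ignores PV, constants, relative addressing and any other detail the hardware applies.

```python
MAX_PORTS = 3  # read ports per element memory, per the quoted manual text


def read_port_usage(sources):
    """sources: (gpr_index, element) pairs read by one instruction group,
    e.g. (12, 'x') for R12.x. Returns the distinct GPR addresses per element."""
    usage = {e: set() for e in "xyzw"}
    for gpr, elem in sources:
        usage[elem].add(gpr)
    return usage


def fits_read_ports(sources):
    """True if no element memory is asked for more than MAX_PORTS addresses."""
    usage = read_port_usage(sources)
    return all(len(addrs) <= MAX_PORTS for addrs in usage.values())


# Reading the same address twice costs only one port.
same_reg = [(1, "x"), (1, "x"), (2, "x"), (3, "x")]

# Four different GPRs, all read through the X element memory.
all_x = [(32, "x"), (33, "x"), (23, "x"), (22, "x")]

# Four different GPRs spread across the four element memories.
spread = [(32, "x"), (33, "y"), (23, "z"), (22, "w")]

print(fits_read_ports(same_reg))  # True
print(fits_read_ports(all_x))     # False
print(fits_read_ports(spread))    # True
```

Under this reading the limit is on distinct GPR addresses per element memory, not on how many of the four elements of one GPR can be read.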