5 Replies Latest reply on May 9, 2011 2:36 PM by katayama

    Small ALU clause was generated between ALU clauses.


      When I wrote a simple crypto cracker as OpenCL practice, I found that a very small ALU clause and some MOV instructions were generated.

      Why was clause '07' split from the next clause?

      (I'm using the -cl-opt-disable compiler option because without it the kernel spills registers to scratch.)

      Radeon HD 6870, Catalyst 11.4 and APP SDK 2.4.

      06 ALU: ADDR(692) CNT(127)
      (snip)
          108  x: LSHL        R0.x,  PV107.z,  (0x00000002, 2.802596929e-45f).x
               y: ADD_INT     R4.y,  PV107.x,  PS107
               z: LSHR        R2.z,  PV107.z,  (0x00000006, 8.407790786e-45f).y
               w: XOR_INT     R3.w,  T3.y,  PV107.y
               t: AND_INT     R4.z,  T3.w,  (0x7F7F7F7F, 3.396151365e38f).z
      07 ALU: ADDR(819) CNT(11)
          109  x: MOV         R0.x,  R0.x
               z: MOV         R4.z,  R4.z
               w: MOV         R3.w,  R3.w
               t: MOV         R2.z,  R2.z
          110  x: ADD_INT     R0.x,  R5.w,  PV109.z
               y: LSHL        R6.y,  PV109.w,  (0x00000002, 2.802596929e-45f).x
               z: AND_INT     R4.z,  PV109.x,  (0xFCFCFCFC, -1.050871953e37f).y
               w: LSHR        R3.w,  PV109.w,  (0x00000006, 8.407790786e-45f).z
               t: AND_INT     R2.z,  PS109,  (0x03030303, 3.850089727e-37f).w
      08 ALU: ADDR(830) CNT(122)
          111  x: OR_INT      R39.x,  R4.z,  R2.z
               y: AND_INT     ____,  R3.w,  (0x03030303, 3.850089727e-37f).x
               z: ADD_INT     ____,  R4.y,  (0x01010101, 2.369427828e-38f).y
               w: AND_INT     ____,  R6.y,  (0xFCFCFCFC, -1.050871953e37f).z      VEC_120
               t: ADD_INT     ____,  R0.x,  (0x01010101, 2.369427828e-38f).y

        • Small ALU clause was generated between ALU clauses.

          The maximum size of an ALU clause is 128 instructions.

            • Small ALU clause was generated between ALU clauses.

              Or less, if there are literals in the clause.

              A pair of literals consumes 64 bits. Each instruction word (bundle) is built from 64-bit codes, and the literals (up to four per bundle) occupy the optional 6th and 7th 64-bit codes on VLIW-5 chips.

              On VLIW-4 chips they would be optional 5th and 6th 64-bit codes.

              So, for example, bundle 110 contains five opcodes and 4 literals. The four literals are coded as two 64-bit codes, making instruction 110 consist of seven 64-bit codes.
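The literal packing for bundle 110 can be sketched in a few lines of Python. This is only an illustration: the helper names `as_float` and `pack_literals` are made up for this example, and placing the first literal of each pair in the low 32 bits is an assumption, not something the dump shows. The sketch also shows why each literal in the dump appears twice: the second value is the same 32 bits read as a float.

```python
import struct

def as_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE-754 single, as the dump prints it."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def pack_literals(literals):
    """Pack 32-bit literals pairwise into 64-bit codes.

    Assumption for illustration: the first literal of each pair
    occupies the low 32 bits of the 64-bit code.
    """
    padded = list(literals) + [0] * (len(literals) % 2)
    return [padded[i] | (padded[i + 1] << 32) for i in range(0, len(padded), 2)]

# Literals of bundle 110 in .x, .y, .z, .w order, copied from the dump.
bundle_110_literals = [0x00000002, 0xFCFCFCFC, 0x00000006, 0x03030303]

literal_codes = pack_literals(bundle_110_literals)
total_codes = 5 + len(literal_codes)  # five opcodes plus the literal codes

for bits in bundle_110_literals:
    print(f"0x{bits:08X} -> {as_float(bits)!r}")
print(len(literal_codes), "literal codes,", total_codes, "codes in bundle 110")
```

Four literals give two literal codes, so the bundle occupies seven 64-bit codes in total.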

              So a clause can hold at most 128 64-bit codes, or 32 bundles (VLIW instructions).

              Basically the VLIW format is variable in length, from one 64-bit code up to seven.

              For Cayman, the newest chip, the VLIW architecture is 4 slots per instruction, so there's a maximum of six 64-bit codes in the case of 4 literals.

              The clause's CNT number in brackets tells you how many 64-bit codes there are.
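As a check of the counting rule against the dump in the question, here is a minimal Python sketch that recomputes CNT for clause 07. The function name `bundle_codes` and the dictionary layout are made up for this example; the per-bundle opcode and literal counts are read off the listing. It also treats ADDR as being counted in the same 64-bit units, which is an inference from the numbers lining up rather than a documented fact.

```python
def bundle_codes(n_ops, n_literals):
    """64-bit codes in one bundle: one per opcode, one per pair of literals."""
    return n_ops + (n_literals + 1) // 2

# Clause 07 from the dump: bundle number -> (opcodes, literals).
clause_07 = {
    109: (4, 0),  # four MOVs, no literals
    110: (5, 4),  # five opcodes, four literals
}

cnt_07 = sum(bundle_codes(ops, lits) for ops, lits in clause_07.values())

# Clause 06 is ADDR(692) CNT(127) in the dump.
addr_06, cnt_06 = 692, 127
addr_07 = addr_06 + cnt_06
addr_08 = addr_07 + cnt_07

print(cnt_07, addr_07, addr_08)  # prints: 11 819 830
```

The result matches the `CNT(11)`, `ADDR(819)` and `ADDR(830)` values printed in the listing.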

                • Small ALU clause was generated between ALU clauses.

                  Thank you for explaining the bundle format.

                  I think I should try allocating the remaining GPRs to constants so that more instructions fit in each clause.

                    • Small ALU clause was generated between ALU clauses.

                      Regarding the question of why clause 07 is "short" and 08 is "long": one possible explanation is that 08 is the target of a JUMP instruction somewhere.

                      A JUMP always goes to the start of a clause.

                      So, for example, the end of clause 04 might evaluate a conditional expression. Clause 05 would consist of a JUMP instruction, which would skip over clauses 06 and 07. This is only a guess, but it is one explanation for the long-short-long construction of these 3 clauses.

                      Normally you'd see these clauses in long-long-short layout.


                      Another possibility has to do with the use of clause-temporary registers. In instruction 108 you can see a clause-temporary called T3.

                      Clause-temporary registers have a scope of a single clause. So T3 belongs to clause 06 and the value is lost once clause 06 has completed. T3 can be used again in clauses 07 and 08, but it initially has an undefined value in each of those clauses.

                      It's possible that the compiler has used clause temporaries in 08. To manage the scope (which is defined by the clause as a maximum of 128 64-bit codes or 32 bundles) the compiler had to choose a place to split the 133 64-bit codes (11 + 122 codes in clauses 07 and 08). The split is "early" and my guess is that T0, T1, T2 and T3 GPRs (or a subset, e.g. T0 and T1) are used in clause 08.

                      So, my second guess is that the compiler has chosen to use clause-temporary registers for some of the instructions in 08. The scope rule means that the compiler had to make a choice about the split point for 07 and 08. In this case the split is early.
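The arithmetic behind the forced split can be written out as a short Python sketch. It only considers the 128-code limit and ignores bundle boundaries and the bundle-count limit, since the full listing was snipped; the variable names are made up for this example.

```python
MAX_CODES = 128  # clause limit in 64-bit codes, as described in this thread

cnt_07, cnt_08 = 11, 122           # CNT values from the dump
total = cnt_07 + cnt_08            # codes that must be spread over two clauses

fits_in_one = total <= MAX_CODES   # can 07 and 08 be merged?

# If the first clause holds `first` codes, the second holds `total - first`.
# Both must stay within MAX_CODES, which bounds the possible split points.
smallest_first = total - MAX_CODES
largest_first = MAX_CODES

print(total, fits_in_one, smallest_first, largest_first)
```

With these numbers the two clauses cannot be merged, and the first clause could in principle be anywhere from 5 to 128 codes long. The 11-code clause 07 sits near the early end of that range, which is the "early" split described above.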