9 Replies Latest reply on Jan 4, 2010 4:07 PM by empty_knapsack

    ALU packing (evergreen)

    frankas

      I was pleased to see that the Evergreen instruction set was published with the 2.0 release. But try as I may, I can't find any documents with information on how to optimally pack the ALU instructions.

      For instance, I assume that the cosine instruction can only be issued in the t unit, although this is not stated. What's more, the IL specification talks about the cosine instruction operating on a vector (xyzw of a register), which seems to conflict with the microcode operating on a single 32-bit register.

      The kind of documentation I am looking for would be:

      How many and which instructions can be co-issued in a VLIW.

      Which instructions are only legal in the xyzw units.

      Which instructions are only legal in the t unit.

      Which instructions can be issued to any unit.

       

      In short, the information needed to get fuller utilization of the stream cores in the ALU clauses. Currently my kernels very often use 4 or fewer of the 5 units (80% or less), even when there is no data dependency, and I am trying to understand which changes I can make to get closer to 100% utilization.

       

      Any pointers will be much appreciated.

       

        • ALU packing (evergreen)
          the729

          Hi, frankas

          As far as I know, sine/cosine/exp/log and 32-bit float-to-integer conversion are only legal in the t slot. Normal 32-bit arithmetic operations such as add and mul are legal in any slot. 64-bit operations may take 2 or 4 slots running together.

          As you mentioned, cos_vec needs 4 instructions, all on the t unit. However, the shader compiler that compiles IL to ISA does heavy optimization, trying to utilize as many slots as possible. It can automatically detect data dependencies and reorder instructions, or remove useless instructions.

          You can inspect the ISA code with SKA (Stream KernelAnalyzer) and see which parts of your code waste the most ALU slots.

            • ALU packing (evergreen)
              frankas

               


               

              Thank you, I knew this much. But my question really is where I can find more information on which instructions will fit in a particular slot.

               

            • ALU packing (evergreen)
              MicahVillmow
              frankas,
              I do not believe there is a list anywhere, but you should be able to derive this information from each instruction in the ISA doc. They all specify their constraints. For example,
              "ADD_PREV
              Description Add src0 to the previous channel's result. The previous channel opcode must result in a 32-bit, single-precision floating-point.
              The output modifier and clamping on the w/z slot is not allowed (results are undefined).
              Do not use in w or z channels.
              The previous channel y is the w channel's FP32 result.
              The previous channel x is the z channel's FP32 result.
              dst = src0 + prev_channel_result"

              I think that between the R8XX and the R6XX ISA docs, all the instructions are covered.
                • ALU packing (evergreen)
                  frankas

                  I had another look in the Stream KernelAnalyzer, and it seems that what I observed is a weakness in the IL compiler (v1.3?).

                   

                  For instance:

                  "mov r324, r271 \n"
                  "mov r325, r272 \n"
                  "mov r326, r273 \n"

                  compiles to:

                       47  x: MOV         R3.x,  R80.x     
                           y: MOV         R3.y,  R80.y     
                           z: MOV         R3.z,  R80.z     
                           w: MOV         R3.w,  R80.w     
                       48  x: MOV         R4.x,  R81.x     
                           y: MOV         R4.y,  R81.y     
                           z: MOV         R4.z,  R81.z     
                           w: MOV         R4.w,  R81.w     
                       49  x: MOV         R5.x,  R7.x     
                           y: MOV         R5.y,  R7.y     
                           z: MOV         R5.z,  R7.z     
                           w: MOV         R5.w,  R7.w     

                  Here the compiler seems unable to split or merge the move instructions, and it leaves the "t" slot unused. In other places where there are similar moves, but with swizzles, the t slot is used for moving. The same goes for xor.

                  Example of a fully packed move elsewhere:

                              162  x: MOV         R55.x,  R55.y     
                                   y: MOV         R55.y,  R55.z     
                                   z: MOV         R55.z,  R55.w     
                                   w: MOV         R56.w,  R55.x     
                                   t: MOV         R55.w,  R54.x     
                              163  x: MOV         R54.x,  R54.y     
                                   y: MOV         R54.y,  R54.z     
                                   z: MOV         R54.z,  R54.w     
                                   w: MOV         R54.w,  R53.x     
                                   t: MOV         R53.x,  R53.y     
                              164  x: MOV         R52.x,  R52.y     
                                   y: MOV         R53.y,  R53.z     
                                   z: MOV         R53.z,  R53.w     
                                   w: MOV         R53.w,  R52.x     
                                   t: MOV         R52.y,  R52.z     

                  I will upgrade to 2.0 soon, and hope this has been corrected in the latest version.
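The utilization figure discussed in this thread (4 of 5 slots is 80%) can be read directly off an ISA dump. The following is a minimal sketch in Python, not part of any AMD tool: the function name `slot_utilization` and the regular expression are assumptions made for illustration, and it only understands listings laid out like the two dumps above (a group number on the first slot line, then one `x:`/`y:`/`z:`/`w:`/`t:` line per used slot).

```python
import re

# Matches one slot line such as "47  x: MOV  R3.x,  R80.x".
# The group number is optional because only the first slot line
# of a VLIW group carries it.
SLOT_LINE = re.compile(r"^\s*(\d+)?\s*([xyzwt]):\s+\S+")

SLOTS_PER_GROUP = 5  # x, y, z, w and t


def slot_utilization(listing: str) -> float:
    """Return used slots divided by available slots for an ISA listing."""
    groups = 0
    used = 0
    for line in listing.splitlines():
        match = SLOT_LINE.match(line)
        if match is None:
            continue  # skip blank lines and anything that is not a slot line
        if match.group(1) is not None:
            groups += 1  # a numbered line starts a new VLIW group
        used += 1
    if groups == 0:
        raise ValueError("no instruction groups found")
    return used / (groups * SLOTS_PER_GROUP)


# Two groups copied from the first dump: the t slot is never used.
UNPACKED = """
 47  x: MOV         R3.x,  R80.x
     y: MOV         R3.y,  R80.y
     z: MOV         R3.z,  R80.z
     w: MOV         R3.w,  R80.w
 48  x: MOV         R4.x,  R81.x
     y: MOV         R4.y,  R81.y
     z: MOV         R4.z,  R81.z
     w: MOV         R4.w,  R81.w
"""

# One group copied from the second dump: all five slots are used.
PACKED = """
162  x: MOV         R55.x,  R55.y
     y: MOV         R55.y,  R55.z
     z: MOV         R55.z,  R55.w
     w: MOV         R56.w,  R55.x
     t: MOV         R55.w,  R54.x
"""

if __name__ == "__main__":
    print(f"unpacked: {slot_utilization(UNPACKED):.0%}")  # 8 of 10 slots
    print(f"packed:   {slot_utilization(PACKED):.0%}")    # 5 of 5 slots
```

Running this over a whole kernel dump gives a single number to compare before and after a compiler upgrade or a source change; it says nothing about why a slot stayed empty.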

                   

                   

                  • ALU packing (evergreen)
                    Gipsel

                     

                    Originally posted by: MicahVillmow frankas, I do not believe there is a list anywhere, but you should be able to derive this information from each instruction in the ISA doc.


                    Actually, there is such a list. The earlier R600 (and maybe also the R7xx) documentation had separate listings of the instructions for the xyzw ALUs only, for xyzwt, and for t only.

                    In the current ISA reference guide for Evergreen GPUs it is organized a bit differently, but one finds this information in chapter 2.2 starting on page 2-27 for instructions with up to two source operands:

                     

                    Opcodes 0..95 can be used in either the Vector or Trans unit. Opcodes 128..159 are Trans only. Opcodes 160..255 are vector only.


                    with an opcode list following that sentence. For instructions with three source operands, the list starts on page 2-34.

                    By the way, there are some nice improvements to the instruction set in Evergreen, which can reduce the number of instructions needed or increase the ILP for certain tasks. But I have already found a glitch in the IL documentation for the bitalign instruction:
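The quoted rule for two-source-operand instructions is simple enough to encode as a lookup. This is only a sketch of that one sentence: the function name `unit_for_opcode` and the returned strings are made up for illustration, and opcodes outside the three quoted ranges (96..127, for example) are reported as not covered rather than guessed at.

```python
def unit_for_opcode(opcode: int) -> str:
    """Classify a two-source-operand ALU opcode by the quoted ranges.

    Only the three ranges from the quoted sentence are encoded; every
    other value is reported as not covered.
    """
    if 0 <= opcode <= 95:
        return "vector or trans"
    if 128 <= opcode <= 159:
        return "trans only"
    if 160 <= opcode <= 255:
        return "vector only"
    return "not covered by the quoted sentence"


if __name__ == "__main__":
    for opcode in (0, 95, 100, 128, 159, 160, 255):
        print(opcode, unit_for_opcode(opcode))
```

The opcode numbers themselves still have to be taken from the opcode list in the ISA guide; nothing here maps instruction names to numbers.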

                     

                    bitalign dst, src0, src1, src2

                    Description

                    Aligns bit data for video. This is a special instruction for multi-media video.

                    dst = (src0 << src2.x) || (src1 >> (32-src2.x))

                    src2.x must be 0, 8,16, 24, or 32.

                     

                    The last sentence is not true. It works with arbitrary values (at least, I have successfully tested it with 20). The ISA documentation also does not mention such a limitation. If it were true, the instruction would be redundant anyway, as bytealign should be enough in that case. Also, the description should use a single "|" for the bitwise OR, not "||" for the logical one.



                      • ALU packing (evergreen)
                        empty_knapsack

                        Yeah, I'm also curious about the bitalign documentation. I've tested it with almost every value from 1 to 31 and it works perfectly. One bitalign replaces 3 other instructions when doing 32-bit cyclic rotations, which is quite welcome in cryptography.
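To make the rotation trick concrete, here is a small Python model of the bitalign description as quoted earlier in the thread, with "|" read as the bitwise OR. It is a sketch of that quoted formula only: the names `bitalign_model` and `rotl32_three_ops` are hypothetical, operands are assumed to be unsigned 32-bit values, and whether the hardware shifts in the same direction as the quoted text is not verified here.

```python
MASK32 = 0xFFFFFFFF


def bitalign_model(src0: int, src1: int, shift: int) -> int:
    """Model dst = (src0 << shift) | (src1 >> (32 - shift)) on 32 bits.

    This follows the description quoted in the thread, not a verified
    hardware specification.
    """
    if not 0 <= shift <= 32:
        raise ValueError("shift must be in 0..32")
    return ((src0 << shift) | (src1 >> (32 - shift))) & MASK32


def rotl32_three_ops(x: int, shift: int) -> int:
    """The same rotation written as the three operations it replaces."""
    left = (x << shift) & MASK32   # shift left
    right = x >> (32 - shift)      # shift right
    return left | right            # combine


if __name__ == "__main__":
    x = 0x12345678
    # Passing the same register as both sources turns the alignment
    # into a cyclic rotation.
    print(hex(bitalign_model(x, x, 20)))  # 0x67812345
    assert all(
        bitalign_model(x, x, s) == rotl32_three_ops(x, s)
        for s in range(1, 32)
    )
```

The point of the comparison is the instruction count: the three-step version needs two shifts and an OR, while the single-instruction form takes the same register twice plus the shift amount.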

                          • ALU packing (evergreen)
                            frankas

                             


                             

                             

                            I also noticed that there is an "NSA instruction" (bit population count) which is extremely useful in cryptography.

                             

                              • ALU packing (evergreen)
                                Gipsel

                                But am I cursed with temporary blindness, or are the IL instructions for doing 24-bit integer multiplies missing? One finds them in the ISA documentation but nowhere in the IL docs. What's up with them?

                                  • ALU packing (evergreen)
                                    empty_knapsack

                                     


                                     

                                    I searched the whole IL v2 document for "24" and also found nothing about muladd_uint24, mul_int24, or anything similar.