8 Replies Latest reply on Oct 3, 2011 5:32 PM by corry

    register arrays


      I keep having trouble re-finding the sources for all the assumptions I had made, so let me just ask this: is it explicitly allowed to index a GP register, including as a destination, using the a0 register, and do I have to use mova as the docs say? I only just re-found those two pieces of information. By experimentation I had found that something like mov a0, l0 and something like add r[a0.x+r1.x], r5, r6 compiled just fine, but I fear "undefined" results...

      Second, I thought I saw somewhere that indexed temp register arrays aren't actually registers. I also thought I had seen that, along with the dcl statement, a calResAlloc was required, but I can't seem to find that anymore.

      Last, is there a better (faster) way of doing things: r[a0+offset] or x0[offset]? It seems the a0 version would need an additional ALU instruction, but if x0 isn't really a register and is just RAM or something, then I suppose all bets are off!

      Thanks for still answering questions on this stuff!


        • register arrays
          1) No, a0 is not accessible in IL.
          2) Indexed temps can exist either in global memory or in registers, depending on the usage pattern and the size of the array.
          3) if x0[offset] cannot be optimized into registers, then using LDS is your next best bet.
            • register arrays


              Ok, well, then the follow on question...In the CAL 2.0 spec, it says


              Originally posted by AMD_CAL_ProgrammingGuide_v2.0.pdf on page G-5: General-purpose register. GPRs hold vectors of either four 32-bit IEEE floating-point, or four 8-, 16-, or 32-bit signed or unsigned integer or two 64-bit IEEE double precision data components (values). These registers can be indexed, and consist of an on-chip part and an off-chip part, called the “scratch buffer,” in memory.

              and from the 2.3 IL spec dated July 2011,


              AMD_Intermediate_Language_(IL)_Specification_v2.pdf, page 1-3, has the following code:
              ; Set the value of constant register 0 to (1.0,0.0,0.0,0.0)
              def c0 1.0, 0.0, 0.0, 0.0
              ; Set the value of constant register 3 to (4.0,5.0,6.0,1.0)
              def c3 4.0, 5.0, 6.0, 1.0
              ; Move the value of constant register 0 into address register
              mova a0, c0
              ; Use the relative address register to select c3 (a0.x + 2)
              mov r0, c[a0.x + 2]
              This code is equivalent to:
              mov r0, c3
              Once this shader has run, register number 0 of IL_REGTYPE_TEMP contains the
              value (4.0, 5.0, 6.0, 1.0).

              The first quote says GPRs are supposed to be indexable (it also says, which I hadn't noticed before, that they could be put in "scratch memory" as well...yuck), and the second shows a0 being used in IL. That said, I can't find anything that uses r[a0.x], which is what prompted the question.
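
              For what it's worth, here is how I read that spec example, written as a toy Python model rather than anything resembling AMD's implementation. The consts dict, the mova and load_relative helpers, and the round-to-nearest conversion inside mova are all assumptions of this sketch.

```python
# Toy model of the quoted spec example; not AMD's implementation.
# Constant registers are modelled as a dict: register number -> 4-tuple.
consts = {
    0: (1.0, 0.0, 0.0, 0.0),  # def c0 1.0, 0.0, 0.0, 0.0
    3: (4.0, 5.0, 6.0, 1.0),  # def c3 4.0, 5.0, 6.0, 1.0
}


def mova(src):
    """Model mova: turn each float component into an integer address.

    ASSUMPTION: round-to-nearest is used here for illustration only.
    """
    return tuple(int(round(v)) for v in src)


def load_relative(bank, a0, offset):
    """Model c[a0.x + offset]: x component of a0 plus a literal offset."""
    return bank[a0[0] + offset]


a0 = mova(consts[0])               # mova a0, c0
r0 = load_relative(consts, a0, 2)  # mov r0, c[a0.x + 2]
print(a0[0], r0)
```

              With a0.x equal to 1, the index a0.x + 2 lands on c3, which matches the spec's claim that the code is equivalent to mov r0, c3. Nothing in the model says whether the same trick is legal on r registers, which is the actual question.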

              Worst case, I think I can work around it by using a switch statement on the desired register and unrolling the loop....ugly, but it does all I can to ensure things stay in registers...

                • register arrays

                  I'm pretty close to trying this anyways, still hoping for a clarification on that old stuff :)

                    • register arrays

                      well, a0 = {0x0b, 0, 0, 0} and mov r[a0] resulted in r0 getting blown away. 

                      That said, I'm still waiting on an answer based on the documentation, and let me add one more, the part that I said I'd have to search for again....I found it again. 


                      Originally posted by AMD_Intermediate_Language(IL)_Specification_v2.3.pdf page 2-3 Section 2.2.4: thus, the IL now allows an additional modifier (register relative modifier). Two kinds of destinations can be indexed: IL_TEMP_ARRAY and IL_OUTPUT.

                      Interestingly enough, a Google search for IL_TEMP_ARRAY turns up the very same PDF....and nothing else.  It seems IL supports some type of non-scratch register array, but the spec never says a word more about it....I really don't want to do the very large switch statement...

                      Good news is, it doesn't crash...bad news is, I suspect the use of r0 to be undefined, and worse, it's not like it even just indexed from r0 or something...

                      It would be nice to have a definitive answer, even if it were "This is something we were toying around with adding to the spec, but we never got around to implementing it fully", or something.  Why does the documentation suggest in 3 separate places that this should be allowed, when in the end it doesn't seem to be?

                • register arrays
                  a0 is not supported. It was supported on pre-r600 devices, but was removed when indexed temps were introduced. The documentation here is incorrect.
                    • register arrays

                      Well, I think I have finally got to the bottom of this, but the Cayman ISA docs still have at least the mova instructions, and talk about the address register.

                      That said, as I skim these docs, I seem to be getting burned at page breaks...I re-read section 1.4.2 about 100000 more times and saw the restrictions: the base relative addressing source register can only be of type CONST_FLOAT (kind of a roundabout way of saying base relative indexing can only be used on const floats, but oh well). Then, right on the page break: base relative addressing cannot be used on a destination...argh...


                      I'm using between 16 and 20 registers for temp storage that I want to use as an array....I somehow doubt indexed temp is going to be nice and put that in registers for me.  Is it as straightforward as checking the ISA for loads/stores vs register access to see if it did put it in memory?  (I would think so, but, you never know...)

                      I'll also go on record as saying....I want this in APUs :)  It's something I've never had with x86, and now that I've had a taste of what's possible, I want it :)  (I know, I know, wish in one hand....)

                        • register arrays

                          Interesting....I look at the Registers used, and it went up for some reason, converting to X0 only, but so did my anticipated thread throughput???

                          Even more interesting, I look at  the isa generated, and what do you suppose I see?  Perhaps something like the following? :)

                           11 ALU: ADDR(266) CNT(6)
                                   43  x: MOVA_INT    A0.x,  R1.x     
                                   44  x: MOV         R18[A0.x].x,  R0.x     
                                       y: MOV         R18[A0.x].y,  R0.y     
                                       z: MOV         R18[A0.x].z,  R0.z     
                                       w: MOV         R18[A0.x].w,  R0.w     
                                   45  x: ADD_INT     R1.x,  R1.x,  1     

                          Exactly what I was trying to get it to generate with mov a0.x, gpr.x followed by mov r[a0.x]. Ugh....well, at least I know it's using GPRs and not global memory in this case...
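
                          A rough way to automate that check is a small Python sketch that counts ISA lines which index a GPR through A0 versus lines that look like memory traffic. The classify_isa name and the MEMORY_HINTS substrings are my own assumptions, not something taken from the ISA docs, so adjust them to whatever your disassembler actually prints.

```python
import re

# Matches an indexed GPR operand such as R18[A0.x].x in the ISA text.
INDEXED_GPR = re.compile(r"\bR\d+\[A0\.[xyzw]\]")

# ASSUMPTION: substrings that would mark scratch or global memory traffic.
# These are placeholders; replace them with the mnemonics your tool emits.
MEMORY_HINTS = ("SCRATCH", "MEM_")


def classify_isa(isa_text, memory_hints=MEMORY_HINTS):
    """Count lines that index GPRs via A0 and lines that look like memory ops."""
    counts = {"indexed_gpr": 0, "memory": 0}
    for line in isa_text.splitlines():
        if INDEXED_GPR.search(line):
            counts["indexed_gpr"] += 1
        elif any(hint in line.upper() for hint in memory_hints):
            counts["memory"] += 1
    return counts


# Sample lines copied from the ISA listing above.
sample = """\
43  x: MOVA_INT    A0.x,  R1.x
44  x: MOV         R18[A0.x].x,  R0.x
    y: MOV         R18[A0.x].y,  R0.y
45  x: ADD_INT     R1.x,  R1.x,  1
"""
result = classify_isa(sample)
print(result)
```

                          A non-zero indexed_gpr count with a zero memory count is consistent with the array living in GPRs, but it is only a text heuristic and no substitute for reading the ISA.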

                            • register arrays

                              I see no reason to start a new thread about this, though the topic is slightly different.

                              Yes, everything seems to work fine with x0, and, as I said, x0 is in registers rather than in memory for my apparently rather small arrays.

                              That said, I decided to look into optimizing a bit, and since I need some help with the counters, figured I'd use what I have in the KernelAnalyzer.  I figured, I know I'm wasting registers, so I might as well fix that.  As you can imagine, this relates to register arrays.

                              Here's the basic gist of the problem.  I have a temp register array, x0, used for block processing of incoming chunks of data of unknown size (with a maximum before we process).  In my code, I had a particularly wasteful macro that transformed the data in x0[0]-x0[n] into registers r32 through r32+n.  It was wasteful because I don't care about keeping the original incoming data once it's transformed by that macro.  So I changed the call from mcall(0), (r32), (x0[0]) to mcall(0), (x0[0]), (x0[0]).  Suddenly, my register count goes UP...Is this a macro problem (i.e. it has to use an intermediate variable), an IL problem, a compiler problem, or an ISA problem (it has to use a temp)?

                              I'm going to try expanding out this macro, as it's not particularly huge, and I hope that fixes things, but I wanted to get this up so Micah or someone else knowledgeable about this stuff might see it and tell me what I might be doing wrong before I get too far off course! :)



                              I unrolled the macro, and WTF?!  Register count went up AGAIN!  First I say don't use the registers my wasteful code asks for, and it uses more instead of almost half, as it should have. Next I say OK look, *REALLY* don't use any more registers!  And it says Great, I'll use more!

                              Is there some sort of best practice to get this down?  Or am I just going to have to learn to start writing ISA? ;)