10 Replies Latest reply on Aug 11, 2012 11:08 AM by realhet

    Subroutines on the 7970 ISA


      Hi again,


      I'm trying to make subroutines in ISA. Right now I've managed to do it with a one-level stack, this way:



        @subRoutine:
          v_mov_b32 v1, 128  //do something

          s_setpc_b64 s[32:33]  //return to caller



        s_getpc_b64 s[32:33]

        s_add_u32 s32,s32,12

        s_addc_u32 s33,s33,0  //calculate return address

        s_branch @subRoutine  //call the subroutine



      It works well, but only with one level of nesting. I've figured out that I can expand it into a small stack using s_movrel or LDS.
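The return-address arithmetic in the call sequence above can be checked with a small Python model. This is only a sketch: it assumes s_getpc_b64 yields the address of the instruction that follows it, and that the three instructions after it (s_add_u32, s_addc_u32, s_branch) are 4 bytes each, which is where the constant 12 comes from. The function names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def add_u32(a, b):
    """Model of s_add_u32: 32-bit add, returns (result, carry-out)."""
    total = a + b
    return total & MASK32, total >> 32

def addc_u32(a, b, carry_in):
    """Model of s_addc_u32: 32-bit add with carry-in, returns (result, carry-out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32

def return_address(pc_lo, pc_hi, offset=12):
    """Add a byte offset to a 64-bit PC held as two 32-bit halves."""
    lo, carry = add_u32(pc_lo, offset)
    hi, _ = addc_u32(pc_hi, 0, carry)   # propagate the carry into the high dword
    return lo, hi

if __name__ == "__main__":
    lo, hi = return_address(0xFFFFFFF8, 0x1)
    print(hex(lo), hex(hi))
```

The second add matters only when the low dword wraps, but without it the return address would be wrong in exactly that case.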

      Then I've found a thing called branch-stack along with the instructions:

      s_cbranch_i_fork cond, addr

      s_cbranch_join s0   // s0: Saved CSP value.

      But it crashed (as expected), since I have absolutely no idea what that 'CSP value' is.

      If anyone can explain how to use this branch-stack the elegant way, I'd be really thankful.

        • Re: Subroutines on the 7970 ISA

          Hi realhet,


          I've used fork and join to split thread execution up to 2 levels. It can be used as a call, but I'm not sure it helps much. The problem is that the join instruction must reference a stack location, i.e., it's not a true return statement. WARNING: hack zone...


          1. The 'stack' is using SGPR space, not memory. When execution starts, the stack pointer points to SGPR0 (s0).

          There are two settings to set the stack, which are most likely in one of the hardware registers.

             sq_wave_mode_csp_offset Conditional-branch Stack Pointer

            sq_wave_mode_csp_size   Limit to stack size(?)

          This example uses CSP=0


          2. Each stack entry uses 4 sgprs, for CSP=0 they are:

              s[0:1] = 64 bit mask of threads NOT forked

              s[2:3] = 64 bit physical address of fork instruction + 4

              The stack grows by 4 sgprs after the fork instruction


          3. The fork instruction does the following (well, seems to...)

            a. writes s[SP:SP+1] with bits for the currently active threads that are NOT forked.

            b. writes s[SP+2:SP+3] with the 64b return address just beyond the fork instruction

            c. halts execution of threads that are not forked.

            d. increments the stack pointer by 4.

            e. branches to PC + 16-bit immediate offset.


          The following code implements a one level fork and join but the same ideas work with more levels. v6 traces where the threads go. In this example, only thread id=1 forks.


          Note that just before s_cbranch_join, the exec mask is or'ed to the CSP value to activate all threads. Without the OR, only the un-forked threads (saved on the stack) will be active, the forked threads will thus stop. This provides flexibility but might make call/return messy.


              v_mov_b32        v6, 1               //VGPR to trace threads' paths, start=1

              s_movk_i32       s8, 2               //fork mask low-32, thread 1=on only

              s_movk_i32       s9, 0               //fork mask hi-32, all threads off

              s_cbranch_i_fork s[8:9], label_fork  //fork thread 1, others halt

              v_or_b32         v6, 2, v6           //RETURN POINT for fork, trace|=2

              s_branch         label_end           //all threads come here and go to 'end'



          label_fork:
              v_or_b32         v6, 4, v6           //trace|=4

              s_or_b64         s0, exec, s0        //or active threads to threadmask on stack

              s_cbranch_join   s0                  //fork threads branch to RETURN POINT



          label_end:
              tbuffer_store_format_xy v[5:6], out, s[4:7], TFORM_XY  //write some results



          The s_cbranch_join instruction simply restores the environment for the halted threads that didn't fork. It's up to the program to define other functionality through the exec mask. Program output is:


          Thread 0        3

          Thread 1        7   (forked thread)

          Thread 2..63    3
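The trace values above can be reproduced with a small Python model of the behaviour described in this post. It is a sketch of the observed behaviour (steps a to e plus the OR before the join), not of the official instruction semantics; the function name and the `or_before_join` switch are made up for illustration.

```python
ALL_THREADS = (1 << 64) - 1

def simulate_fork_example(fork_mask, or_before_join=True, exec_mask=ALL_THREADS):
    """Return the per-thread v6 trace values for the one-level fork/join example."""
    v6 = [1] * 64                      # v_mov_b32 v6, 1

    def trace(bit, mask):
        for t in range(64):
            if (mask >> t) & 1:
                v6[t] |= bit

    # s_cbranch_i_fork: save the non-forked threads, keep only the forked ones active
    stack_mask = exec_mask & ~fork_mask
    exec_mask &= fork_mask

    # label_fork: only the forked threads run this
    trace(4, exec_mask)
    if or_before_join:
        stack_mask |= exec_mask        # s_or_b64 s0, exec, s0

    # s_cbranch_join: exec comes back from the stack entry, PC goes to the return point
    exec_mask = stack_mask

    # RETURN POINT: every thread that is active again runs this
    trace(2, exec_mask)
    return v6

if __name__ == "__main__":
    result = simulate_fork_example(fork_mask=0b10)
    print(result[0], result[1], result[2])
```

With `or_before_join=False` the forked thread never reaches the return point, which matches the note that the forked threads stop unless exec is OR'ed into the saved mask.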



            • Re: Subroutines on the 7970 ISA

              Thanks for the accurate info!

              Your example worked well, then I played with it:


              1. If I put -1 into s9

                 s_mov_b32        s9, 0xffffffff

              then the results are: threads 1 and 32..63 -> undefined (mainly zeroes; threads 0..8 show 0..8, which I never write into v6, whether or not I zero it out beforehand)


              2. If I put exec into s[8:9] (which is $FFFFFFFFFFFFFFFF), it runs perfectly the first time (all threads are 7), then it crashes on the second run. The same happens when I manually set all the bits to 1, so something goes wrong when all the threads fork.


              3. All threads forking except the first one: thread 0 -> correct (3), all other threads are garbage.


              4. Let's test with forkmask.hi=0, where it produces the expected result, and move things from before the s_branch to after label_end:

              - First put that "//RETURN POINT for fork, trace|=2"  -> all OK

              - Then put "s_cbranch_i_fork s[8:9], label_fork" right after label_end -> CRASH (which is weird, because the program flow was not altered, only the position of the jump was shifted)

              Because of that I think it's not a push-and-jump thing, but something like predication on the VLIW hardware.


              This fork and join thing must be something that was invented for very special tasks in geometry/hull/domain shaders.

              Now I know that branch-stack = SRegs, so s_movrel and set/get/swappc will do fine and at least that way I'll know what I'm doing.


              Thank You for the reply!


              Edit: The 'garbage' I mentioned was actually the initial, untouched data in the UAV, so maybe s_cbranch_join did not restore the exec mask and tbuffer_store was inactive on some threads.

              • Re: Subroutines on the 7970 ISA

                I've just discovered s_movrel (the mini-help in the .dll is a bit misleading, though)


                Tried to write this simple thing:


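                int uav[20], dstIdx=0, maxIdx=20;

                void fibonacci(int a, int b)
                {
                  uav[dstIdx++] = a+b;   //assumed body: store the result, as the ISA version below does
                  if(dstIdx<maxIdx) fibonacci(b,a+b);
                }

                int main() {fibonacci(1,1); return 0;}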


                ..using the S Alu of the Tahiti chip, and it worked  

                The stack pointer is the "m0" register, initialized to 104.

                Every call pushes 2 parameters and a 64-bit return address. (By the way, on a 3GB card it's not necessary to push the high dword of addresses; I think those are absolute addresses.)



                  //entry code

                  s_sub_u32 m0, m0, 2 \ s_movreld_b64 s0, s0      //push return addr s[0:1]

                  s_movrels_b32 s0, s3                             //get 1st param //s0=ret_addr

                  s_movrels_b32 s1, s2                             //get 2nd param

                  //do fibonacci  //^^ those are just indices, not the actual contents of SRegs

                  s_add_i32     s2, s0, s1                           

                  //write the result

                  v_writelane_b32 v1,s2,0

                  tbuffer_store_format_x v1, v0, uav, 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]

                  v_add_i32     v0, vcc, 4, v0                          //increment dst offset

                  v_cmp_le_u32  vcc, maxAddr, v0                   //limit recursion

                  s_cbranch_vccnz @nomore

                      s_sub_u32 m0,m0,1 \ s_movreld_b32 s0, s1       //push 1st param

                      s_sub_u32 m0,m0,1 \ s_movreld_b32 s0, s2       //push 2nd param

                      s_swappc_b64 s[0:1], Fibonacci                           //call recursive


                @nomore:
                  s_movrels_b64 s0,s0 \ s_add_u32 m0,m0,2         //pop return addr

                  s_add_u32 m0,m0,2                               //clear parameters

                  s_setpc_b64 s[0:1]                              //ret

                //results: 2,       3,       5,       8,      13,      21, ...
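To make the stack discipline easier to follow, here is a rough Python model of the same scheme: a list stands in for the SGPR file, `m0` is the downward-growing stack pointer, and each call frame is 4 dwords (2 parameters plus a 64-bit return address). The return-address tokens and the function name are hypothetical stand-ins, since real addresses don't exist in this model.

```python
def fib_with_sgpr_stack(max_idx=20, stack_top=104):
    """Model the m0-indexed SGPR stack; returns (written values, final m0)."""
    sgpr = [0] * stack_top        # SGPR file doubling as stack space
    m0 = stack_top                # stack pointer, grows downward
    uav = []                      # stands in for the output buffer
    RET_MAIN, RET_FIB = 0, 1      # hypothetical tokens for return addresses

    def push(value):
        nonlocal m0
        if m0 == 0:
            raise OverflowError("SGPR stack exhausted")
        m0 -= 1
        sgpr[m0] = value

    # main: push both parameters, then call
    push(1)
    push(1)
    ret = RET_MAIN
    while True:
        # entry code: push the 64-bit return address as two dwords
        push(0)
        push(ret)
        a, b = sgpr[m0 + 3], sgpr[m0 + 2]   # 1st and 2nd parameter
        s = a + b
        uav.append(s)                       # write the result
        if len(uav) >= max_idx:             # limit recursion
            break
        push(b)                             # push 1st param
        push(s)                             # push 2nd param
        ret = RET_FIB                       # recursive call

    # @nomore: every frame unwinds through the same code
    while True:
        ret = sgpr[m0]
        m0 += 2                             # pop return addr
        m0 += 2                             # clear parameters
        if ret == RET_MAIN:
            return uav, m0

if __name__ == "__main__":
    values, final_m0 = fib_with_sgpr_stack()
    print(values[:6], final_m0)
```

With 4 dwords per frame and m0 starting at 104, the model has room for 26 nested calls, so the 20 levels used here fit.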


                Every building block is present to make a fully functional C compiler for the S-ALU. That's the real 'General Purpose'



                  • Re: Subroutines on the 7970 ISA

                    Now, that's nice, fully recursive using 20 lines of assembly code..


                    Using the S-ALU, maybe full computers can be made without Intel cores.

                    Tahiti has 32 S-ALUs, should be enough.

                      • Re: Subroutines on the 7970 ISA

                        Yeah, I was thinking the same. Imagine a slow 4-core x86 at 1GHz with (more or less) 64-bit arithmetic and 8GB/sec memory bandwidth, but with 2048-bit SSE! Plus you can combine a legacy instruction with an SSE instruction in a single cycle (with some code size restrictions, but those exist on x86 too).

                        Put 32 of it on a single chip, and there goes the 7970.


                        Poor x86 has to emulate every old deprecated thing since the 1980's and it became so complicated that it's basically a microprocessor inside a microprocessor. But the new GCN instruction set was developed from scratch and thus producing raw power with minimal complications, I like it haha

                          • Re: Subroutines on the 7970 ISA

                            realhet wrote

                            Poor x86 has to emulate every old deprecated thing since the 1980's and it became so complicated that it's basically a microprocessor inside a microprocessor. But the new GCN instruction set was developed from scratch and thus producing raw power with minimal complications, I like it haha


                            Yes, and x86 still has the cl register from the old traffic light controller days.

                            GCN has a lot of clever features, so perhaps AMD will conquer the world after all!

                      • Re: Subroutines on the 7970 ISA

                        Hello again,


                        Now that the new architecture manual is out, it has become clear exactly how fork and join work. (Thanks to AMD for the manual!)

                        So it's basically an IF-THEN-ELSE block with a twist: when the threads diverge, the hardware decides which side runs first. As far as I can tell from the manual, the path with fewer active threads is executed first, which keeps the stack shallow. I guess it's important when workgroupsize>64, but when workgroupsize=64 I don't know why this is good.


                            v_mov_b32       v6, 0                          //VGPR to trace threads' paths, start=0

                            s_mov_b32       s8, $ffff0000                  //fork mask low-32

                            s_mov_b32       s9, $0ffffff0                  //fork mask hi-32


                            s_mov_b64       s62, exec                      //save exec

                            s_getreg_b32    s64, hwreg(HW_REG_STATUS,29,3) //Save CSP (Conditional Stack Ptr)

                            s_cbranch_i_fork s[8:9], label_fork            //fork threads, others halt

                              v_or_b32      v6, 1, v6    //FALSE path

                            s_cbranch_join  s64                            //this will call label_fork: when needed

                            s_branch label_end                             //otherwise end of fork


                            label_fork:
                              v_or_b32      v6, 8, v6    //TRUE path

                            s_cbranch_join  s64                            //this will call the first fork if needed


                            label_end:
                            s_mov_b64       exec, s62                      //restore exec after fork


                        So one fork needs 2 joins. Additional things needed: save/restore exec, and read the CSP from the HWRegs. You have to save CSP only once (even for a 6-level nested thing), and save/restore exec for every fork/join block (I guess).
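Why one fork needs two joins becomes clearer with a small Python model of the control flow. The semantics below are my reading of the manual's pseudocode and should be treated as an assumption: fork splits exec into a pass and a fail mask, runs the side with fewer active threads first and pushes the other side; join pops a pending entry unless CSP already equals the saved value. The program counter values and the function name are made up for the model.

```python
ALL_THREADS = (1 << 64) - 1

def popcount(x):
    return bin(x).count("1")

def run_fork_join(fork_mask, exec_mask=ALL_THREADS):
    """Return (v6 per thread, order in which the two paths ran)."""
    v6 = [0] * 64
    stack = []                    # branch stack; len(stack) plays the role of CSP
    saved_csp = len(stack)        # what s_getreg_b32 would have saved
    order = []
    pc = 0                        # 0 fork, 1 FALSE body, 2 join, 3 branch, 4 TRUE body, 5 join, 6 end
    while pc != 6:
        if pc == 0:
            mask_pass = fork_mask & exec_mask
            mask_fail = ~fork_mask & exec_mask
            if mask_pass == exec_mask:
                pc = 4                              # everyone forks
            elif mask_fail == exec_mask:
                pc = 1                              # nobody forks
            elif popcount(mask_fail) < popcount(mask_pass):
                stack.append((4, mask_pass))        # TRUE side waits
                exec_mask, pc = mask_fail, 1
            else:
                stack.append((1, mask_fail))        # FALSE side waits
                exec_mask, pc = mask_pass, 4
        elif pc in (1, 4):
            bit, name = (1, "false") if pc == 1 else (8, "true")
            for t in range(64):
                if (exec_mask >> t) & 1:
                    v6[t] |= bit
            order.append(name)
            pc += 1
        elif pc in (2, 5):
            if len(stack) == saved_csp:
                pc += 1                             # nothing pending: fall through
            else:
                pc, exec_mask = stack.pop()         # run the side that was waiting
        else:                                       # pc == 3: s_branch label_end
            pc = 6
    return v6, order

if __name__ == "__main__":
    mask = (0x0FFFFFF0 << 32) | 0xFFFF0000
    v6, order = run_fork_join(mask)
    print(order, v6[0], v6[16])
```

Whichever side runs first, its join hands control to the other side, and the second join finds the stack back at the saved CSP and falls through. That is the reason both bodies end with s_cbranch_join.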


                        The classic equivalent for this (without automatic nesting(based on a saved CSP value) and 'thread sorting') would be:


                            v_mov_b32       v6, 0                          //VGPR to trace threads' paths, start=0

                            s_mov_b32       s8, $ffff0000                  //fork mask low-32

                            s_mov_b32       s9, $0ffffff0                  //fork mask hi-32


                            s_and_saveexec_b64  s62, s8      //note: scc0 == execz after _saveexec

                              s_cbranch_scc0      @else

                                  v_or_b32          v6, 8, v6    //TRUE path


                            @else:
                              s_andn2_b64         exec, s62, s8

                              s_cbranch_scc0      @end

                                  v_or_b32          v6, 1, v6    //FALSE path


                            @end:
                              s_mov_b64           exec,s62



                        I think fork/join is at its best when nested. Without nesting, _saveexec and andn2 are really cool alternatives.
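For comparison, the exec-mask version of the same IF-THEN-ELSE can be modelled in a few lines of Python. This is a sketch with a made-up function name; the comments name the ISA instruction each step is meant to mirror.

```python
ALL_THREADS = (1 << 64) - 1

def if_else_with_exec(cond_mask, exec_mask=ALL_THREADS):
    """Return (v6 per thread, exec after the block)."""
    v6 = [0] * 64

    def trace(bit, mask):
        for t in range(64):
            if (mask >> t) & 1:
                v6[t] |= bit

    saved = exec_mask                   # s_and_saveexec_b64: save exec ...
    exec_mask = saved & cond_mask       # ... and keep only the TRUE threads
    if exec_mask:                       # skip the body when no thread is active
        trace(8, exec_mask)             # TRUE path

    exec_mask = saved & ~cond_mask      # s_andn2_b64: the remaining threads
    if exec_mask:
        trace(1, exec_mask)             # FALSE path

    exec_mask = saved                   # s_mov_b64 exec, saved
    return v6, exec_mask

if __name__ == "__main__":
    mask = (0x0FFFFFF0 << 32) | 0xFFFF0000
    v6, final_exec = if_else_with_exec(mask)
    print(v6[0], v6[16], hex(final_exec))
```

Here the TRUE side always runs first and nesting needs one saved exec pair per level, which is the bookkeeping that fork/join takes over.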


                        ps: Just discovered that hwreg(a,b,c) indexing thing; it's so clever to combine hardware register access with shifting and masking

                        ps2: Just finished reading the manual: Check "lds_direct" in it, we'd never figure that out without the manual haha! Also there are the specifications for the buffer resource descriptors and the sampler descriptors.
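As a footnote to the hwreg(a,b,c) remark: the shift-and-mask part of hwreg(reg, offset, size) can be written out in Python. This is a sketch that assumes the read returns `size` bits starting at bit `offset` of the register value; the function name is made up.

```python
def hwreg_field(value, offset, size):
    """Extract `size` bits starting at bit `offset` from a 32-bit register value."""
    return (value >> offset) & ((1 << size) - 1)

if __name__ == "__main__":
    # A 3-bit field at bits 31:29, like the (29,3) used for CSP above
    print(hwreg_field(0xA0000000, 29, 3))
```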