realhet
Miniboss

Subroutines on the 7970 ISA

Hi again,

I'm trying to make subroutines in ISA. So far I've managed to do it with a one-level stack, this way:

@subRoutine:

    v_mov_b32 v1, 128  //do something

    s_setpc_b64 s[32:33]  //return to caller

@main:

  s_getpc_b64 s[32:33]

  s_add_u32 s32,s32,12

  s_addc_u32 s33,s33,0  //calculate return address

  s_branch @subRoutine  //call the subroutine

  ...

It's working well, but only with 1 level nesting. I've figured out that using s_movrel or LDS I can make a small stack to expand it.
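The `+12` in the snippet above can be checked with a little arithmetic. This is a minimal sketch in Python, assuming that s_getpc_b64 returns the address of the instruction that follows it and that each of the next three instructions is a single 4-byte dword (no literal constants). The base address and the `return_address` helper are made up for illustration.

```python
INSTR_BYTES = 4  # assumption: every instruction in @main is one 32-bit dword

def return_address(getpc_addr):
    """Return address computed by @main, given the address of s_getpc_b64."""
    pc_seen = getpc_addr + INSTR_BYTES            # s_getpc_b64: address of the next instruction
    lo = (pc_seen & 0xFFFFFFFF) + 12              # s_add_u32  s32, s32, 12
    carry = lo >> 32
    hi = ((pc_seen >> 32) + carry) & 0xFFFFFFFF   # s_addc_u32 s33, s33, 0
    return (hi << 32) | (lo & 0xFFFFFFFF)

base = 0x1000  # hypothetical address of s_getpc_b64
# layout: getpc @base, add @base+4, addc @base+8, branch @base+12, next @base+16
assert return_address(base) == base + 4 * INSTR_BYTES
# the carry into the high dword matters near a 4 GiB boundary
assert return_address(0xFFFFFFF8) == 0x100000008
```

If any of the three instructions needed a 32-bit literal, it would take 8 bytes and the constant 12 would have to change accordingly.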

Then I've found a thing called branch-stack along with the instructions:

s_cbranch_i_fork cond, addr

s_cbranch_join s0   // s0: Saved CSP value.

But it crashed (as expected), since I have absolutely no idea what that 'CSP value' is.

If anyone can explain how to use this branch-stack the elegant way, I'd be really thankful.

0 Likes
1 Solution
drallan
Challenger

Hi realhet,

I've used fork and join to split thread execution up to 2 levels. It can be used as a call but I'm not sure it helps much. The problem is the join instruction must reference a stack location, i.e., it's not a true return statement. WARNING:hack zone....

1. The 'stack' is using SGPR space, not memory. When execution starts, the stack pointer points to SGPR0 (s0).

There are two settings to set the stack, which are most likely in one of the hardware registers.

   sq_wave_mode_csp_offset Conditional-branch Stack Pointer

  sq_wave_mode_csp_size   Limit to stack size(?)

This example uses CSP=0

2. Each stack entry uses 4 sgprs, for CSP=0 they are:

    s[0:1] = 64 bit mask of threads NOT forked

    s[2:3] = 64 bit physical address of fork instruction + 4

    The stack grows by 4 sgprs after the fork instruction

3. The fork instruction does the following (well, seems to....)

  a. writes s[SP:SP+1] with bits for the currently active threads that are NOT forked.

  b. writes s[SP+2:SP+3] with the 64b return address just beyond the fork instruction

  c. halts execution of threads that are not forked.

  d. increments the stack pointer by 4.

  e. branches to PC + 16-bit immediate offset.

The following code implements a one level fork and join but the same ideas work with more levels. v6 traces where the threads go. In this example, only thread id=1 forks.

Note that just before s_cbranch_join, the exec mask is or'ed to the CSP value to activate all threads. Without the OR, only the un-forked threads (saved on the stack) will be active, the forked threads will thus stop. This provides flexibility but might make call/return messy.

    v_mov_b32        v6, 1               //VGPR to trace threads' paths, start=1

    s_movk_i32       s8, 2               //fork mask low-32, thread 1=on only

    s_movk_i32       s9, 0               //fork mask hi-32, all threads off

    s_cbranch_i_fork s[8:9], label_fork  //fork thread 1, others halt

    v_or_b32         v6, 2, v6           //RETURN POINT for fork, trace|=2

    s_branch         label_end           //all threads come here and go to 'end'

label_fork:

    v_or_b32         v6, 4, v6           //trace|=4

    s_or_b64         s0, exec, s0        //or active threads to threadmask on stack

    s_cbranch_join   s0                  //fork threads branch to RETURN POINT

label_end:

    tbuffer_store_format_xy v[5:6], out, s[4:7], TFORM_XY  //write some results

    s_endpgm

The s_cbranch_join instruction simply restores the environment for the halted threads that didn't fork. It's up to the program to define other functionality through the exec mask. Program output is:

Thread 0        3

Thread 1        7   (forked thread)

Thread 2..63    3
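The trace values above can be reproduced with a small host-side simulation. This is only a sketch of the behaviour described in this post (fork parks the non-forked threads, join restores them), not of the full hardware semantics. The `run` function and the list standing in for v6 are hypothetical names.

```python
WAVE = 64
FULL = (1 << WAVE) - 1

def run(fork_mask):
    v6 = [1] * WAVE                        # v_mov_b32 v6, 1
    exec_mask = FULL
    # s_cbranch_i_fork: park the threads that do not fork (s[0:1] in the post)
    saved = exec_mask & ~fork_mask
    exec_mask &= fork_mask
    # label_fork: only the forked threads are active here
    for t in range(WAVE):
        if (exec_mask >> t) & 1:
            v6[t] |= 4
    saved |= exec_mask                     # s_or_b64 s0, exec, s0
    exec_mask = saved                      # s_cbranch_join: restore mask, go to RETURN POINT
    for t in range(WAVE):
        if (exec_mask >> t) & 1:
            v6[t] |= 2                     # RETURN POINT, trace |= 2
    return v6

out = run(0b10)                            # only thread 1 forks
print(out[0], out[1], out[2])              # 3 7 3
```

Dropping the `saved |= exec_mask` line leaves thread 1 at 5 instead of 7, which is the "forked threads will thus stop" case mentioned above.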

drallan


0 Likes
10 Replies

Thanks for accurate info!

Your example worked well, then I played with it:

1. If I put -1 into s9

   s_mov_b32        s9, 0xffffffff

then the results are: threads 1 and 32..63 -> undefined (mainly zeroes; threads 0..8 -> 0..8, which I never write into v6, whether or not I zero it out beforehand)

2. If I put exec into s[8:9] (which is $FFFFFFFFFFFFFFFF), it runs perfectly the first time (all threads are 7), then crashes on the second run. The same happens when I manually set all the bits to 1, so something goes wrong when all the threads fork.

3. All threads forking except the first one: thread 0 -> correct 3, all other threads are garbage.

4. Let's test with forkmask.hi=0, where it produces the expected result, and move things from before the s_branch to after label_end:

- First put that "//RETURN POINT for fork, trace|=2"  -> all OK

- Then put "s_cbranch_i_fork s[8:9], label_fork" down right after label_end -> CRASH (which is weird, because the program flow was not altered, only the jump's position was shifted)

Because of that I think it's not a push_and_jump thing, but something like predicates on the VLIW hardware.

This fork and join thing must be something that has been invented for very special tasks in geometry/hull/domain shaders.

Now I know that branch-stack = SRegs, so s_movrel and set/get/swappc will do fine and at least that way I'll know what I'm doing.

Thank You for the reply!

Edit: What I called garbage was actually the initial, untouched data in the UAV, so maybe s_cbranch_join did not restore the exec mask and tbuffer_store was idle on some threads.

0 Likes

Re: 1,2, and 3.

Yes, I see the same. Certain mask combinations fail. After some testing, it looks like fork gives priority to the smaller group of threads. If <=32 threads fork, they run first and the remaining threads are put on the stack for the join instruction. If >32 threads fork, the non-forking threads run first and the forking threads are put on the stack. Maybe there needs to be a join at the end of each branch? Especially with 3 or more. I think you're right, these are there for special kinds of problems.
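The rule described here can be written down as a small helper. It is only a model of the behaviour observed in this thread for a full 64-thread wave, not a statement of what the hardware is specified to do. `split` and its return order are made up for illustration.

```python
WAVE = 64
FULL = (1 << WAVE) - 1

def split(exec_mask, fork_mask):
    """Return (runs_first, parked_on_stack) following the observed rule."""
    forking = exec_mask & fork_mask
    staying = exec_mask & ~fork_mask
    if bin(forking).count("1") <= 32:
        return forking, staying        # small fork group runs first
    return staying, forking            # otherwise the non-forking threads run first

# thread 1 forks alone: it runs first, the other 63 threads wait for the join
assert split(FULL, 0b10) == (0b10, FULL & ~0b10)
# every thread but thread 0 forks: thread 0 runs first, the forkers are parked
assert split(FULL, FULL & ~1) == (1, FULL & ~1)
```

In the second case the example program would let thread 0 fall through to label_end and s_endpgm without ever reaching a join, which might explain why only thread 0 produced a correct value in test 3.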

I wonder if anyone knows when AMD will release more detailed information on GCN? Should be soon, no?

0 Likes

We are finalizing the documentation for the ISA and it will be released after this is done.

0 Likes

Oh, I'm really happy to hear this!

0 Likes

I have just discovered s_movrel (the minihelp in the .dll is a bit misleading, though).

Tried to write this simple thing:

int uav[20], dstIdx = 0, maxIdx = 20;  // size 20 = maxIdx, chosen only so the snippet compiles

void fibonacci(int a, int b)

{

  uav[dstIdx++] = a + b;

  if (dstIdx < maxIdx) fibonacci(b, a + b);

}

int main(void) { fibonacci(1, 1); return 0; }

...using the S-ALU of the Tahiti chip, and it worked.

The stack pointer is the "m0" register initialized to 104

Every call pushes 2 parameters and a 64-bit return address. (Btw, on a 3GB card it's not necessary to push the high dword of addresses; I think those are absolute addresses.)

@Fibonacci:

  //entry code

  s_sub_u32 m0, m0, 2 \ s_movreld_b64 s0, s0      //push return addr s[0:1]

  s_movrels_b32 s0, s3                             //get 1st param //s0=ret_addr

  s_movrels_b32 s1, s2                             //get 2nd param

  //do fibonacci  //^^ those are just indices, not the actual contents of SRegs

  s_add_i32     s2, s0, s1                           

  //write the result

  v_writelane_b32 v1,s2,0

  tbuffer_store_format_x v1, v0, uav, 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]

  v_add_i32     v0, vcc, 4, v0                          //increment dst offset

  v_cmp_le_u32  vcc, maxAddr, v0                   //limit recursion

  s_cbranch_vccnz @nomore

      s_sub_u32 m0,m0,1 \ s_movreld_b32 s0, s1       //push 1st param

      s_sub_u32 m0,m0,1 \ s_movreld_b32 s0, s2       //push 2nd param

      s_swappc_b64 s[0:1], Fibonacci                           //call recursive

  @nomore:

  s_movrels_b64 s0,s0 \ s_add_u32 m0,m0,2         //pop return addr

  s_add_u32 m0,m0,2                               //clear parameters

  s_setpc_b64 s[0:1]                              //ret

//results: 2,       3,       5,       8,      13,      21, ...
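The stack discipline of the listing above can be followed on the host with a rough Python model. This is a sketch under stated assumptions: a list stands in for the SGPR file, the 64-bit return address is replaced by two placeholder dwords (Python handles the actual return), and `push`, `pop` and `MAX_IDX` are hypothetical names.

```python
SGPR_COUNT = 104            # the post initialises m0 to 104
sgpr = [0] * SGPR_COUNT     # stand-in for the scalar register file
m0 = SGPR_COUNT             # stack pointer, grows downward
uav = []                    # stand-in for the output buffer
MAX_IDX = 20

def push(value):
    global m0
    m0 -= 1
    sgpr[m0] = value

def pop():
    global m0
    value = sgpr[m0]
    m0 += 1
    return value

def fibonacci():
    # entry code: push a 64-bit return address (two placeholder dwords here)
    push(0)
    push(0)
    a = sgpr[m0 + 3]        # 1st parameter, pushed first by the caller
    b = sgpr[m0 + 2]        # 2nd parameter
    uav.append(a + b)
    if len(uav) < MAX_IDX:
        push(b)             # 1st parameter of the nested call
        push(a + b)         # 2nd parameter of the nested call
        fibonacci()
    pop(); pop()            # pop return address
    pop(); pop()            # clear parameters

push(1)
push(1)
fibonacci()
print(uav[:6])              # [2, 3, 5, 8, 13, 21]
```

Each frame takes 4 dwords, so 20 nested calls use 80 of the 104 slots in this model and the stack pointer ends back at 104.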

All the building blocks are present to make a fully functional C compiler for the S-ALU. That's the real 'General Purpose'.

0 Likes

Now, that's nice, fully recursive using 20 lines of assembly code..

Using the S-ALU, maybe full computers can be made without Intel cores.

Tahiti has 32 S-ALUs, should be enough.

0 Likes

Yeah, I was thinking the same: imagine a slow 4-core x86 at 1GHz with (more or less) 64-bit arithmetic and 8GB/sec mem bandwidth, but with 2048-bit SSE! Plus you can combine a legacy instruction with an SSE instruction in a single cycle (with some code size restrictions, but those exist on x86 too).

Put 32 of it on a single chip, and there goes the 7970.

Poor x86 has to emulate every old deprecated thing since the 1980's and it became so complicated that it's basically a microprocessor inside a microprocessor. But the new GCN instruction set was developed from scratch and thus producing raw power with minimal complications, I like it haha

0 Likes

realhet wrote

Poor x86 has to emulate every old deprecated thing since the 1980's and it became so complicated that it's basically a microprocessor inside a microprocessor. But the new GCN instruction set was developed from scratch and thus producing raw power with minimal complications, I like it haha

Yes, and x86 still has the cl register from the old traffic light controller days.

GCN has a lot of clever features, so perhaps AMD will conquer the world after all!

0 Likes

Hello again,

Now that the new architecture manual is out, it has become clear how fork and join work exactly. (Thanks to AMD for the manual!)

So it's basically an IF-THEN-ELSE block with a twist: when the threads diverge, it always executes the branch with fewer threads first. I guess that matters when workgroupsize>64, but for workgroupsize=64 I don't know why this is good.

    v_mov_b32       v6, 0                          //VGPR to trace threads' paths, start=0

    s_mov_b32       s8, $ffff0000                  //fork mask low-32

    s_mov_b32       s9, $0ffffff0                  //fork mask hi-32

---------------------------------------------------------

    s_mov_b64       s62, exec                      //save exec

    s_getreg_b32    s64, hwreg(HW_REG_STATUS,29,3) //Save CSP (Conditional Stack Ptr)

    s_cbranch_i_fork s[8:9], label_fork            //fork threads, others halt

      v_or_b32      v6, 1, v6    //FALSE path

    s_cbranch_join  s64                            //this will call label_fork: when needed

    s_branch label_end                             //otherwise end of fork

    label_fork:

      v_or_b32      v6, 8, v6    //TRUE path

    s_cbranch_join  s64                            //this will call the first fork if needed

label_end:

    s_mov_b64       exec, s62                      //restore exec after fork

So one fork needs 2 joins. Additional things needed: save/restore exec, and read the CSP from the HWRegs. You have to save CSP only once (even for a 6-level nested thing), and save/restore exec for every fork/join block (I guess).

The classic equivalent for this (without automatic nesting (based on a saved CSP value) and 'thread sorting') would be:
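To see why one fork needs two joins, the listing above can be stepped through with a tiny interpreter. This is a sketch, assuming the fork/join semantics as I read them from the manual's pseudocode (the smaller group runs first, the other group is pushed together with its resume address, and a join either pops an entry or falls through when the stack is back at the saved CSP). The Python list used as the stack and the step names are made up for illustration.

```python
WAVE = 64
FULL = (1 << WAVE) - 1

def popcount(x):
    return bin(x).count("1")

def run(fork_mask, exec_mask=FULL):
    v6 = [0] * WAVE
    stack = []                      # entries: (resume_pc, exec_mask); len(stack) plays CSP
    saved_exec = exec_mask          # s_mov_b64 s62, exec
    saved_csp = len(stack)          # s_getreg_b32 s64, hwreg(...)
    FORK, FALSE_PATH, JOIN1, BRANCH_END, TRUE_PATH, JOIN2, END = range(7)
    pc = FORK
    while pc != END:
        if pc == FORK:
            taken = exec_mask & fork_mask
            not_taken = exec_mask & ~fork_mask
            if taken == exec_mask:
                pc = TRUE_PATH                       # no divergence, plain jump
            elif not_taken == exec_mask:
                pc = FALSE_PATH                      # no divergence, fall through
            elif popcount(not_taken) < popcount(taken):
                stack.append((TRUE_PATH, taken))     # park the larger group
                exec_mask, pc = not_taken, FALSE_PATH
            else:
                stack.append((FALSE_PATH, not_taken))
                exec_mask, pc = taken, TRUE_PATH
        elif pc in (FALSE_PATH, TRUE_PATH):
            bit = 1 if pc == FALSE_PATH else 8
            for t in range(WAVE):
                if (exec_mask >> t) & 1:
                    v6[t] |= bit
            pc += 1
        elif pc in (JOIN1, JOIN2):
            if len(stack) == saved_csp:
                pc += 1                              # nothing parked: fall through
            else:
                pc, exec_mask = stack.pop()          # run the parked group
        elif pc == BRANCH_END:
            pc = END
    exec_mask = saved_exec          # s_mov_b64 exec, s62
    return v6

mask = (0x0FFFFFF0 << 32) | 0xFFFF0000   # s[8:9] from the listing
trace = run(mask)
assert all(trace[t] == (8 if (mask >> t) & 1 else 1) for t in range(WAVE))
```

Whichever group runs first ends at its own join, which pops the other group; the second join then finds the stack back at the saved CSP and falls through. That is why both paths need one.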

    v_mov_b32       v6, 0                          //VGPR to trace threads' paths, start=0

    s_mov_b32       s8, $ffff0000                  //fork mask low-32

    s_mov_b32       s9, $0ffffff0                  //fork mask hi-32

---------------------------------------------------------

    s_and_saveexec_b64  s62, s8      //note: scc0 == execz after _saveexec

      s_cbranch_scc0      @else

          v_or_b32          v6, 8, v6    //TRUE path

@else:

      s_andn2_b64         exec, s62, s8

      s_cbranch_scc0      @end

          v_or_b32          v6, 1, v6    //FALSE path

@end:

      s_mov_b64           exec,s62

I think fork/join works best when it is nested. Without nesting, _saveexec and andn2 are really cool alternatives.
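The _saveexec/andn2 pattern is plain mask arithmetic, which a few lines of Python can mimic. This is a minimal sketch: `if_then_else` is a hypothetical name, and a list stands in for v6.

```python
WAVE = 64
FULL = (1 << WAVE) - 1

def if_then_else(cond_mask, exec_mask=FULL):
    v6 = [0] * WAVE
    saved = exec_mask                       # s_and_saveexec_b64 s62, s8 (saves old exec)
    exec_mask = saved & cond_mask           # ...and narrows exec to the TRUE threads
    if exec_mask:                           # s_cbranch_scc0 @else skips an empty path
        for t in range(WAVE):
            if (exec_mask >> t) & 1:
                v6[t] |= 8                  # TRUE path
    exec_mask = saved & ~cond_mask          # s_andn2_b64 exec, s62, s8
    if exec_mask:                           # s_cbranch_scc0 @end
        for t in range(WAVE):
            if (exec_mask >> t) & 1:
                v6[t] |= 1                  # FALSE path
    exec_mask = saved                       # s_mov_b64 exec, s62
    return v6

mask = (0x0FFFFFF0 << 32) | 0xFFFF0000      # same condition mask as the listing
trace = if_then_else(mask)
assert all(trace[t] == (8 if (mask >> t) & 1 else 1) for t in range(WAVE))
```

Here the TRUE path always runs first, regardless of how many threads take it; nesting means saving exec into a different SGPR pair at each level.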

ps: Just discovered that hwreg(a,b,c) indexing thing, it's so clever to combine hwregister accessing with shifting and masking
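What hwreg(reg, offset, size) selects is a shift followed by a mask. This is a minimal sketch of that bit-field extraction. `getreg` is a hypothetical helper and the register value is invented; only the (29, 3) field position is taken from the listing above.

```python
def getreg(value, offset, size):
    """Extract `size` bits of a 32-bit register value, starting at bit `offset`."""
    return (value >> offset) & ((1 << size) - 1)

reg = (0b101 << 29) | 0x1234     # made-up register contents
assert getreg(reg, 29, 3) == 0b101
assert getreg(reg, 0, 16) == 0x1234
```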

ps2: Just finished reading the manual: Check "lds_direct" in it, we'd never figure that out without the manual haha! Also there are the specifications for the buffer resource descriptors and the sampler descriptors.

0 Likes