OpenCL

Miniboss

Subroutines on the 7970 ISA


Hi again,

I'm trying to implement subroutines in ISA. So far I've managed to do it with a one-level stack, like this:

@subRoutine:

    v_mov_b32 v1, 128  //do something

    s_setpc_b64 s[32:33]  //return to caller

@main:

  s_getpc_b64 s[32:33]

  s_add_u32 s32,s32,12

  s_addc_u32 s33,s33,0  //calculate return address

  s_branch @subRoutine  //call the subroutine

  ...

It works well, but only with one level of nesting. I've figured out that I could build a small stack with s_movrel or LDS to extend it.
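The +12 in the call sequence above comes from instruction sizes. Here is a minimal Python sketch of that arithmetic, assuming s_getpc_b64 stores the address of the instruction that follows it and that s_add_u32, s_addc_u32 and s_branch each occupy 4 bytes. The helper names and the example addresses are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def add_u32(a, b, carry_in=0):
    """Model a 32-bit add: returns (result, carry_out)."""
    total = a + b + carry_in
    return total & MASK32, total >> 32

def return_address(pc_lo, pc_hi, offset=12):
    """Mimic the s_add_u32 / s_addc_u32 pair applied to s[32:33].

    offset=12 skips three 4-byte instructions (assumed sizes):
    s_add_u32, s_addc_u32 and s_branch.
    """
    lo, carry = add_u32(pc_lo, offset)      # s_add_u32  s32, s32, 12
    hi, _ = add_u32(pc_hi, 0, carry)        # s_addc_u32 s33, s33, 0
    return lo, hi

# Hypothetical value left in s[32:33] by s_getpc_b64.
print([hex(x) for x in return_address(0x1004, 0)])      # ['0x1010', '0x0']
# The carry matters when the low dword wraps around.
print([hex(x) for x in return_address(0xFFFFFFF8, 0)])  # ['0x4', '0x1']
```

The second call shows why the s_addc_u32 is there: without it, a return address that crosses a 4GB boundary would be wrong.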

Then I've found a thing called branch-stack along with the instructions:

s_cbranch_i_fork cond, addr

s_cbranch_join s0   // s0: Saved CSP value.

But it crashed (as expected), since I have absolutely no idea what that 'CSP value' is.

If anyone can explain how to use this branch stack in an elegant way, I'd be really thankful.


Accepted Solutions
Challenger

Re: Subroutines on the 7970 ISA


Hi realhet,

I've used fork and join to split thread execution up to 2 levels deep. It can be used as a call, but I'm not sure it helps much. The problem is that the join instruction must reference a stack location, i.e., it's not a true return statement. WARNING: hack zone....

1. The 'stack' uses SGPR space, not memory. When execution starts, the stack pointer points to SGPR0 (s0).

There are two settings that configure the stack, most likely held in one of the hardware registers:

    sq_wave_mode_csp_offset   Conditional-branch Stack Pointer

    sq_wave_mode_csp_size     Limit to stack size(?)

This example uses CSP=0.

2. Each stack entry uses 4 SGPRs; for CSP=0 they are:

    s[0:1] = 64 bit mask of threads NOT forked

    s[2:3] = 64 bit physical address of fork instruction + 4

    The stack grows by 4 sgprs after the fork instruction

3. The fork instruction does the following (well, seems to...):

  a. writes s[SP:SP+1] with bits for the currently active threads that are NOT forked.

  b. writes s[SP+2:SP+3] with the 64b return address just beyond the fork instruction

  c. halts execution of threads that are not forked.

  d. increments the stack pointer by 4.

  e. branches to PC + 16-bit immediate offset.
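Steps a to e can be sketched as a small Python model. This only mirrors the behaviour described in this post; it is not taken from AMD documentation. The function name, the list standing in for the SGPR file, and passing the branch target as an absolute address (instead of decoding the 16-bit offset) are all assumptions made for illustration.

```python
LANES = 64
FULL = (1 << LANES) - 1
MASK32 = 0xFFFFFFFF

def fork(sgpr, sp, exec_mask, fork_mask, fork_pc, target_pc):
    """Model of the observed s_cbranch_i_fork behaviour.

    sgpr      : list standing in for the SGPR file (hypothetical)
    sp        : SGPR index of the next free stack entry
    fork_pc   : address of the fork instruction itself
    target_pc : resolved branch target (offset decoding is not modeled)
    """
    forked = exec_mask & fork_mask
    not_forked = exec_mask & ~fork_mask & FULL
    # a. s[SP:SP+1] = active threads that are NOT forked
    sgpr[sp], sgpr[sp + 1] = not_forked & MASK32, not_forked >> 32
    # b. s[SP+2:SP+3] = return address just beyond the fork instruction
    ret = fork_pc + 4
    sgpr[sp + 2], sgpr[sp + 3] = ret & MASK32, ret >> 32
    # c. only the forked threads stay active
    new_exec = forked
    # d. the stack pointer advances by 4 SGPRs
    new_sp = sp + 4
    # e. execution continues at the branch target
    return new_exec, new_sp, target_pc

sgpr = [0] * 16
exec_mask, sp, pc = fork(sgpr, 0, FULL, 0b10, fork_pc=0x100, target_pc=0x140)
print(hex(exec_mask), sp, hex(pc))    # 0x2 4 0x140
print([hex(x) for x in sgpr[:4]])     # ['0xfffffffd', '0xffffffff', '0x104', '0x0']
```

With CSP=0 the saved mask lands in s[0:1] and the return address in s[2:3], which matches the layout in point 2.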

The following code implements a one-level fork and join, but the same ideas work with more levels. v6 traces where the threads go. In this example, only thread id=1 forks.

Note that just before s_cbranch_join, the exec mask is OR'ed into the CSP value (the thread mask saved on the stack) to activate all threads. Without the OR, only the un-forked threads (saved on the stack) will be active, and the forked threads will thus stop. This provides flexibility but might make call/return messy.

    v_mov_b32        v6, 1               //VGPR to trace threads' paths, start=1

    s_movk_i32       s8, 2               //fork mask low-32, thread 1=on only

    s_movk_i32       s9, 0               //fork mask hi-32, all threads off

    s_cbranch_i_fork s[8:9], label_fork  //fork thread 1, others halt

    v_or_b32         v6, 2, v6           //RETURN POINT for fork, trace|=2

    s_branch         label_end           //all threads come here and go to 'end'

label_fork:

    v_or_b32         v6, 4, v6           //trace|=4

    s_or_b64         s0, exec, s0        //or active threads to threadmask on stack

    s_cbranch_join   s0                  //fork threads branch to RETURN POINT

label_end:

    tbuffer_store_format_xy v[5:6], out, s[4:7], TFORM_XY  //write some results

    s_endpgm

The s_cbranch_join instruction simply restores the environment for the halted threads that didn't fork. It's up to the program to define other functionality through the exec mask. Program output is:

Thread 0        3

Thread 1        7   (forked thread)

Thread 2..63    3
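The trace values above can be reproduced with a short Python model of the example. It assumes the fork and join behave exactly as described in this post (join restores exec from the saved mask and jumps to the return point), and it tracks only v6 and the exec mask. The function names are invented for the sketch.

```python
LANES = 64
FULL = (1 << LANES) - 1

def or_trace(v6, exec_mask, bits):
    """Apply v_or_b32 v6, bits, v6 on the active lanes only."""
    for lane in range(LANES):
        if exec_mask >> lane & 1:
            v6[lane] |= bits

def run_example(fork_mask, or_exec_before_join=True):
    v6 = [1] * LANES                       # v_mov_b32 v6, 1
    exec_mask = FULL
    # s_cbranch_i_fork: save the un-forked threads, keep the forked ones running
    saved = exec_mask & ~fork_mask         # modeled s[0:1]
    exec_mask = exec_mask & fork_mask
    or_trace(v6, exec_mask, 4)             # label_fork: trace |= 4
    if or_exec_before_join:
        saved |= exec_mask                 # s_or_b64 s0, exec, s0
    # s_cbranch_join: exec comes back from the stack, PC goes to RETURN POINT
    exec_mask = saved
    or_trace(v6, exec_mask, 2)             # RETURN POINT: trace |= 2
    return v6, exec_mask

v6, final_exec = run_example(fork_mask=0b10)
print(v6[0], v6[1], v6[2])                 # 3 7 3
```

Calling `run_example(0b10, or_exec_before_join=False)` leaves bit 1 cleared in the returned exec mask, which is the "forked threads stop" case mentioned in the note about the OR.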

drallan

10 Replies
Miniboss

Re: Subroutines on the 7970 ISA


Thanks for the accurate info!

Your example worked well, then I played with it:

1. If I put -1 into s9

   s_mov_b32        s9, 0xffffffff

then the results for threads 1 and 32..63 are undefined (mainly zeroes; threads 0..8 show 0..8, which I never write into v6, whether or not I zero it out beforehand).

2. If I put exec into s[8:9] (which is $FFFFFFFFFFFFFFFF), it runs perfectly the first time (all threads are 7), then crashes on the second run. The same happens when I manually set all the bits to 1, so something goes wrong when all the threads fork.

3. All threads forking except the first one: thread 0 -> correct 3, all other threads are garbage.

4. Let's test with forkmask.hi=0, where it produces the expected result, by moving things from before the s_branch to after label_end:

- First move the "//RETURN POINT for fork, trace|=2" instruction -> all OK

- Then move "s_cbranch_i_fork s[8:9], label_fork" down to right after label_end -> CRASH (which is weird, because the program flow was not altered; only the jump's position was shifted)

Because of that, I think it's not a push-and-jump thing, but something like predicates on the VLIW hardware.

This fork and join thing must have been invented for very special tasks in geometry/hull/domain shaders.

Now I know that branch-stack = SRegs, so s_movrel and set/get/swappc will do fine, and at least that way I'll know what I'm doing.

Thank You for the reply!

Edit: When I mentioned garbage, it was actually the initial, untouched data in the UAV, so maybe s_cbranch_join did not restore the exec mask and tbuffer_store stayed idle on some threads.

Challenger

Re: Subroutines on the 7970 ISA


Re: 1,2, and 3.

Yes, I see the same. Certain mask combinations fail. After some testing, it looks like fork gives priority to the smaller group of threads. If <=32 threads fork, they run first and the remaining threads are put on the stack for the join instruction. If >32 threads fork, the non-forking threads run first and the forking threads are put on the stack. Maybe there needs to be a join at the end of each branch? Especially with 3 or more. I think you're right, these are there for special kinds of problems.
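The ordering rule guessed at above can be written down as a tiny Python function. This is only the hypothesis from this reply (threshold of 32 forking threads), not documented hardware behaviour; the function name and return convention are invented here.

```python
LANES = 64
FULL = (1 << LANES) - 1

def fork_order(exec_mask, fork_mask):
    """Return (runs_first, put_on_stack) under the observed-priority hypothesis."""
    forked = exec_mask & fork_mask
    not_forked = exec_mask & ~fork_mask
    if bin(forked).count("1") <= 32:
        # few threads fork: they run first, the rest wait on the stack
        return forked, not_forked
    # many threads fork: the non-forking threads run first instead
    return not_forked, forked

first, stacked = fork_order(FULL, 0b10)        # only thread 1 forks
print(hex(first), hex(stacked))   # 0x2 0xfffffffffffffffd

first, stacked = fork_order(FULL, FULL & ~1)   # everything except thread 0 forks
print(hex(first), hex(stacked))   # 0x1 0xfffffffffffffffe
```

If the hypothesis holds, code that always expects the saved mask on the stack to be "the threads that did not fork" would break as soon as more than 32 threads fork, which would fit the failing mask combinations.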

I wonder if anyone knows when AMD will release more detailed information on GCN? Should be soon, no?

Miniboss

Re: Subroutines on the 7970 ISA


I have just discovered s_movrel (the mini-help in the .dll is a bit misleading, though).

I tried to write this simple thing:

int uav[20], dstIdx=0, maxIdx=20;

void fibonacci(int a, int b)

{

  uav[dstIdx++]=a+b;

  if(dstIdx<maxIdx) fibonacci(b,a+b);

}

int main(void) {fibonacci(1,1); return 0;}

...using the S-ALU of the Tahiti chip, and it worked.

The stack pointer is the "m0" register, initialized to 104.

Every call pushes 2 parameters and a 64-bit return address. (By the way, on a 3GB card it's not necessary to push the high dword of addresses; I think those are absolute addresses.)

@Fibonacci:

  //entry code

  s_sub_u32 m0, m0, 2 \ s_movreld_b64 s0, s0      //push return addr s[0:1]

  s_movrels_b32 s0, s3                             //get 1st param //s0=ret_addr

  s_movrels_b32 s1, s2                             //get 2nd param

  //do fibonacci  //^^ those are just indices, not the actual contents of SRegs

  s_add_i32     s2, s0, s1                           

  //write the result

  v_writelane_b32 v1,s2,0

  tbuffer_store_format_x v1, v0, uav, 0 offen format:[BUF_DATA_FORMAT_32,BUF_NUM_FORMAT_FLOAT]

  v_add_i32     v0, vcc, 4, v0                          //increment dst offset

  v_cmp_le_u32  vcc, maxAddr, v0                   //limit recursion

  s_cbranch_vccnz @nomore

      s_sub_u32 m0,m0,1 \ s_movreld_b32 s0, s1       //push 1st param

      s_sub_u32 m0,m0,1 \ s_movreld_b32 s0, s2       //push 2nd param

      s_swappc_b64 s[0:1], Fibonacci                           //call recursive

  @nomore:

  s_movrels_b64 s0,s0 \ s_add_u32 m0,m0,2         //pop return addr

  s_add_u32 m0,m0,2                               //clear parameters

  s_setpc_b64 s[0:1]                              //ret

//results: 2,       3,       5,       8,      13,      21, ...
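To make the stack layout easier to follow, here is a rough Python model of the same scheme: m0 is the stack pointer, the stack grows downward through a list standing in for the SGPR file, and each call frame holds a 64-bit return address plus two parameters. Python's own call stack replaces the real control flow, and the return-address values are made-up placeholders; only the data movement of the assembly is modeled.

```python
def run_fibonacci(max_idx=20, stack_top=104):
    sgpr = [0] * stack_top   # modeled SGPR file; the stack lives just below stack_top
    m0 = stack_top           # stack pointer, initialized to 104 as in the post
    uav = []                 # stands in for the output buffer
    lowest_m0 = m0

    def push32(value):       # s_sub_u32 m0,m0,1 \ s_movreld_b32
        nonlocal m0, lowest_m0
        m0 -= 1
        sgpr[m0] = value
        lowest_m0 = min(lowest_m0, m0)

    def fibonacci(ret_addr):
        nonlocal m0, lowest_m0
        m0 -= 2              # entry: push the 64-bit return address
        lowest_m0 = min(lowest_m0, m0)
        sgpr[m0], sgpr[m0 + 1] = ret_addr & 0xFFFFFFFF, ret_addr >> 32
        a = sgpr[m0 + 3]     # 1st param (pushed first, so it sits higher)
        b = sgpr[m0 + 2]     # 2nd param
        total = a + b
        uav.append(total)    # write the result
        if len(uav) < max_idx:
            push32(b)        # push 1st param of the next call
            push32(total)    # push 2nd param of the next call
            fibonacci(ret_addr=0x2000)   # placeholder return address
        restored = sgpr[m0] | (sgpr[m0 + 1] << 32)   # pop return addr
        m0 += 2
        m0 += 2              # clear parameters
        return restored

    push32(1)
    push32(1)
    fibonacci(ret_addr=0x1000)           # placeholder return address
    return uav, m0, lowest_m0

values, final_m0, lowest_m0 = run_fibonacci()
print(values[:6])            # [2, 3, 5, 8, 13, 21]
print(final_m0, lowest_m0)   # 104 24
```

Each frame costs 4 SGPRs, so 20 levels of recursion bring m0 down to 24 in this model, which stays clear of the low registers (s0..s3) the routine uses as scratch.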

All the building blocks are present to make a fully functional C compiler for the S-ALU. That's the real 'General Purpose'.

Challenger

Re: Subroutines on the 7970 ISA


Now, that's nice, fully recursive using 20 lines of assembly code..

Using the S-ALU, maybe full computers can be made without Intel cores.

Tahiti has 32 S-ALUs, should be enough.

Miniboss

Re: Subroutines on the 7970 ISA


Yeah, I was thinking the same. Imagine a slow 4-core x86 at 1GHz with (more or less) 64-bit arithmetic and 8GB/sec memory bandwidth, but with 2048-bit SSE! Plus you can combine a legacy instruction with an SSE instruction in a single cycle (with some code size restrictions, but those exist on x86 too).

Put 32 of them on a single chip, and there goes the 7970.

Poor x86 has to emulate every old deprecated thing since the 1980s, and it became so complicated that it's basically a microprocessor inside a microprocessor. But the new GCN instruction set was developed from scratch and thus produces raw power with minimal complications. I like it, haha.


Re: Subroutines on the 7970 ISA


We are finalizing the documentation for the ISA and it will be released after this is done.

Miniboss

Re: Subroutines on the 7970 ISA


Oh, I'm really happy to hear this!

Challenger

Re: Subroutines on the 7970 ISA


realhet wrote:

Poor x86 has to emulate every old deprecated thing since the 1980s, and it became so complicated that it's basically a microprocessor inside a microprocessor. But the new GCN instruction set was developed from scratch and thus produces raw power with minimal complications. I like it, haha.

Yes, and x86 still has the cl register from the old traffic light controller days.

GCN has a lot of clever features, so perhaps AMD will conquer the world after all!
