GCN Instructions scheduling clarification

Hello All,

Disclaimer: I am new to GPU programming (started on Tuesday) but not to parallel programming or SIMD programming (Connection Machine, 16K 1-bit processors running in SIMD fashion... I am dating myself here 😉).

Also, I did read the 2620_final white paper and the Southern Islands ISA document, plus any other papers/presentations I could get my hands on...

I have some questions regarding the way instructions are scheduled on a GCN CU.

What I have understood so far:

A CU has 4 vector units, 1 scalar unit, and 1 LDS, and it issues one instruction per cycle to each, but vector unit i consumes the same instruction 4 times from wavefront wf-ai (4 cycles), with i in {0,1,2,3}.

The scalar unit gets one instruction per cycle from 4 different wavefronts wf-b0, wf-b1, wf-b2, and wf-b3. So, seen from one vector unit, one scalar instruction can be executed per 4-cycle group if it belongs to a wavefront not running on that vector unit.
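The issue cadence described above can be sketched as a toy model (purely illustrative, assuming a simple rotating issue slot; not a claim about the real hardware arbiter):

```python
# Toy model of the GCN CU issue cadence described above: each cycle the
# instruction-issue slot rotates to the next of the 4 vector units, so any
# single vector unit (and hence any wavefront pinned to it) receives at
# most one issue per 4-cycle group.

def issue_schedule(cycles):
    """Return (cycle, vector_unit) pairs for the rotating issue slot."""
    return [(c, c % 4) for c in range(cycles)]

# Over 8 cycles, vector unit 0 receives the issue slot exactly twice:
slots = issue_schedule(8)
unit0_issues = [c for c, u in slots if u == 0]
print(unit0_issues)  # [0, 4] -> one issue per 4-cycle group
```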

Question 1: Could a wavefront with many successive scalar instructions (and no other wavefront in the CU in a position to execute a scalar instruction) run more than one scalar instruction per 4-cycle group? (I would guess not, if the scalar unit has the same need to hide pipeline latency as the vector units do.)

Question 2: Could an LDS instruction and a scalar instruction belonging to the same wavefront run in the same 4-cycle group?

Same question for LDS and vector instructions of the same wavefront.

Question 3: Wavefront priority: does the priority affect LDS and scalar instruction dispatch over the whole CU, or just on a per-vector-unit basis?

Question 5: On a vector unit with two wavefronts executing on it (same priority), can it be assumed that they will alternate every 4 cycles (no GDS or LDS execution pending, and no conflicts)?

Question 6: 64-bit instructions on the vector unit take 4 cycles (16 cycles for a wavefront). I understand this as a pure stall for that vector unit, but could 4 or more scalar and LDS instructions belonging to other wavefronts running on the same vector unit be executed during those cycles?

Voila, I guess that will be it for now...


Eric L.

PS: The targeted app is of a symbolic nature (marginal use of floating point) and it is latency sensitive.

4 Replies


1: It can't. If you write 2 scalar instructions adjacent to each other, then you simply lose a vector slot. As far as I know it works as a nice choreography: there are phase differences between the 4 vector units (4-stage pipeline), so in each cycle the S unit belongs to one specific V unit only.

The goal is to insert S instructions into the V instruction stream in a way that doesn't stall the V instructions at all.

Here are some rules:

- Don't place an S right after another S.

- Don't place an S right after a V which writes into the S registers (for example v_add_i32, which writes its carry to S regs).

- Don't let 64-bit instructions be concentrated in one location, because there is a limit in the instruction decoder: denser code needs more threads to hide this. Basically, in an SVSVSVS pattern, don't put a 64-bit instruction right after another 64-bit one. You can use more than one 64-bit instruction near each other, but then use more 32-bit ones later.

- The S and V instructions are completely separated from each other: you can use the result of an S instruction in the next V instruction, and there'll be at least a 4-cycle gap between the S and the corresponding V instruction.

- If VRegCount exceeds 128 (256 is the max), then the S unit will stall the vector units badly. If you stay below 129, it means 2x more wavefronts, and that hides this kind of stall.
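The first two rules above can be sketched as a tiny lint pass over an instruction stream (a hypothetical tagging scheme of my own, not real ISA parsing):

```python
# Hypothetical lint for the interleaving rules above. Instructions are
# tagged 'S' (scalar), 'V' (vector), or 'Vs' (a vector instruction that
# also writes an S register, like v_add_i32 writing carry). Flags the two
# hazard patterns the rules warn against.

def find_hazards(stream):
    hazards = []
    for i in range(1, len(stream)):
        prev, cur = stream[i - 1], stream[i]
        if cur == 'S' and prev == 'S':
            hazards.append((i, 'S right after S'))
        if cur == 'S' and prev == 'Vs':
            hazards.append((i, 'S right after V writing S regs'))
    return hazards

print(find_hazards(['V', 'S', 'V', 'S']))    # [] -> clean VSVS interleave
print(find_hazards(['S', 'S', 'Vs', 'S']))   # both hazard patterns flagged
```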

2: An LDS and an S instruction can execute in 'parallel', like an S and a V. V cannot be combined with LDS, because both use VRegs. But the result of an LDS operation is not immediate: you have to wait for it with an s_waitcnt instruction (you can do something like 6-8 V instructions before the s_waitcnt if you want).

3: I think this priority is managed at a higher level -> page 4 of GCN_2620_final.pdf, the "work distributor". That distributes wavefronts across CUs.

5: 2 WFs are not enough for the 4 V units. There should be 2*4 64-thread WFs to get peak performance out of the V and S ALUs. So on a 2048-stream Tahiti you need at least 8192 threads, or if there are lots of S instructions then 16K is better. (Or 4x 128-thread WFs per CU can be an option, too.)

6: Those special instructions generate stalls. I'm not sure, but I think all the 64-bit ones (including int and float) are bound to the double-precision stream processors. There are fewer of those than of the 32-bit V streams. On Tahiti the DP:SP ratio is 1:4. They work only exclusively, because they share one register array.

("Connection Machine 16K 1-bit processors": I've looked that up on Wikipedia, those photographs with the many thousands of processor-usage LEDs. A Radeon would be fun with a few hundred LEDs displaying actual CU usage.)


Thanks for your prompt answers.

As I am not interested in throughput but purely in latency, I need to understand precisely the minimal execution time and the caveats associated with each instruction.

So, is there a paper or web page that describes the timing of each instruction for the Tahiti chip, as I will be working with an S9000?

Also, the syntax used to describe instructions in the ISA document is more often than not cryptic. For example, V_ALIGNBYTE_B32 is described as

D.u = ({S0, S1} >> (8 * S2.u[4:0])) & 0xFFFFFFFF

[4:0] represents 5 bits. I figure that one of the shifts is 2 bits and the other is a signed 3-bit number, in order to enable left shifting for one of the S0 or S1 VGPRs.

But that is probably wishful thinking. Where is the syntax described more precisely?




I don't know about an exact documentation of that for each instruction.

If I don't know how slow an instruction is, I try measuring clock cycles with s_memtime. From here on, 1 'clock' means that s_memtime reports a difference of 1 when adding or removing a thing from the instruction stream.

With s_memtime I can tell whether a good VSVSVSVS instruction sequence is executed in sum(V) clocks, or whether there is a penalty somewhere.

Basically there are a few kinds of instructions (not talking about S, because if you do it well, they execute transparently):

1 cycle: the simple 32-bit ones, no matter if they take 1, 2, or 3 operands.

4 cycles: the 64-bit ones, or the complicated ones: precise sin/cos (there are coarse sin/cos as well at 1 clock).

many cycles: I think the complicated graphics things use many clocks (for example the LIT lighting-calculation instruction), but I'm just speculating, never tried...

1 cycle, but completion delayed by many cycles: these are LDS ops (around 5 clocks) and memory ops (many more clocks).

Check the s_waitcnt instruction -> it is there to synchronize those very long memory/LDS/GDS latencies (the V/S stream keeps working while they execute).

And there are penalties also:

If you put an S instruction after a V which writes S-regs (like the compare ops, or add_i32), then it will punish you with +4 clocks. So a good SV pair will be executed in 1 cycle, while the S-reg case above will take 5 cycles in total (as reported by s_memtime).
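That penalty can be written down as a back-of-the-envelope cycle estimator (my own sketch of the numbers quoted in this post, checked against nothing but the post itself):

```python
# Rough cycle estimate for an instruction stream, using the costs quoted
# above: well-interleaved S instructions are free, a simple V costs 1
# clock, and an S issued right after a V that writes S regs ('Vs') costs
# an extra 4 clocks.

def estimate_cycles(stream, v_cost=1):
    cycles = 0
    for i, instr in enumerate(stream):
        if instr in ('V', 'Vs'):
            cycles += v_cost
        elif instr == 'S' and i > 0 and stream[i - 1] == 'Vs':
            cycles += 4  # the +4 clock punishment described above
    return cycles

print(estimate_cycles(['S', 'V', 'S', 'V']))  # 2 -> good SV pairs: sum(V)
print(estimate_cycles(['Vs', 'S']))           # 5 -> 1 + the 4-clock penalty
```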

Also there are some conflicts that you have to watch out for; handle them with either filler V instructions or the s_nop instruction (check the manual! There are recommended nop counts for the different cases; maybe you can find some of the latency info you wanted there).

V_ALIGNBYTE_B32 is described as

D.u = ({S0, S1} >> (8 * S2.u[4:0])) & 0xFFFFFFFF

I think they just copy+pasted the description of V_ALIGNBIT_B32: there the 5 bits were needed, here we need only 3, as you noticed.

And {S0, S1} means the 64-bit concatenation of S0 and S1; you can address a dword inside it at a byte offset given by the low bits of S2.
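That reading of the pseudo-code can be modelled in a few lines (assumptions clearly marked: S0 forms the high dword of the concatenation, Verilog-style, and only the 2 low bits of S2 matter; the real hardware's bit-width for S2 may differ):

```python
# Sketch of V_ALIGNBYTE_B32 semantics as read from the pseudo-code above.
# Assumptions: {S0, S1} concatenates with S0 as the high dword, and the
# shift count is 8 times the low 2 bits of S2 (byte offset 0..3).

def v_alignbyte_b32(s0, s1, s2):
    packed = ((s0 & 0xFFFFFFFF) << 32) | (s1 & 0xFFFFFFFF)
    return (packed >> (8 * (s2 & 0x3))) & 0xFFFFFFFF

# With a byte offset of 1, the result is 3 bytes of S1 plus 1 byte of S0:
print(hex(v_alignbyte_b32(0xAABBCCDD, 0x11223344, 1)))  # 0xdd112233
```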
