
maxdz8

Exact meaning of ALU latency measurements?

In section 6.6.1 of the APP guide, "Hiding ALU and Memory Latency", I read:

The read-after-write latency for most arithmetic operations (a floating-point add, for example) is only four cycles.

Read-after-write... since SI devices take 4 cycles to execute an instruction, my understanding is that they WRITE the result after four clocks, that is, the first 16-work-item slice of the result. The register is somehow marked "being written" so it cannot be read.

Or in other terms: I cannot use a value immediately after computing it, and I must put at least one interleaving instruction in between.

Is this correct?

I am not quite sure about the GCN ISA, but I had some kernels which seemed to use wait instructions for no apparent reason.

I've also measured some performance increase in a case where I manually merged two work-items... but again I have no real evidence.

I'll have to write a few things next week, so I guessed it was a good time to ask. Thank you for your input.

realhet

Hi,

This example is absolutely correct: it can achieve the peak performance of the GPU, even though it uses the same register for everything:

v_mad_f32 v0, v0, v0, v0
v_mad_f32 v0, v0, v0, v0
v_mad_f32 v0, v0, v0, v0
v_mad_f32 v0, v0, v0, v0


The 4-cycle latency comes from the 4-stage pipeline. If you look at just one V-ALU (there are 4 of them in a CU), it processes data like this:

stage0,          stage1,          stage2,          stage3
instr0[0..15],   idle,            idle,            idle
instr0[16..31],  instr0[0..15],   idle,            idle
instr0[32..47],  instr0[16..31],  instr0[0..15],   idle
instr0[48..63],  instr0[32..47],  instr0[16..31],  instr0[0..15]

And here we are at the 4-cycle latency: work-items [0..15] of instr0 have completed, so the ALU can continue with the first quarter of the next instruction:

instr1[0..15],   instr0[48..63],  instr0[32..47],  instr0[16..31]

...

This is one wavefront running at 1/4 throughput. In a CU there are 4 V-ALUs, so that gives you the 1 wavefront/cycle throughput.

The S-ALU is also 4-staged and cycles through V-ALUs 0..3.
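To make the staging concrete, here is a toy Python model (my own sketch, not anything from the ISA docs) of one V-ALU issuing back-to-back dependent instructions: each instruction issues its four 16-work-item quarters on consecutive cycles, so the first quarter of a dependent instruction can issue 4 cycles after the first quarter of the previous one.

```python
# Toy model of one GCN V-ALU: a 4-stage pipeline that issues one
# 16-work-item quarter of a 64-wide wavefront per cycle.
STAGES = 4  # quarters per wavefront = pipeline depth

def issue_cycles(num_instructions):
    """Return {(instr, quarter): issue cycle} for a chain of
    back-to-back dependent instructions on a single V-ALU."""
    schedule = {}
    cycle = 0
    for instr in range(num_instructions):
        for quarter in range(STAGES):
            schedule[(instr, quarter)] = cycle
            cycle += 1
    return schedule

sched = issue_cycles(2)
# instr0's quarters issue on cycles 0..3; its first quarter leaves the
# last pipeline stage just in time for dependent instr1 to start its
# first quarter on cycle 4 -- the 4-cycle read-after-write latency.
print(sched[(0, 0)], sched[(1, 0)])  # 0 4
```

With four phase-shifted V-ALUs per CU doing this in parallel, the latency is hidden and the CU as a whole still issues one wavefront instruction per cycle.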

I usually play with the instruction order of the inner loop and measure it with s_memtime. Every time I realize how complex it is, and I don't really know what I'm doing, haha. There are some basic things to keep in mind: no scalar instruction right after an integer add, and avoid a too-dense instruction stream: 12 bytes/cycle is fine (V and S together = 12 bytes).

>I am not quite sure of GCN ISA but I had some kernels which seem to use wait instructions for no apparent reason.

Hmm. I think every wait the driver's compiler emits has a reason. Maybe recently it uses more of them, but they are there for a reason.

Long ago it did it like this:

load a,b,c,d
wait a,b,c,d  (wait 0)
...do the math

And nowadays it is more like this:

load a,b,c,d
wait a,b  (wait 2)
add sum,a
add sum,b
wait c,d  (wait 0)
add sum,c
add sum,d


The latter has more opportunities to be interleaved with other wavefronts; maybe that's why.
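A rough way to see the benefit is to model the two schedules in Python with made-up load-return cycles (the numbers below are illustrative, not measured): waiting in pairs lets the adds for a and b run while c and d are still in flight.

```python
# Toy model: each add takes 1 cycle; the loads of a,b,c,d return at
# fixed, made-up cycles. Compare "wait for everything" (wait 0 up
# front) against "wait in pairs" (wait 2, then wait 0).
RETURN = {"a": 10, "b": 12, "c": 20, "d": 22}

def finish_wait_all():
    """wait a,b,c,d (wait 0), then the four adds back to back."""
    cycle = max(RETURN.values())   # stalled until the last load lands
    return cycle + 4               # add a, b, c, d

def finish_wait_pairs():
    """wait a,b (wait 2); add a, add b; wait c,d (wait 0); add c, add d."""
    cycle = max(RETURN["a"], RETURN["b"]) + 2          # adds for a, b
    cycle = max(cycle, RETURN["c"], RETURN["d"]) + 2   # adds for c, d
    return cycle

print(finish_wait_all(), finish_wait_pairs())  # 26 24
```

The paired version finishes slightly earlier for this wavefront, and, more importantly, every cycle it spends stalled at a wait is a cycle the scheduler can hand to other wavefronts, which is the interleaving point above.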

This is oddly informative. I can tell an answer is there but I cannot make sense of most of it!

I'm not even sure what this 4-stage pipeline is. It was my understanding that a wavefront would stay on a certain SIMD until completed.

I will investigate the "scalar instruction after integer add" thing.

I've found an earlier topic about this: Re: How GCN scheduling work??

You can see in the image that the 4 vector ALUs are phase-shifted relative to each other. In every cycle, one of the 4 V-ALUs starts working on the first quarter of a wavefront, and one S-ALU instruction starts as well (not in the picture). It works like clockwork; it's an awesome design. (In the image, one dot means a work-item has started on stage 0; you can't see the in-flight work-items that are on stages 1..3.)

"scalar instruction after integer add" -> only one thing can write the scalar registers at a time: either a scalar or a vector instruction. Integer vector-additon always stores the 64 carry bits in the scalar regs.


I think you understand what you're trying to say.

Basically they share the instruction decode, which pulls out 1 V + 1 S instruction per clock.
