
GCN Instructions scheduling clarification

Question asked by cdg2lax on Feb 16, 2013
Latest reply on Feb 18, 2013 by jesikakorla

Hello All,

 

Disclaimer: I am new to GPU programming (started on Tuesday), but not to parallel programming or SIMD programming (Connection Machine, 16K 1-bit processors running in SIMD fashion... I am dating myself here ;-) ).

Also, I did read the white paper 2620_final and the Southern Islands ISA document, plus any other papers/presentations I could get my hands on...

 

I have some questions regarding the way instructions are scheduled on a GCN CU.

 

What I have understood so far:

A CU has 4 vector units, 1 scalar unit and 1 LDS, and issues one instruction per cycle to each, but:

Vector unit i consumes the same instruction from wavefront wf-ai four times (4 cycles), for i in {0,1,2,3}.

The scalar unit gets one instruction per cycle from 4 different wavefronts wf-b0, wf-b1, wf-b2 and wf-b3. So, seen from one vector unit, one scalar instruction could be executed per 4 cycles if it belongs to a wavefront not currently running on the vector unit.
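
To make this mental model concrete, here is a toy sketch in Python. It only encodes my understanding as described above and is not a description of the real hardware arbiter: the strict round-robin over the 4 vector units, the 16-lane width, and the names `issue_slots` and `scalar_issue_opportunities` are all assumptions made for illustration.

```python
# Toy model of the issue pattern described above (assumptions, not hardware facts).
WAVEFRONT_SIZE = 64   # work-items per wavefront
SIMD_LANES = 16       # assumed lanes per vector unit
NUM_SIMDS = 4         # vector units per CU

# A 64-wide wavefront on a 16-lane unit needs 4 passes, hence the 4-cycle group.
CYCLES_PER_VECTOR_OP = WAVEFRONT_SIZE // SIMD_LANES


def issue_slots(num_cycles):
    """For each cycle, the vector unit whose wavefronts get the issue turn
    (assumed strict round-robin)."""
    return [cycle % NUM_SIMDS for cycle in range(num_cycles)]


def scalar_issue_opportunities(simd_id, num_cycles):
    """Cycles at which a wavefront resident on vector unit `simd_id` could
    hand an instruction to the shared scalar unit in this model."""
    return [c for c, s in enumerate(issue_slots(num_cycles)) if s == simd_id]


if __name__ == "__main__":
    print(CYCLES_PER_VECTOR_OP)               # 4
    print(issue_slots(8))                     # [0, 1, 2, 3, 0, 1, 2, 3]
    print(scalar_issue_opportunities(1, 16))  # [1, 5, 9, 13]
```

In this model a wavefront on a given vector unit gets one scalar issue opportunity every 4 cycles, which is the assumption behind the questions below.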

 

Question 1: Could a wavefront with many successive scalar instructions (and no other wavefront in the CU in a position to execute a scalar instruction)
run more than one scalar instruction per 4-cycle group? (I would guess not, if the scalar unit has the same need to hide pipeline latency as the vector units.)

 

Question 2: Could an LDS instruction and a scalar instruction belonging to the same wavefront run in the same 4-cycle group?

                 Same question for LDS and vector instructions of the same wavefront.

 

Question 3: Wavefront priority: does the priority affect LDS and scalar instruction dispatch over the whole CU, or just on a per-vector-unit basis?

 

Question 5: On a vector unit with two wavefronts executing on it (same priority), can it be assumed that they will alternate every 4 cycles (with no GDS or LDS instruction pending execution and no conflict)?

 

Question 6: 64-bit instructions on the vector unit take 4 cycles (16 cycles for a wavefront). I understand this as a pure stall for that vector unit, but could 4 or more scalar and LDS instructions belonging to other wavefronts running on the same vector unit be executed during those cycles?

 

Voila, I guess that will be it for now...

 

Thanks,

 

Eric L.

PS: The targeted app is of a symbolic nature (marginal use of floating point) and it is latency sensitive.
