
How does the RV7xx CU work?

Question asked by yuvikarti on Apr 14, 2013
Latest reply on Apr 14, 2013 by yuvikarti

Hello all,


This might be the same question asked again. I don't understand how the CU works, particularly the threads. I'm a newbie to GPU programming and I'm trying to understand how the R700 works.


I understand that


each CU (SIMD pipeline) has 16 thread processors, also called VLIW cores or lanes.


Okay, here are my assumptions:


assumption 1:




clause1: add1, add2, mul1, div1

clause2: add1, add2, add3, div1


A CU has 64 threads, which are divided into 4 wavefronts.


Wavefront 1: thread 0 - thread 15

Wavefront 2: thread 16 - thread 31

Wavefront 3: thread 32 - thread 47

Wavefront 4: thread 48 - thread 63


They say the same instruction repeats 4 times (over 4 cycles) on a SIMD pipeline, and in every lane.


Does that mean:


wavefront 1 is executed in 1st cycle

wavefront 2 is executed in 2nd cycle

wavefront 3 is executed in 3rd cycle

wavefront 4 is executed in 4th cycle
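
To make assumption 1 concrete, here is a minimal Python sketch of the mapping I am picturing. The names (`LANES`, `THREADS`, `map_thread`) are made up for illustration, and the mapping itself is only my guess, not something taken from the R700 ISA guide.

```python
LANES = 16                  # thread processors (lanes) per SIMD, as stated above
THREADS = 64                # threads I am assuming the SIMD handles together
CYCLES = THREADS // LANES   # 4 cycles per instruction

def map_thread(tid):
    """Return (cycle, lane) for a thread under assumption 1 (hypothetical mapping)."""
    if not 0 <= tid < THREADS:
        raise ValueError(f"thread id {tid} out of range")
    # Threads 0-15 go in the first cycle, 16-31 in the second, and so on.
    return tid // LANES, tid % LANES

# Group the thread ids by the cycle in which they would execute.
schedule = [
    [t for t in range(THREADS) if map_thread(t)[0] == c]
    for c in range(CYCLES)
]

for c, group in enumerate(schedule):
    print(f"cycle {c + 1}: threads {group[0]}-{group[-1]}")
```

If this guess is right, each group of 16 consecutive threads occupies all 16 lanes for one cycle; if it is wrong, I would like to know which part of the mapping is off.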



assumption 2:


Some say that a group of 64 threads is 1 wavefront. I read the R700 ISA guide, and I found this phrase under "Types of Shared Registers":

"shared registers enables data sharing between threads residing in a lane of different wavefronts and that are scheduled to execute on a given SIMD."

I was like "wait... what?!" My right eye twitched for 2 seconds.


Does the phrase "threads residing in a lane of different wavefronts" mean:


WFs  |  Threads/TP0 (lane0)  |  Threads/TP1 (lane1)  |  ...
WF0  |  T0                   |  T1                   |  ...
WF1  |  T16                  |  T17                  |  ...
WF2  |  T32                  |  T33                  |  ...
WF3  |  T48                  |  T49                  |  ...




Does "different wavefronts" mean 4 wavefronts lined up in a pipelined fashion?

Say, while T0-T15 are executing ALU instructions, T16-T31 are fetching (internally)? Is this how the R700 hardware tries to improve performance?

And why are there even-wavefront sections for clause temp GPRs?


assumption 3:


Some say that there are 64 threads in total per CU, and that the threads are divided into 16 quads (2x2 threads each). I guess there is more to the term "quad" than that.


Say each quad is executed on one VLIW core over the 4 cycles. Is this how it works?
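
Here is a minimal Python sketch of what I mean by assumption 3. The names (`QUAD_SIZE`, `map_thread_quad`) and the way I number the threads inside a quad are my own guesses for illustration, not taken from any AMD documentation.

```python
NUM_LANES = 16      # VLIW cores (lanes) per SIMD, as stated above
QUAD_SIZE = 4       # a quad is a 2x2 block of threads
NUM_THREADS = NUM_LANES * QUAD_SIZE   # 64 threads in total

def map_thread_quad(tid):
    """Return (lane, cycle) for a thread under assumption 3 (hypothetical mapping)."""
    if not 0 <= tid < NUM_THREADS:
        raise ValueError(f"thread id {tid} out of range")
    quad = tid // QUAD_SIZE        # which quad the thread belongs to (0-15)
    slot = tid % QUAD_SIZE         # position of the thread inside its quad (0-3)
    # Assumption: quad q stays on lane q, one of its threads issued per cycle.
    return quad, slot

# Show which thread each of the first two lanes would work on in every cycle.
for lane in range(2):
    threads = [t for t in range(NUM_THREADS) if map_thread_quad(t)[0] == lane]
    print(f"lane {lane}: threads {threads} over cycles 0-3")
```

Under this guess a lane works through the 4 threads of a single quad across the 4 cycles, instead of taking one thread from each group of 16.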



I guess all the assumptions are kind of similar, and I'm confused. Please help.




Thank you in advance for helping.