This might be the same question asked again. I don't understand how the CU works, particularly the threads. I'm a newbie to GPU programming and I'm trying to understand how the R700 works.
I understand that
each CU (SIMD pipeline) has 16 thread processors, also called 16 VLIW cores or 16 lanes.
Okay, here are my assumptions:
clause1: add1, add2, mul1, div1
clause2: add1, add2, add3, div1
A CU has 64 threads, which are divided into 4 wavefronts of 16 threads each.
Wavefront 1: thread 0 - thread 15
Wavefront 2: thread 16 - thread 31
Wavefront 3: thread 32 - thread 47
Wavefront 4: thread 48 - thread 63
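To make this assumption concrete, here is a minimal Python sketch of the mapping I have in mind. The constants and function names are my own (not from the R700 ISA guide), and it only models my assumption of 4 wavefronts of 16 threads, not confirmed hardware behaviour.

```python
# Sketch of my assumption only, not confirmed R700 behaviour.
THREADS_PER_CU = 64   # assumed number of threads on one CU
LANES = 16            # thread processors (lanes) per CU


def wavefront_of(tid):
    """0-based wavefront index of thread `tid` under this assumption."""
    return tid // LANES


def wavefront_ranges():
    """Inclusive (first, last) thread IDs of each assumed wavefront."""
    count = THREADS_PER_CU // LANES
    return [(w * LANES, w * LANES + LANES - 1) for w in range(count)]


if __name__ == "__main__":
    for w, (first, last) in enumerate(wavefront_ranges(), start=1):
        print(f"Wavefront {w}: thread {first} - thread {last}")
```

With 0-based thread IDs, the last wavefront in this sketch ends at thread 63, since 64 threads are numbered 0 through 63.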
They say the same instruction repeats 4 times (over 4 cycles) on a SIMD pipeline, and in every lane.
Does that mean:
wavefront 1 is executed in 1st cycle
wavefront 2 is executed in 2nd cycle
wavefront 3 is executed in 3rd cycle
wavefront 4 is executed in 4th cycle
Some say that a group of 64 threads is 1 wavefront. I read the R700 ISA guide, and I found this phrase under "Types of Shared Registers":
"shared registers enables data sharing between threads residing in a lane of different wavefronts and that are scheduled to execute on a given SIMD."
i was like "wait... what?!" my right eye twitched for 2 seconds.
the phrase "threads residing in a lane of different wavefronts", does it mean:
WFs | Threads/TP0 (lane0) | Threads/TP1 (lane1) | ...
WF0 | T0                  | T1                  | ...
WF1 | T16                 | T17                 | ...
WF2 | T32                 | T33                 | ...
WF3 | T48                 | T49                 | ...
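The table above can be sketched in Python as follows. This is only my reading of "threads residing in a lane of different wavefronts", assuming 4 wavefronts of 16 threads; the names are hypothetical and nothing here is taken from the ISA guide.

```python
# Sketch of my reading of the phrase, not confirmed R700 behaviour.
LANES = 16       # lanes (thread processors) per SIMD
WAVEFRONTS = 4   # assumed wavefronts of 16 threads each


def threads_in_lane(lane):
    """Thread IDs that would share `lane`, one per assumed wavefront."""
    return [wf * LANES + lane for wf in range(WAVEFRONTS)]


if __name__ == "__main__":
    for lane in range(2):
        row = ", ".join(f"T{t}" for t in threads_in_lane(lane))
        print(f"lane{lane}: {row}")
```

Under this reading, a shared register would be visible to the 4 threads returned for a given lane.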
Does "different wavefronts" mean 4 wavefronts arranged in a pipelined fashion?
Say, T0-T15 are executing ALU instructions while T16-T31 are fetching (internally)? Is this how the R700 hardware tries to improve performance?
And why are there even-wavefront sections for the clause temp GPRs?
Some say that there are 64 threads in total per CU, and that the threads are divided into 16 quads (2x2 threads each). I guess there is more to the term "quad" than that.
Say each quad is executed on one VLIW core over the 4 cycles. Is this how it works?
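Here is a rough Python sketch of the quad assumption, so it is clear what I mean. It assumes quads are just consecutive groups of 4 thread IDs (ignoring the real 2x2 layout) and that quad q runs on VLIW core q, one thread per cycle. All names are my own, and this is a guess, not confirmed behaviour.

```python
# Sketch of the quad assumption only, not confirmed R700 behaviour.
THREADS = 64     # assumed threads per CU
QUAD_SIZE = 4    # threads per quad (2x2)


def quad_schedule():
    """Map (core, cycle) -> thread ID under the quad assumption."""
    schedule = {}
    for tid in range(THREADS):
        core = tid // QUAD_SIZE   # assumed: quad q runs on VLIW core q
        cycle = tid % QUAD_SIZE   # assumed: one thread of the quad per cycle
        schedule[(core, cycle)] = tid
    return schedule


if __name__ == "__main__":
    sched = quad_schedule()
    for cycle in range(QUAD_SIZE):
        print(f"cycle {cycle}: core 0 runs T{sched[(0, cycle)]}")
```

In this model all 16 cores are busy in each of the 4 cycles, and every one of the 64 threads is issued exactly once.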
I guess all of these assumptions are kind of similar, and I'm confused. Please help.
Thank you in advance for helping.