This might be the same question asked again. I don't understand how the CU works, particularly the threads. I'm a newbie to GPU programming and I'm trying to understand how the R700 works.
I understand that
each CU(SIMD pipeline) has 16 Thread processors or 16 VLIW cores or 16 lanes.
Okay, here are my assumptions:
clause1: add1, add2, mul1, div1
clause2: add1, add2, add3, div1
a CU has 64 threads, which are divided into 4 wavefronts.
Wavefront 1: thread 0 - thread 15
Wavefront 2: thread 16 - thread 31
Wavefront 3: thread 32 - thread 47
Wavefront 4: thread 48 - thread 63
they say the same instruction repeats 4 times (4 cycles) on a SIMD pipeline, in every lane.
does that mean:
wavefront 1 is executed in 1st cycle
wavefront 2 is executed in 2nd cycle
wavefront 3 is executed in 3rd cycle
wavefront 4 is executed in 4th cycle
some say that a group of 64 threads is 1 wavefront. I read the R700 ISA guide, and I found this phrase under Types of Shared Registers:
"shared registers enables data sharing between threads residing in a lane of different wavefronts and that are scheduled to execute on a given SIMD."
i was like "wait... what?!" my right eye twitched for 2 seconds.
the phrase "threads residing in a lane of different wavefronts", does it mean:
WFs | Threads/TP0(lane0) | Threads/TP1(lane1) | ...
WF0 | T0 | T1
WF1 | T16 | T17
WF2 | T32 | T33
WF3 | T48 | T49
does "different wavefronts" mean 4 wavefronts lined up in a pipelined fashion?
say, while T0-T15 is executing ALU, T16-T31 performs fetching (internally)? is this how the R700 h/w tries to improve performance?
and why are there even-wavefront sections for clause temp GPRs?
some say that there are 64 threads in total per CU, and that the threads are divided into 16 quads (2x2 threads). I guess there is more to the term "quad" than that.
say each quad is executed in one VLIW core over the 4 cycles. is this how it works?
i guess all of these assumptions are kinda similar, and i'm confused. please halp.
And thank you in advance for helping.
I guess I found the answer.
1 wavefront is 64 threads on R700, R600, Northern Islands and Evergreen.
You can run up to 1024 threads per CU, depending on LDS and/or GPR usage.
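To make the numbers above concrete for myself, here is a minimal sketch of the arithmetic. It only assumes the two figures already mentioned (64 threads per wavefront, at most 1024 threads per CU); the function names are made up for illustration, and the real limit also depends on LDS/GPR usage, which this does not model.

```python
WAVEFRONT_SIZE = 64        # threads per wavefront, as stated above
MAX_THREADS_PER_CU = 1024  # upper bound quoted above (LDS/GPR limits not modelled)

def wavefronts_needed(num_threads, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts needed to cover num_threads (ceiling division)."""
    return -(-num_threads // wavefront_size)

def wavefront_of(thread_id, wavefront_size=WAVEFRONT_SIZE):
    """Return (wavefront index, index of the thread inside that wavefront)."""
    return divmod(thread_id, wavefront_size)

# At most this many wavefronts can be resident on one CU at the same time.
max_resident_wavefronts = MAX_THREADS_PER_CU // WAVEFRONT_SIZE

print(max_resident_wavefronts)  # 16
print(wavefronts_needed(100))   # 2
print(wavefront_of(70))         # (1, 6)
```

So a partially filled wavefront still occupies a whole wavefront slot: 100 threads take 2 wavefronts, not 1.5.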
So, this is similar to a time-sharing system, where numerous threads wait for their turn to execute on one CPU and resources have to be synchronized (such as process state and thread state, which includes CPU state, stack, IO resources used, etc.).
The exception is that this is SIMD (DPP): it's like executing a single instruction over multiple data (resulting in a 64-wide vector unit). Wait, did i just spell out the abbreviation of SIMD?.. Yup, I did, in fact. And it is not like executing multiple threads in parallel as on the multiple cores of a CPU.
And the wavefront is something to keep in mind in GPU programming, especially when it comes to accessing memory/GPRs. In fact, wavefronts are like threads (not exactly, but from a CPU POV) that wait for their turn to execute, and the GPU hardware/software has to worry about synchronization.
And here is the difference between Southern Islands (7 series) and the other cards such as Evergreen, Northern Islands, R700 and R600 (6, 5, 4, 3 series):
Southern Islands executes threads from different wavefronts in parallel: 16 from WF0, 16 from WF1, 16 from WF2 and 16 from WF3 are executed at the same time.
The other cards execute the threads of one wavefront at a time, with that wavefront subdivided into 4 sub-wavefronts.
You might trip over the idea that the 7 and 6 series are similar in this respect. Nope, they're only similar in that they have 4 stream cores per TP (VLIW4 core), whereas the 5, 4 and 3 series have 5 stream cores per TP (VLIW5 core).
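Here is a rough sketch of the two execution patterns described above, just to visualise them. It assumes that consecutive thread IDs fill the 16 lanes in order, which is my simplification and not something I have confirmed from the ISA guide; the function names are hypothetical.

```python
LANES = 16                                 # lanes (thread processors) per SIMD
WAVEFRONT_SIZE = 64                        # threads per wavefront
SUB_WAVEFRONTS = WAVEFRONT_SIZE // LANES   # 4 sub-wavefronts of 16 threads

def vliw_schedule(wavefront_id):
    """R600/R700/Evergreen/Northern Islands as described above:
    one wavefront is issued as 4 sub-wavefronts, one per cycle.
    Returns a list of 4 cycles, each a list of (wavefront, thread) per lane.
    Assumption: thread IDs map to lanes in plain consecutive order."""
    return [
        [(wavefront_id, cycle * LANES + lane) for lane in range(LANES)]
        for cycle in range(SUB_WAVEFRONTS)
    ]

def southern_islands_cycle(cycle, wavefront_ids=(0, 1, 2, 3)):
    """Southern Islands as described above: in one cycle, 16 threads from
    each of 4 different wavefronts execute in parallel.
    Returns {wavefront: [thread ids active in this cycle]}."""
    return {
        wf: [cycle * LANES + lane for lane in range(LANES)]
        for wf in wavefront_ids
    }

sched = vliw_schedule(0)
print(len(sched), sched[1][0], sched[3][15])

si = southern_islands_cycle(0)
print(sorted(si), sum(len(threads) for threads in si.values()))
```

In the first pattern, 16 threads of a single wavefront are active per cycle; in the second, 64 threads are active per cycle but they come from 4 different wavefronts.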
All I did is get some sleep and I just learned some of the things in LDS.
Well..., my adventure continues