Originally posted by: rexiaoyu The user guide says that "in a thread processor, up to 4 threads can issue 4 VLIW instructions over 4 cycles. ... For example, the 16 thread processors execute the same instructions, with each thread processor processing 4 threads at a time, this appears as a 64-wide SIMD engine". Why can each thread processor process 4 threads at a time? What does "at a time" mean? Obviously not in one clock.
My somewhat educated guess is that the thread processor switches between 4 active threads to hide memory access latency. If 1 thread is waiting for memory access, it will switch over to an active one.
Read section 1.2.7 Stream Processor Scheduling in the Stream Computing User Guide.
Thank you. If all 4 threads in the thread processor are inactive, how does the thread processor schedule? Does it load 4 new threads to execute? If so, and the 4 new threads stall again before the previous threads become active, will the thread processor load another 4? Then how many threads can the thread processor load at most?
Is there some kind of "ready queue" and "stall queue" for the thread processor? Ready threads go into the ready queue, inactive threads go into the stall queue, and the thread processor switches them between the 2 queues?
'At a time' can be thought of happening in parallel, but in reality, threads 0, 16, 32 and 48 execute sequentially, same with 1, 17, 33, and 49, etc... Threads 0-15 execute in wavefront cycle 0, 16-31 in wavefront cycle 1, 32-47 in wavefront cycle 2 and 48-63 in wavefront cycle 3.
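To make the mapping concrete, here is a rough Python sketch (not from the user guide). It assumes a 64-thread wavefront on 16 thread processors, and that thread N runs on thread processor N mod 16, which is what the 0/16/32/48 and 1/17/33/49 groupings above imply. The names `slot` and `threads_on` are made up for illustration.

```python
WAVEFRONT_SIZE = 64      # threads per wavefront (from the post above)
THREAD_PROCESSORS = 16   # thread processors per SIMD (from the user guide quote)


def slot(thread_id):
    """Return (thread_processor, wavefront_cycle) for a thread in a wavefront.

    Assumption: thread N runs on thread processor N % 16 during
    wavefront cycle N // 16.
    """
    if not 0 <= thread_id < WAVEFRONT_SIZE:
        raise ValueError("thread_id must be in 0..63")
    return thread_id % THREAD_PROCESSORS, thread_id // THREAD_PROCESSORS


def threads_on(thread_processor):
    """List the threads of one wavefront handled by one thread processor."""
    return [t for t in range(WAVEFRONT_SIZE) if slot(t)[0] == thread_processor]


if __name__ == "__main__":
    print(threads_on(0))   # threads that share thread processor 0
    print(slot(17))        # thread 17: which processor, which wavefront cycle
```

Under these assumptions each thread processor sees exactly 4 threads of the wavefront, one per wavefront cycle, while 16 different threads run side by side in any given cycle.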
As you said, in a wavefront, assuming threads 0, 16, 32 and 48 are executed by one thread processor, they are executed sequentially over 4 cycles (T0 in cycle 0, T16 in cycle 1, T32 in cycle 2, T48 in cycle 3). But in section 1.2.7, Stream Processor Scheduling, of the Stream Computing User Guide, it seems that T16 won't execute until T0 is stalled.
That section of the guide should be discussing scheduling of wavefronts on a simd, not scheduling within a wavefront.
But the figure in section 1.2.7 is titled "Simplified Execution Of Threads On A Single Thread Processor". It is talking about thread scheduling within a quad in a thread processor, and of course within a wavefront.
According to the figure, the threads in a thread processor are scheduled in a block multithreading manner, which means a thread is not switched out until it is stalled by a memory access. But elsewhere in the forum, it appears to me that the threads are interleaved, which means a different thread is scheduled every cycle (e.g. 4 threads in a thread processor at a time: cycle 0 T0, cycle 1 T1, cycle 2 T2, cycle 3 T3).
And I don't understand how the wavefronts are scheduled either. Some people say they are interleaved, right? 4 cycles for wavefront A, and another 4 cycles for wavefront B?
In that example, T0, T1, etc. can each be thought of as a single wavefront, or a hardware thread. Also, if you will notice, it is under the section on stream processor scheduling, so it definitely is wavefronts and not individual threads. I'll see if I can get the wording corrected.
Thanks, I will think it over.
My question still stands:
Are threads and wavefronts scheduled in an interleaved manner? The doc doesn't say much about this.
The following is referenced from other threads in this forum:
1. From "Wavefront Question", with a small change of mine (marked with underline)
"The execution domain is broken into blocks of 64 threads, called wavefronts, which are scheduled to execute on a SIMD.
When executing on a SIMD, each wavefront is broken into 4 groups of 16, with each group executing on the four 2x2 blocks of thread processors per SIMD.
Each thread processor processes a 5-way VLIW instruction for a single thread, also called an ALU clause, or ALU bundle, or instruction group. 4 threads are interleaved over 4 cycles.
A wavefront continues executing on a SIMD for that ALU CF clause, which is made up of many ALU bundles (up to 128), after which it returns to the thread dispatcher until it is scheduled to execute again."
2. From "Calculating the Bottleneck"
"The ALUs execute a pair of wavefronts at any one time, in a pattern of cycles that goes AAAABBBBAAAA.... As for the wavefronts, although it is seen as 4 instr over 4 cycles, that assumes that both wavefronts are executing in parallel. So, wavefront A executes 4 instr over 4 cycles, then wavefront B executes, then A, then B. If B does not exist, A only executes every 8 cycles and not every 4."
According to 1, when the wavefront finishes its clause, it returns to the dispatcher and waits to be scheduled again.
According to 2, the wavefronts are interleaved every 4 cycles.
These sound conflicting?
They are not conflicting but describing scheduling at different points of execution. 1 references scheduling of wavefronts with respect to the ALU control flow clause, 2 references how the wavefronts execute an ALU bundle. There can be up to 128 ALU instructions packed into a max of 128 ALU bundles in a single ALU control flow clause.
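Here is a small Python sketch of the bundle-level pattern from quote 2. It only models which wavefront owns the ALUs in each cycle. It assumes 4 cycles per ALU bundle per wavefront and at most two wavefronts on the SIMD, and it marks cycles where no wavefront issues with '.'. The helper `issue_pattern` is hypothetical, not something from the SDK, and clause-level scheduling by the dispatcher is not modeled at all.

```python
def issue_pattern(bundles, wavefronts=("A", "B"), cycles_per_bundle=4):
    """Return one character per cycle showing which wavefront is issuing.

    Assumptions: each wavefront label is a single character, each wavefront
    holds the ALUs for `cycles_per_bundle` cycles per ALU bundle, and the
    SIMD always alternates between two slots even if one of them is empty.
    """
    if not 1 <= len(wavefronts) <= 2:
        raise ValueError("this sketch models one or two wavefronts only")
    # Pad to two slots so a missing wavefront shows up as idle cycles.
    slots = list(wavefronts) + ["."] * (2 - len(wavefronts))
    pattern = []
    for _ in range(bundles):
        for owner in slots:
            pattern.append(owner * cycles_per_bundle)
    return "".join(pattern)


if __name__ == "__main__":
    print(issue_pattern(2))           # two wavefronts sharing the SIMD
    print(issue_pattern(2, ("A",)))   # wavefront A alone
```

With two wavefronts the string reads AAAABBBB repeated once per bundle. With only A present, A still issues just once every 8 cycles, which matches the last sentence of quote 2.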
yeap, I got it. Thank you, Micah.
So it can be thought of like this: within an ALU CF clause, a pair of wavefronts is interleaved every 4 cycles, and a wavefront returns to the dispatcher to wait for scheduling after every ALU CF clause. And within a quad, 4 threads are interleaved every cycle.
I will assume so if you don't tell me otherwise
I don't understand why there are odd wavefront and even wavefront, can you explain this?
That is close but not quite correct. On a single SIMD, assuming full capacity, there are two wavefronts executing in 'parallel'; the SIMD actually switches between wavefronts while the other wavefront is finishing execution. Within a quad, 4 threads are executed every cycle, not one per cycle.
Within a quad, 4 threads are executed every cycle, not one per cycle? It doesn't make sense to me. Maybe I didn't get the point.
For example, here is an ALU CF clause, including ALU bundles 25 - 28:
02 ALU: ADDR(274) CNT(100)
25 x: MUL T1.x, R2.x, (0x437F0000, 255.0f).x
y: MUL T0.y, R2.y, (0x437F0000, 255.0f).x
z: MUL T0.z, R2.z, (0x437F0000, 255.0f).x
w: MUL T0.w, R0.x, (0x437F0000, 255.0f).x VEC_120
t: MUL T1.y, R0.y, (0x437F0000, 255.0f).x
26 x: DOT4 ____, PV25.x, (0x3DE978D5, 0.1140000001f).x
y: DOT4 R12.y, PV25.y, (0x3F1645A2, 0.5870000124f).y
z: DOT4 ____, PV25.z, (0x3E991687, 0.298999995f).z
w: DOT4 ____, (0x80000000, 0.0f).w, 0.0f
27 x: DOT4 ____, T1.x, 0.5
y: DOT4 ____, T0.y, (0xBEA99AE9, -0.3312599957f).x
z: DOT4 R0.z, T0.z, (0xBE2CCA2E, -0.1687400043f).y
w: DOT4 ____, R0.w, (0x43008000, 128.5f).z
t: MUL T1.z, R0.z, (0x437F0000, 255.0f).w
28 x: DOT4 ____, T1.x, (0xBDA685DB, -0.08130999655f).x
y: DOT4 ____, T0.y, (0xBED65E89, -0.418689996f).y
z: DOT4 ____, T0.z, 0.5
w: DOT4 R4.w, R0.w, (0x43008000, 128.5f).z
t: MUL T1.x, R3.x, (0x437F0000, 255.0f).w
In my opinion, every bundle will be issued as a 5-way VLIW instruction, and each way is executed by a stream core of a thread processor.
Within a quad, assuming there are threads T0, T1, T2, T3, it will be executed like this:
cycle 0: 25 from T0 is executed,
cycle 1: 25 from T1 is executed,
cycle 2: 25 from T2 is executed,
cycle 3: 25 from T3 is executed,
The following 4 cycles (cycles 4 - 7) are occupied by another wavefront,
cycle 8: 26 from T0 is executed,
cycle 9: 26 from T1 is executed,
cycle 10: 26 from T2 is executed,
cycle 11: 26 from T3 is executed, and so on, until the whole ALU CF clause is finished. Is anything wrong?
It is just a way to hide latency and execute more threads in parallel.
Yes, T0-T3 execute in parallel, not sequentially. The reason is that within a wavefront there are 64 threads, and on the high-end Radeon boards there are 16 thread processors, set up as 4 quads. Each quad executes 4 threads in parallel, so 16 threads can be executed in parallel. In this discussion, let's assume that those 16 threads set up inputs, execute the instruction and write out results in 1 cycle. That means it takes four cycles to execute a VLIW instruction for all the threads in a wavefront. There are two wavefronts executing in parallel, so we have 8 cycles. These are interleaved on the even and odd cycles for the even and odd wavefront. This repeats until all ALU bundles in the ALU CF clause have been executed.
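For anyone who wants to see that arithmetic laid out, here is a rough Python sketch of the timeline just described. It is a simplification under stated assumptions: 64 threads per wavefront, 16 thread processors, one cycle per group of 16 threads, and the two wavefronts alternating in 4-cycle blocks (the AAAABBBB pattern quoted earlier in the thread, which is my reading of "even and odd"). The function `timeline` and the labels "even" and "odd" are made up for illustration.

```python
THREADS_PER_WAVEFRONT = 64
THREAD_PROCESSORS = 16
# 64 threads / 16 thread processors = 4 cycles per VLIW bundle per wavefront.
CYCLES_PER_BUNDLE = THREADS_PER_WAVEFRONT // THREAD_PROCESSORS


def timeline(bundles, wavefronts=("even", "odd")):
    """Yield (cycle, wavefront, bundle, threads) for every issue cycle.

    Assumption: each wavefront runs one bundle for all 64 of its threads
    (16 threads per cycle) before the other wavefront takes over.
    """
    cycle = 0
    for bundle in bundles:
        for wavefront in wavefronts:
            for step in range(CYCLES_PER_BUNDLE):
                first = step * THREAD_PROCESSORS
                yield cycle, wavefront, bundle, range(first, first + THREAD_PROCESSORS)
                cycle += 1


if __name__ == "__main__":
    # Bundles 25-28 from the ALU CF clause posted above.
    for cycle, wavefront, bundle, threads in timeline([25, 26, 27, 28]):
        print(f"cycle {cycle:2d}: wavefront {wavefront:4s} bundle {bundle} "
              f"threads {threads.start}-{threads.stop - 1}")
```

Under these assumptions each bundle costs 8 cycles for the pair of wavefronts, so the four bundles 25-28 take 32 cycles in total, and a given wavefront starts its next bundle 8 cycles after it started the previous one.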
Oh, I think I got a little confused. "T0 - T3 execute in parallel, not sequentially" can be understood as follows: each thread contains the same instructions, and at a given time one VLIW instruction is issued for 4 threads, executing on 4 different data items (hence SIMD). Still taking the above as an example, and assuming one instruction finishes in 1 cycle:
cycle 0, 25 is issued,
cycle 1, 26 is issued (25 for T0 is finished),
cycle 2, 27 is issued (25 for T1 is finished),
cycle 3, 28 is issued (25 for T2 is finished),
cycle 4, 25 is issued (25 for T3 is finished),
and so on.
If anything is wrong, please help me correct the above sequence. Thank you.