
rexiaoyu
Journeyman III

Why can each thread processor process 4 threads at a time?

The user guide says: "in a thread processor, up to 4 threads can issue 4 VLIW instructions over 4 cycles. ... For example, the 16 thread processors execute the same instructions, with each thread processor processing 4 threads at a time, this appears as a 64-wide SIMD engine". Why can each thread processor process 4 threads at a time? What does "at a time" mean? Obviously not in a single clock cycle.

0 Likes
17 Replies
frankas
Journeyman III

Originally posted by: rexiaoyu The user guide says: "in a thread processor, up to 4 threads can issue 4 VLIW instructions over 4 cycles. ... For example, the 16 thread processors execute the same instructions, with each thread processor processing 4 threads at a time, this appears as a 64-wide SIMD engine". Why can each thread processor process 4 threads at a time? What does "at a time" mean? Obviously not in a single clock cycle.

 

My somewhat educated guess is that the thread processor switches between 4 active threads to hide memory access latency. If 1 thread is waiting for memory access, it will switch over to an active one.

Read section 1.2.7 Stream Processor Scheduling in the Stream Computing User Guide.

 

0 Likes

frankas,

Thank you. If all 4 threads in the thread processor are inactive, how does the thread processor schedule? Does it load 4 new threads to execute? If so, and the 4 new threads stall before the previous threads become active again, will the thread processor load yet another 4? In that case, how many threads can the thread processor hold at most?

Is there some kind of "ready queue" and "stall queue" for the thread processor, where ready threads go into the ready queue, inactive threads go into the stall queue, and the thread processor switches threads between the two queues?

0 Likes

rexiaoyu,
'At a time' can be thought of as happening in parallel, but in reality threads 0, 16, 32 and 48 execute sequentially, as do 1, 17, 33 and 49, and so on. Threads 0-15 execute in wavefront cycle 0, 16-31 in wavefront cycle 1, 32-47 in wavefront cycle 2 and 48-63 in wavefront cycle 3.
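
The mapping described in this reply can be written down as a small sketch. It assumes only the numbers quoted in this thread (64 threads per wavefront, 16 thread processors); the function names are hypothetical and this is not taken from any AMD tool or API.

```python
# Illustrative sketch only: constants come from the numbers quoted in this
# thread, and the helper names are hypothetical.
WAVEFRONT_SIZE = 64      # threads per wavefront
THREAD_PROCESSORS = 16   # thread processors working side by side


def placement(thread_id):
    """Return (thread_processor, wavefront_cycle) for a thread in one wavefront."""
    if not 0 <= thread_id < WAVEFRONT_SIZE:
        raise ValueError("thread_id must be in 0..63")
    # Threads 0-15 run in wavefront cycle 0, 16-31 in cycle 1, and so on.
    return thread_id % THREAD_PROCESSORS, thread_id // THREAD_PROCESSORS


def threads_on_processor(thread_processor):
    """List the threads one thread processor handles, in execution order."""
    return [t for t in range(WAVEFRONT_SIZE)
            if placement(t)[0] == thread_processor]


if __name__ == "__main__":
    print(threads_on_processor(0))  # threads sharing thread processor 0
    print(placement(49))            # processor and wavefront cycle of thread 49
```

Under these assumptions, each thread processor sees 4 threads of the wavefront, one per wavefront cycle, which is one way to read "4 threads at a time".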
0 Likes

Originally posted by: MicahVillmow rexiaoyu, 'At a time' can be thought of as happening in parallel, but in reality threads 0, 16, 32 and 48 execute sequentially, as do 1, 17, 33 and 49, and so on. Threads 0-15 execute in wavefront cycle 0, 16-31 in wavefront cycle 1, 32-47 in wavefront cycle 2 and 48-63 in wavefront cycle 3.


As you said, within a wavefront, assuming threads 0, 16, 32 and 48 are executed by one thread processor, they are executed sequentially over 4 cycles (T0 in cycle 0, T16 in cycle 1, T32 in cycle 2, T48 in cycle 3). But in section 1.2.7, Stream Processor Scheduling, of the stream user guide, it seems that T16 won't execute until T0 is stalled.

0 Likes

rexiaoyu,
That section of the guide should be discussing the scheduling of wavefronts on a SIMD, not scheduling within a wavefront.
0 Likes

Originally posted by: MicahVillmow rexiaoyu, That section of the guide should be discussing the scheduling of wavefronts on a SIMD, not scheduling within a wavefront.


Micah,

But the figure in section 1.2.7 is titled "Simplified Execution Of Threads On A Single Thread Processor". It is talking about the scheduling of threads within a quad in a thread processor, and therefore within a wavefront.

According to the figure, the threads in a thread processor are scheduled in a block multithreading manner, which means the next thread won't be scheduled until the current one is stalled by a memory access. But from other places in the forum, it appears to me that the threads are interleaved, which means a different thread is scheduled every cycle (e.g. 4 threads in a thread processor at a time: cycle 0 T0, cycle 1 T1, cycle 2 T2, cycle 3 T3).

And I don't understand how the wavefronts are scheduled either. Some people say they are interleaved, right? 4 cycles for wavefront A, and then another 4 cycles for wavefront B?

0 Likes

rexiaoyu,
In that example, T0, T1, etc. can each be thought of as a single wavefront, or a hardware thread. Also, notice that it is under the section on stream processor scheduling, so it definitely refers to wavefronts and not individual threads. I'll see if I can get the wording corrected.
0 Likes

Micah, 

Thanks, I will think it over.

My question still stands:

Are threads and wavefronts scheduled in an interleaved manner? The documentation doesn't say much about this.

0 Likes

Micah, 

The following is quoted from other threads in this forum:

1. From "Wavefront Question", with a small change of mine (marked with underline):

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=108669&highlight_key=y&keyword1=wa...

"The execution domain is broken into blocks of 64 threads, called wavefronts, which are scheduled to execute on a SIMD.

When executing on a SIMD, each wavefront is broken into 4 groups of 16, with each group executing on the four 2x2 blocks of thread processors per SIMD.
Each thread processor processes a 5-way VLIW instruction for a single thread, also called an ALU clause, or ALU bundle, or instruction group. 4 threads are interleaved over 4 cycles.

A wavefront continues executing on a SIMD for that ALU CF clause, which is made up of a number of ALU bundles (up to 128), after which it returns to the thread dispatcher until it is scheduled to execute again."

2. From "Calculating the Bottleneck"

http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=115872&STARTPAGE=2&FTVAR_FORUMVIEW...

"The ALUs execute a pair of wavefronts at any one time, in a pattern of cycles that goes AAAABBBBAAAA....

As for the wavefronts, although it is seen as 4 instr over 4 cycles, that assumes that both wavefronts are executing in parallel. So, wavefront A executes 4 instr over 4 cycles, then wavefront B executes, then A, then B. If B does not exist, A only executes every 8 cycles and not every 4."

According to 1, when a wavefront finishes its clause, it returns to the dispatcher and waits to be scheduled for execution again.

According to 2, the wavefronts are interleaved every 4 cycles.

Aren't these two conflicting?

 

0 Likes

rexiaoyu,
They are not conflicting; they describe scheduling at different points of execution. Quote 1 refers to the scheduling of wavefronts with respect to the ALU control flow clause, while quote 2 refers to how the wavefronts execute an ALU bundle. There can be up to 128 ALU instructions packed into a maximum of 128 ALU bundles in a single ALU control flow clause.
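
The bundle-level pattern from the "Calculating the Bottleneck" quote (AAAABBBBAAAA....) can be reproduced with a short sketch. The function name, the slot length parameter, and the "-" marker for an idle slot are assumptions made for illustration; the only facts used are the 4-cycle alternation and the statement that a lone wavefront executes only every 8 cycles.

```python
# Illustrative sketch only: issue_pattern and the '-' idle marker are
# hypothetical, built from the AAAABBBB description quoted in this thread.
def issue_pattern(cycles, wavefronts=("A", "B"), slot=4):
    """Label each cycle with the wavefront that owns it; '-' marks an idle slot."""
    if not 1 <= len(wavefronts) <= 2:
        raise ValueError("this sketch models one or two resident wavefronts")
    # The SIMD alternates between two slots of `slot` cycles each. If the
    # second wavefront does not exist, its slot is simply left idle.
    slots = list(wavefronts) + ["-"] * (2 - len(wavefronts))
    return "".join(slots[(c // slot) % 2] for c in range(cycles))


if __name__ == "__main__":
    print(issue_pattern(16))          # two wavefronts sharing the ALUs
    print(issue_pattern(16, ("A",)))  # a single wavefront, every other slot idle
```

This only models execution inside one ALU CF clause. The clause-level scheduling from quote 1, where a wavefront returns to the dispatcher after finishing its clause, is a separate layer that the sketch does not attempt to model.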
0 Likes

Yep, I got it. Thank you, Micah.

So within an ALU CF clause, a pair of wavefronts is interleaved every 4 cycles, and each wavefront returns to the dispatcher to wait for scheduling after every ALU CF clause. And within a quad, the 4 threads are interleaved, one per cycle.

I will assume so unless you tell me otherwise.

 

0 Likes

I don't understand why there are odd and even wavefronts. Can you explain this?

0 Likes

rexiaoyu,
That is close but not quite correct. On a single SIMD, assuming full capacity, there are two wavefronts executing in 'parallel'; it actually switches between wavefronts while the other wavefront is finishing execution. Within a quad, 4 threads are executed every cycle, not one per cycle.
0 Likes

Within a quad, 4 threads are executed every cycle, not one per cycle? That doesn't make sense to me. Maybe I missed the point.

For example, here is an ALU CF clause, including ALU bundles 25-28:

02 ALU: ADDR(274) CNT(100)
     25  x: MUL         T1.x,  R2.x,  (0x437F0000, 255.0f).x
         y: MUL         T0.y,  R2.y,  (0x437F0000, 255.0f).x
         z: MUL         T0.z,  R2.z,  (0x437F0000, 255.0f).x
         w: MUL         T0.w,  R0.x,  (0x437F0000, 255.0f).x      VEC_120
         t: MUL         T1.y,  R0.y,  (0x437F0000, 255.0f).x
     26  x: DOT4        ____,  PV25.x,  (0x3DE978D5, 0.1140000001f).x
         y: DOT4        R12.y,  PV25.y,  (0x3F1645A2, 0.5870000124f).y
         z: DOT4        ____,  PV25.z,  (0x3E991687, 0.298999995f).z
         w: DOT4        ____,  (0x80000000, 0.0f).w,  0.0f
     27  x: DOT4        ____,  T1.x,  0.5
         y: DOT4        ____,  T0.y,  (0xBEA99AE9, -0.3312599957f).x
         z: DOT4        R0.z,  T0.z,  (0xBE2CCA2E, -0.1687400043f).y
         w: DOT4        ____,  R0.w,  (0x43008000, 128.5f).z
         t: MUL         T1.z,  R0.z,  (0x437F0000, 255.0f).w
     28  x: DOT4        ____,  T1.x,  (0xBDA685DB, -0.08130999655f).x
         y: DOT4        ____,  T0.y,  (0xBED65E89, -0.418689996f).y
         z: DOT4        ____,  T0.z,  0.5
         w: DOT4        R4.w,  R0.w,  (0x43008000, 128.5f).z
         t: MUL         T1.x,  R3.x,  (0x437F0000, 255.0f).w

As I understand it, every bundle is issued as one 5-way VLIW instruction, and each way is executed by a stream core of a thread processor.

Within a quad, assuming it holds threads T0, T1, T2 and T3, execution would go like this:

cycle 0: bundle 25 from T0 is executed,
cycle 1: bundle 25 from T1 is executed,
cycle 2: bundle 25 from T2 is executed,
cycle 3: bundle 25 from T3 is executed.

The following 4 cycles (cycles 4-7) are occupied by another wavefront, and then:

cycle 8: bundle 26 from T0 is executed,
cycle 9: bundle 26 from T1 is executed,
cycle 10: bundle 26 from T2 is executed,
cycle 11: bundle 26 from T3 is executed,

and so on, until the whole ALU CF clause is finished.
Is anything wrong with this?




0 Likes

rexiaoyu,
It is just a way to hide latency and execute more threads in parallel.
0 Likes

Yes, T0-T3 execute in parallel, not sequentially. The reason is that a wavefront has 64 threads, and the high-end Radeon boards have 16 thread processors, set up as 4 quads. Each quad executes 4 threads in parallel, so 16 threads can be executed in parallel. For this discussion, let's assume that those 16 threads set up their inputs, execute the instruction and write out results in 1 cycle. That means it takes four cycles to execute a VLIW instruction for all the threads in a wavefront. There are two wavefronts executing in parallel, so we have 8 cycles. These are interleaved on the even and odd cycles for the even and odd wavefront. This repeats until all ALU bundles in the ALU CF clause have been executed.
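
The arithmetic in this reply (64 threads / 16 thread processors = 4 cycles per bundle per wavefront, 8 cycles for the even and odd wavefront together) can be laid out as a cycle-by-cycle sketch. It keeps the simplification stated above that a group of 16 threads completes a bundle in 1 cycle, and it assumes the two wavefronts alternate in 4-cycle blocks, following the AAAABBBB pattern quoted earlier in the thread. All names are hypothetical.

```python
# Illustrative sketch only: a simplified model of the description in this
# thread, not a statement of how the hardware is actually implemented.
WAVEFRONT_SIZE = 64
THREAD_PROCESSORS = 16  # 4 quads of 4 thread processors
CYCLES_PER_BUNDLE = WAVEFRONT_SIZE // THREAD_PROCESSORS  # 4


def clause_schedule(num_bundles, wavefronts=("even", "odd")):
    """Yield (cycle, wavefront, bundle, first_thread, last_thread) tuples."""
    cycle = 0
    for bundle in range(num_bundles):
        # Assumed: each wavefront gets a 4-cycle block per bundle, in turn.
        for wavefront in wavefronts:
            for step in range(CYCLES_PER_BUNDLE):
                first = step * THREAD_PROCESSORS
                last = first + THREAD_PROCESSORS - 1
                yield cycle, wavefront, bundle, first, last
                cycle += 1


if __name__ == "__main__":
    for row in clause_schedule(num_bundles=2):
        print("cycle %2d: %-4s wavefront, bundle %d, threads %2d-%2d" % row)
```

In this model a clause of n bundles occupies 8n cycles when both wavefronts are resident, and in every cycle 16 threads (4 per quad) execute the same bundle.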
0 Likes

Oh, I think I got a little confused. "T0-T3 execute in parallel, not sequentially" can be understood as follows: each thread contains the same instructions, and at a given time one VLIW instruction is issued for 4 threads, executing on 4 different pieces of data (hence SIMD). Still taking the clause above as an example, and assuming one instruction finishes in 1 cycle:

Wavefront A:

cycle 0, 25 is issued,
cycle 1, 26 is issued (25 for T0 is finished),
cycle 2, 27 is issued (25 for T1 is finished),
cycle 3, 28 is issued (25 for T2 is finished),

Wavefront B:

cycle 4, 25 is issued (25 for T3 is finished),
...

and so on.

If anything is wrong, please help me correct the sequence above. Thank you.

0 Likes