17 Replies Latest reply on Dec 4, 2009 1:26 AM by rexiaoyu

    Why each thread processor can process 4 threads at a time?

    rexiaoyu

      The user guide says: "in a thread processor, up to 4 threads can issue 4 VLIW instruction over 4 cycles. ... For example, the 16 thread processors execute the same instructions, with each thread processor processing 4 threads at a time, this appears as a 64-wide SIMD engine". Why can each thread processor process 4 threads at a time? What does "at a time" mean? Obviously not within a single clock cycle.

        • Why each thread processor can process 4 threads at a time?
          frankas

           


           

          My somewhat educated guess is that the thread processor switches between 4 active threads to hide memory access latency. If 1 thread is waiting for memory access, it will switch over to an active one.

          Read section 1.2.7 Stream Processor Scheduling in the Stream Computing User Guide.

           

            • Why each thread processor can process 4 threads at a time?
              rexiaoyu

              frankas,

              Thank you. If all 4 threads in the thread processor are inactive, how does the thread processor schedule? Does it load 4 new threads to execute? If so, and the 4 new threads also stall before the previous threads become active again, will the thread processor load another 4? What is the maximum number of threads the thread processor can load?

              Is there some kind of "ready queue" and "stall queue" for the thread processor, where ready threads go into the ready queue, inactive threads go into the stall queue, and the thread processor switches them between the 2 queues?

            • Why each thread processor can process 4 threads at a time?
              MicahVillmow
              rexiaoyu,
              'At a time' can be thought of as happening in parallel, but in reality threads 0, 16, 32 and 48 execute sequentially, and the same goes for 1, 17, 33 and 49, etc. Threads 0-15 execute in wavefront cycle 0, 16-31 in wavefront cycle 1, 32-47 in wavefront cycle 2 and 48-63 in wavefront cycle 3.
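
The mapping described above can be sketched in a few lines of Python. This is only an illustration, assuming a 64-thread wavefront and 16 thread processors as stated in the thread; the helper name `placement` is made up for this sketch and is not part of any AMD tool.

```python
WAVEFRONT_SIZE = 64      # threads per wavefront, as stated in the thread
THREAD_PROCESSORS = 16   # thread processors per SIMD engine, as stated in the thread

def placement(thread_id):
    """Return (thread_processor, wavefront_cycle) for one thread of a wavefront.

    Hypothetical helper: it only encodes "threads 0-15 in cycle 0,
    16-31 in cycle 1, ..." from the post above.
    """
    if not 0 <= thread_id < WAVEFRONT_SIZE:
        raise ValueError("thread_id must be in 0..63")
    return thread_id % THREAD_PROCESSORS, thread_id // THREAD_PROCESSORS

# Threads 0, 16, 32 and 48 share thread processor 0 and run in cycles 0..3.
for t in (0, 16, 32, 48):
    print(t, placement(t))
```

Under this reading, "4 threads at a time" means four threads share one thread processor across the four wavefront cycles, not that they run in the same clock.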
                • Why each thread processor can process 4 threads at a time?
                  rexiaoyu

                   



                  As you said, within a wavefront, assuming threads 0, 16, 32 and 48 are executed by one thread processor, they are executed sequentially over 4 cycles (T0 in cycle 0, T16 in cycle 1, T32 in cycle 2, T48 in cycle 3). But in section 1.2.7, Stream Processor Scheduling, of the Stream Computing User Guide, it seems that T16 won't execute until T0 is stalled.

                • Why each thread processor can process 4 threads at a time?
                  MicahVillmow
                  rexiaoyu,
                  That section of the guide should be discussing scheduling of wavefronts on a simd, not scheduling within a wavefront.
                    • Why each thread processor can process 4 threads at a time?
                      rexiaoyu

                       



                      Micah,

                      But the figure in section 1.2.7 is titled "Simplified Execution Of Threads On A Single Thread Processor". It is talking about scheduling threads within a quad in a thread processor, and therefore within a wavefront.

                      According to the figure, the threads in a thread processor are scheduled in a block multithreading manner, which means another thread won't be scheduled until the running one is stalled by a memory access. But in other places on the forum, it appears to me that the threads are interleaved, which means a different thread is scheduled every cycle (e.g. 4 threads in a thread processor at a time: cycle 0 T0, cycle 1 T1, cycle 2 T2, cycle 3 T3).
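
The two scheduling styles contrasted here can be made concrete with a toy model. This is a sketch under simplifying assumptions, not a description of the actual hardware: each thread is a list of `"alu"` and `"mem"` steps, memory latency is ignored (a stalled thread is assumed ready again by the time it is revisited), and both function names are invented for the illustration.

```python
def block_multithreading(programs):
    """Run one thread until it issues a memory access, then switch."""
    pcs = [0] * len(programs)
    current, trace = 0, []
    while any(pc < len(p) for pc, p in zip(pcs, programs)):
        if pcs[current] == len(programs[current]):
            current = (current + 1) % len(programs)  # thread finished, move on
            continue
        op = programs[current][pcs[current]]
        pcs[current] += 1
        trace.append(current)
        if op == "mem":  # stall: hand the processor to the next thread
            current = (current + 1) % len(programs)
    return trace

def interleaved_multithreading(programs):
    """Switch to the next unfinished thread every cycle."""
    pcs = [0] * len(programs)
    current, trace = 0, []
    while any(pc < len(p) for pc, p in zip(pcs, programs)):
        if pcs[current] < len(programs[current]):
            pcs[current] += 1
            trace.append(current)
        current = (current + 1) % len(programs)
    return trace

programs = [["alu", "mem", "alu"], ["alu", "mem", "alu"]]
print(block_multithreading(programs))        # [0, 0, 1, 1, 0, 1]
print(interleaved_multithreading(programs))  # [0, 1, 0, 1, 0, 1]
```

Each trace lists which thread owns the processor in each cycle: the block variant changes thread only after a `"mem"` step, while the interleaved variant rotates every cycle.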

                      And I don't understand how the wavefronts are scheduled either. Some people say they are interleaved. Is that right? 4 cycles for wavefront A, and then another 4 cycles for wavefront B?

                    • Why each thread processor can process 4 threads at a time?
                      MicahVillmow
                      rexiaoyu,
                      In that example, T0, T1, etc. can each be thought of as a single wavefront, or a hardware thread. Also, notice that the figure is under the section Stream Processor Scheduling, so it is definitely about wavefronts and not individual threads. I'll see if I can get the wording corrected.
                      • Why each thread processor can process 4 threads at a time?
                        MicahVillmow
                        rexiaoyu,
                        They are not conflicting; they describe scheduling at different points of execution. 1 refers to the scheduling of wavefronts with respect to the ALU control flow clause; 2 refers to how the wavefronts execute an ALU bundle. There can be up to 128 ALU instructions packed into a maximum of 128 ALU bundles in a single ALU control flow clause.
                        • Why each thread processor can process 4 threads at a time?
                          MicahVillmow
                          rexiaoyu,
                          That is close but not quite correct. On a single SIMD, assuming full capacity, there are two wavefronts executing in 'parallel'; it actually switches to one wavefront while the other is finishing execution. Within a quad, 4 threads are executed every cycle, not one per cycle.
                            • Why each thread processor can process 4 threads at a time?
                              rexiaoyu

                              Within a quad, 4 threads are executed every cycle, not one per cycle? That doesn't make sense to me. Maybe I missed the point.

                              For example, here is an ALU CF clause, including ALU bundles 25-28:

                               

                              02 ALU: ADDR(274) CNT(100)
                                   25  x: MUL         T1.x,  R2.x,  (0x437F0000, 255.0f).x
                                       y: MUL         T0.y,  R2.y,  (0x437F0000, 255.0f).x
                                       z: MUL         T0.z,  R2.z,  (0x437F0000, 255.0f).x
                                       w: MUL         T0.w,  R0.x,  (0x437F0000, 255.0f).x      VEC_120
                                       t: MUL         T1.y,  R0.y,  (0x437F0000, 255.0f).x
                                   26  x: DOT4        ____,  PV25.x,  (0x3DE978D5, 0.1140000001f).x
                                       y: DOT4        R12.y,  PV25.y,  (0x3F1645A2, 0.5870000124f).y
                                       z: DOT4        ____,  PV25.z,  (0x3E991687, 0.298999995f).z
                                       w: DOT4        ____,  (0x80000000, 0.0f).w,  0.0f
                                   27  x: DOT4        ____,  T1.x,  0.5
                                       y: DOT4        ____,  T0.y,  (0xBEA99AE9, -0.3312599957f).x
                                       z: DOT4        R0.z,  T0.z,  (0xBE2CCA2E, -0.1687400043f).y
                                       w: DOT4        ____,  R0.w,  (0x43008000, 128.5f).z
                                       t: MUL         T1.z,  R0.z,  (0x437F0000, 255.0f).w
                                   28  x: DOT4        ____,  T1.x,  (0xBDA685DB, -0.08130999655f).x
                                       y: DOT4        ____,  T0.y,  (0xBED65E89, -0.418689996f).y
                                       z: DOT4        ____,  T0.z,  0.5
                                       w: DOT4        R4.w,  R0.w,  (0x43008000, 128.5f).z
                                       t: MUL         T1.x,  R3.x,  (0x437F0000, 255.0f).w
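
For readers less used to this ISA listing, bundles 25 and 26 can be read as plain scalar arithmetic for a single thread. The sketch below is one reading of the listing, not an authoritative decoding: it assumes `PV25.x` means the result of slot x in bundle 25, that the four `DOT4` slots each multiply and the products are summed into the one unmasked destination (`R12.y`), and it rounds the literals to 0.114, 0.587 and 0.299. The function names are invented for the illustration.

```python
def bundle_25(r2, r0):
    """Five independent MULs, one per VLIW slot (x, y, z, w, t)."""
    return {
        "x": r2["x"] * 255.0,  # T1.x
        "y": r2["y"] * 255.0,  # T0.y
        "z": r2["z"] * 255.0,  # T0.z
        "w": r0["x"] * 255.0,  # T0.w
        "t": r0["y"] * 255.0,  # T1.y
    }

def bundle_26(pv):
    """DOT4 over slots x, y, z, w; the sum is what lands in R12.y (assumed)."""
    return (pv["x"] * 0.114      # slot x
            + pv["y"] * 0.587    # slot y
            + pv["z"] * 0.299    # slot z
            + (-0.0) * 0.0)      # slot w multiplies two zero literals

pv25 = bundle_25({"x": 1.0, "y": 1.0, "z": 1.0}, {"x": 0.0, "y": 0.0})
print(bundle_26(pv25))
```

The point for the scheduling question is that one bundle is a single VLIW instruction for one thread: its x, y, z, w and t slots are filled by that thread's own operations, not by different threads.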

                              In my opinion, every bundle will be issued as a 5-way VLIW instruction, and each way is executed by a stream core of a thread processor.

                              Within a quad, assuming there are threads T0, T1, T2, T3, it will be executed like this:

                              cycle 0: 25 from T0 is executed,
                              cycle 1: 25 from T1 is executed,
                              cycle 2: 25 from T2 is executed,
                              cycle 3: 25 from T3 is executed.

                              The following 4 cycles (cycles 4-7) are occupied by another wavefront, and then:

                              cycle 8: 26 from T0 is executed,
                              cycle 9: 26 from T1 is executed,
                              cycle 10: 26 from T2 is executed,
                              cycle 11: 26 from T3 is executed,

                              and so on, until the whole ALU CF clause is finished.
                              Is anything wrong with this?




                            • Why each thread processor can process 4 threads at a time?
                              MicahVillmow
                              rexiaoyu,
                              It is just a way to hide latency and execute more threads in parallel.
                              • Why each thread processor can process 4 threads at a time?
                                MicahVillmow
                                Yes, T0-T3 execute in parallel, not sequentially. The reason is that a wavefront has 64 threads, and the high-end Radeon boards have 16 thread processors, set up as 4 quads. Each quad executes 4 threads in parallel, so 16 threads can execute in parallel. For this discussion, let's assume that those 16 threads set up inputs, execute the instruction and write out results in 1 cycle. That means it takes four cycles to execute a VLIW instruction for all the threads in a wavefront. There are two wavefronts executing in parallel, so we have 8 cycles. These are interleaved, with the even wavefront on the even cycles and the odd wavefront on the odd cycles. This repeats until all the ALU bundles in the ALU CF clause have been executed.
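
The timeline described in this reply can be written out as a small generator. It is a sketch of one reading of the post, under the same simplification it states (each group of 16 threads finishes a bundle in one cycle) and assuming the even wavefront owns the even cycles and the odd wavefront the odd cycles; `schedule` and its parameters are hypothetical names.

```python
THREADS_PER_CYCLE = 16   # 4 quads x 4 threads, as described above
WAVEFRONT_SIZE = 64      # threads per wavefront
QUARTERS = WAVEFRONT_SIZE // THREADS_PER_CYCLE  # 4 cycles per bundle per wavefront

def schedule(num_bundles, first_bundle=0):
    """Yield (cycle, wavefront, bundle, first_thread, last_thread) tuples.

    Assumption: two wavefronts alternate cycle by cycle, so one bundle
    takes 8 cycles to complete for both wavefronts.
    """
    cycle = 0
    for bundle in range(first_bundle, first_bundle + num_bundles):
        for quarter in range(QUARTERS):
            for wavefront in ("even", "odd"):
                first = quarter * THREADS_PER_CYCLE
                yield cycle, wavefront, bundle, first, first + THREADS_PER_CYCLE - 1
                cycle += 1

for row in schedule(num_bundles=1, first_bundle=25):
    print(row)
```

With `num_bundles=1` this prints eight rows: cycles 0, 2, 4, 6 carry threads 0-15, 16-31, 32-47, 48-63 of the even wavefront through bundle 25, and cycles 1, 3, 5, 7 do the same for the odd wavefront.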
                                  • Why each thread processor can process 4 threads at a time?
                                    rexiaoyu

                                    Oh, I think I am a little confused. "T0 - T3 execute in parallel, not sequentially" can be understood as follows: each thread contains the same instructions, and at some point one VLIW instruction is issued for 4 threads, executing on 4 different data items (i.e. SIMD). Taking the clause above as the example again, and assuming one instruction finishes in 1 cycle:

                                    Wavefront A:
                                    cycle 0, 25 is issued,
                                    cycle 1, 26 is issued (25 for T0 is finished),
                                    cycle 2, 27 is issued (25 for T1 is finished),
                                    cycle 3, 28 is issued (25 for T2 is finished),

                                    Wavefront B:
                                    cycle 4, 25 is issued (25 for T3 is finished),
                                    ...
                                    and so on.

                                    If anything is wrong, please help me correct the sequence above. Thank you.