18 Replies Latest reply on Jan 18, 2010 3:17 PM by MicahVillmow

    Clause Switching

    ryta1203

      Is there latency in Clause Switching? If so, is it significant at all?

        • Clause Switching
          MicahVillmow
          Yes, there is a latency in clause switching: it is 40 cycles and needs to be hidden.
          • Clause Switching
            MicahVillmow
            Running more wavefronts on the ALUs will help with hiding clause switching.
            For example, say you have an ALU clause with 8 cycles of ALU. When a wavefront finishes that ALU clause, it will take 40 cycles to start the next clause. So, to hide that 40-cycle latency, you need 5 more wavefronts executing that same block to cover the ALU clause. If you only run two wavefronts, then you are stalling the SIMD for 32 of the 40 cycles. This is a simplistic view, but it should give you an idea.
              • Clause Switching
                ryta1203

                And this is per SIMD engine, correct?

                So you can seamlessly slip from one ALU clause on one wavefront to another ALU clause on another wavefront if it is the same ALU clause code (the block, as you called it)? There is no latency there?

                I'm mostly wondering whether there are any benefits to be had from reducing the number of CF instructions (essentially, the number of clauses)?

              • Clause Switching
                MicahVillmow
                Well, like I said, that was a simplified example, and you are not guaranteed that the wavefronts are on the same CF clause. My example also doesn't take into account dual-wavefront execution or other factors. But in general, reducing the number of control flow statements is a good thing.
                  • Clause Switching
                    ryta1203

                    I don't want to go into too much detail; however, if you have a kernel with some code that looks like this, why would there be any advantage to running Kernel_2 as opposed to Kernel_1 with regard to reducing the number of CFs (as opposed to other coincidental optimizations)?

                    Essentially there is only one code path, since there is only one conditional block (in Kernel_1): it's either taken or not. In Kernel_2 there is no conditional block, but now every thread must execute those statements.

                    So let's assume there is no change in ALU or TEX instructions, but the number of CFs is reduced in Kernel_2 relative to Kernel_1. Why would this increase performance? (EDIT: assuming enough wavefronts to hide latency)

                     

                    Kernel_1()
                    {
                        ...
                        ...
                        if (...) {
                            ...
                            ...
                            ...
                        }
                        ...
                        ...
                    }

                    Kernel_2()
                    {
                        ...
                        ...
                        // the if would go here
                        ...
                        ...
                        ...
                        ...
                        ...
                    }

                  • Clause Switching
                    MicahVillmow
                    If the only difference between kernel 1 and kernel 2 in the ISA is the control flow statement, then kernel 2 will execute 40 cycles faster. You can execute a maximum of 128 instructions in a single ALU CF. So, if you have the following, it is better to remove the control flow:
                    IF CF
                    10 ALU cycles
                    ELSE CF
                    10 ALU cycles
                    ENDIF CF
                    ALU CF 2

                    This executes in a minimum of 130 cycles and a maximum of 180 cycles to process the IF/ELSE/ENDIF CF clauses and start ALU CF 2 (minimum: one path taken, 40 + 10 + 40 + 40; maximum: both paths, 40 + 10 + 40 + 10 + 40 + 40).
                    If you do conditional moves instead of the if/else, you get this:
                    ALU CF
                    10 ALU cycles for if path
                    10 ALU cycles for else path
                    2-20 ALU cycles to conditionally move the results
                    ALU CF 2
                    In this situation, it takes a minimum of 102 cycles and a maximum of 120 cycles to process the ALU CF and start ALU CF 2.


                    A lot of ALU can happen in the same time as a clause switch.
                      • Clause Switching
                        ryta1203

                        But you said that if you had enough wavefronts then it wouldn't matter?

                        So with the above example, let's say that kernel only uses 10 GPRs; that should allow enough WFs to hide all switching latency?

                        Another question I had is: Why would the increase not be linear for increasing thread count?

                      • Clause Switching
                        MicahVillmow
                        ryta,
                        It won't matter in the sense that you won't be stalling the SIMD. But you are requiring more wavefronts to avoid stalling the SIMD, and that can cause different performance bottlenecks: resource requirements, cache thrashing, etc.
                        • Clause Switching
                          MicahVillmow
                          Although you will be able to hide the latency of the CF instructions with multiple wavefronts, the time it takes a single wavefront to execute can, in many cases, be shorter if the CF instruction is replaced with ALU instructions. Using my example above, even if you had enough wavefronts in the first case to avoid stalling the GPU, the second case would still perform better, because each wavefront needs only ~75% of the time of the first case to execute.
                          • Clause Switching
                            MicahVillmow
                            Ryta,
                            CF instructions take about 40 cycles. ALU instructions' amortized execution is one per cycle, so a CF instruction takes about 40x longer.
                            • Clause Switching
                              MicahVillmow
                              ryta,
                              Most likely there is latency that is not being hidden, and more threads allow it to be hidden. Without a concrete example, I can't give much more information.
                                • Clause Switching
                                  ryta1203

                                  For example, let's say kernel_1 uses 17 GPRs and kernel_2 uses 20 GPRs.

                                  Now, from 1024x1024 threads to 3072x3072 shows a significant performance increase.

                                  I would think that 1024x1024 threads would be enough to hide almost any latency.

                                  The ALU:Fetch ratio of kernel_1 is ~1.0 (no loops) and the ALU:Fetch ratio of kernel_2 is ~1.25 (no loops).

                                    • Clause Switching
                                      ryta1203

                                      Micah,

                                      Upon further testing, it turns out that I really didn't see much performance improvement at all (~2%) from significantly reducing the control flow in the kernel, so I'm not sure there is any big advantage to reducing control flow outside of avoiding divergence.

                                  • Clause Switching
                                    MicahVillmow
                                    Ryta,
                                    Then something else is the bottleneck in your kernel. Unless I see the actual ISA, I can't really determine it from the SKA stats.