18 Replies Latest reply on Jan 18, 2010 3:17 PM by MicahVillmow

    Clause Switching

    ryta1203

      Is there latency in Clause Switching? If so, is it significant at all?

        • Clause Switching
          MicahVillmow
          Yes, there is a latency in clause switching: it is 40 cycles and needs to be hidden.
          • Clause Switching
            MicahVillmow
            Running more wavefronts on the ALUs will help with hiding clause switching.
            For example, say you have an ALU clause with 8 cycles of ALU. When a wavefront finishes that ALU clause, it will take 40 cycles to start the next clause. So, to hide that 40-cycle latency, you need 5 more wavefronts executing that same block to cover the ALU clause. If you only run two wavefronts, then you are stalling the SIMD for 32 of the 40 cycles. This is a simplistic view, but it should give you an idea.
              • Clause Switching
                ryta1203

                And this is per SIMD engine, correct?

                So you can seamlessly slip from one ALU clause on one wavefront to another ALU clause on another wavefront if it is the same ALU clause code (the block, as you called it)? There is no latency there?

                I'm mostly wondering whether there are any benefits to be had from reducing the number of CF instructions (essentially, the number of clauses)?

              • Clause Switching
                MicahVillmow
                Well, like I said, that was a simplified example, and you are not guaranteed that the wavefronts are on the same CF clause. My example also doesn't take into account dual-wavefront execution or other factors. But in general, reducing the number of control flow statements is a good thing.
                  • Clause Switching
                    ryta1203

                    I don't want to go into too much detail; however, if you have a kernel with some code that looks like this, why would there be any advantage to running Kernel_2 as opposed to Kernel_1 with regard to reducing the number of CFs (as opposed to other coincidental optimizations)?

                    Essentially there is only one code path, since there is only one conditional block (in Kernel_1): it's either taken or not. In Kernel_2 there is no conditional block, but now every thread must execute those statements.

                    So let's assume there is no change in ALU or TEX instructions, but the number of CFs is reduced in Kernel_2 relative to Kernel_1. Why would this increase performance? (EDIT: assuming enough wavefronts to hide latency)

                     

                    Kernel_1()
                    {
                        ...
                        ...
                        if (...) {
                            ...
                            ...
                            ...
                        }
                        ...
                        ...
                    }

                    Kernel_2()
                    {
                        ...
                        ...
                        // the if would go here
                        ...
                        ...
                        ...
                        ...
                        ...
                    }

                  • Clause Switching
                    MicahVillmow
                    If the only difference between kernel 1 and kernel 2 in the ISA is the control flow statement, then kernel 2 will execute 40 cycles faster. You can execute a maximum of 128 instructions in a single ALU CF. So, if you have the following, it is better to remove the control flow:
                    IF CF
                    10 ALU cycles
                    ELSE CF
                    10 ALU cycles
                    ENDIF CF
                    ALU CF 2

                    This executes in a minimum of 130 cycles and a maximum of 180 cycles to process the IF/ELSE/ENDIF CF clauses and start ALU CF 2 (minimum: one path taken, 40 + 10 + 40 + 40; maximum: both paths, 40 + 10 + 40 + 10 + 40 + 40).
                    If you do conditional moves instead of the if/else, you get this:
                    ALU CF
                    10 ALU cycles for if path
                    10 ALU cycles for else path
                    2-20 ALU cycles to conditionally move the results
                    ALU CF 2
                    In this situation, it takes a minimum of 102 cycles and a maximum of 120 cycles to process the ALU CF and start ALU CF 2.


                    A lot of ALU can happen in the same time as a clause switch.
                      • Clause Switching
                        ryta1203

                        But you said that if you had enough wavefronts then it wouldn't matter?

                        So with the above example, let's say that kernel only uses 10 GPRs; that should allow enough WFs to hide all switching latency?

                        Another question I had is: Why would the increase not be linear for increasing thread count?

                      • Clause Switching
                        MicahVillmow
                        ryta,
                        It won't matter in the sense that you won't be stalling the SIMD. But you are requiring more wavefronts to avoid stalling the SIMD, and that can cause different performance bottlenecks: resource requirements, cache thrashing, etc.
                        • Clause Switching
                          MicahVillmow
                          Although you will be able to hide the latency of the CF instructions with multiple wavefronts, the time it takes a single wavefront to execute can, in many cases, be shorter if the CF instruction is replaced with ALU instructions. Using my example above, even if you had enough wavefronts in the first case to avoid stalling the GPU, the second case would still perform better, because each wavefront needs only ~75% of the time of the first case to execute.
                          • Clause Switching
                            MicahVillmow
                            Ryta,
                            CF instructions take about 40 cycles. ALU instructions' amortized execution is one per cycle, so a CF instruction takes about 40x longer.
                            • Clause Switching
                              MicahVillmow
                              ryta,
                              Most likely there is latency that is not being hidden, and more threads allow it to be hidden. Without a concrete example, I can't give much more information.
                                • Clause Switching
                                  ryta1203

                                  For example, let's say kernel_1 uses 17 GPRs and kernel_2 uses 20 GPRs.

                                  Now, from 1024x1024 threads to 3072x3072 shows a significant performance increase.

                                  I would think that 1024x1024 threads would be enough to hide almost any latency.

                                  The ALU:Fetch ratio of kernel_1 is ~1.0 (no loops) and the ALU:Fetch ratio of kernel_2 is ~1.25 (no loops).

                                    • Clause Switching
                                      ryta1203

                                      Micah,

                                      Upon further testing, it turns out that I really didn't see much performance improvement at all (~2%) from significantly reducing the control flow in the kernel, so I'm not sure there is any big advantage to reducing control flow outside of avoiding divergence.

                                  • Clause Switching
                                    MicahVillmow
                                    Ryta,
                                    Then something else is the bottleneck in your kernel. Unless I see the actual ISA, I can't really determine it from the SKA stats.