Is there latency in Clause Switching? If so, is it significant at all?
Micah,
How does one hide clause switching? And secondly, even if it is "hidden", there still has to be some latency, correct? If so, how much?
And this is per SIMD engine, correct?
So you can seamlessly slip from one ALU clause on one wavefront to another ALU clause on another wavefront if it is the same ALU clause code (block, as you called it)? There is no latency there?
I'm mostly wondering if there are any benefits to be had by reducing the number of CF instructions (essentially, the number of clauses)?
I don't want to go into too much detail; however, if you have a kernel with some code that looks like this, why would there be any advantage to running Kernel_2 as opposed to Kernel_1 with regard to reducing the number of CFs (as opposed to other coincidental optimizations)?
Essentially there is only one code path, since there is only one conditional block (in Kernel_1): it's either taken or not. In Kernel_2 there is no conditional block, but now every thread must execute those statements.
So let's assume there is no change in ALU or TEX instructions, but that the number of CFs has been reduced going from Kernel_1 to Kernel_2; why would this increase performance? (EDIT: assuming enough wavefronts to hide latency)
Kernel_1()
{
    ...
    ...
    if (...)
    {
        ...
        ...
        ...
    }
    ...
    ...
}

Kernel_2()
{
    ...
    ...
    // the if would go here
    ...
    ...
    ...
    ...
    ...
}
But you said that if you had enough wavefronts then it wouldn't matter?
So with the above example, let's say that kernel only uses 10 GPRs; that should allow enough WFs to hide all switching latency?
Another question I had is: Why would the increase not be linear for increasing thread count?
Assuming the same number of GPRs (and thus the same number of WFs, right?) used by each kernel.
Assuming we can isolate only the CF instructions...?
Micah,
Thank you. CF instructions take longer to execute than ALU instructions, apparently ~75% longer. Thanks again.
Micah,
So I'm curious (since I haven't figured it out yet) why performance would increase with an increase in thread count. Any idea off the top of your head?
For example, let's say kernel_1 uses 17 GPRs and kernel_2 uses 20 GPRs.
Now, from 1024x1024 threads to 3072x3072 shows a significant performance increase.
I would think that 1024x1024 threads would be enough to hide most any latency.
The ALU:Fetch ratio of kernel_1 is ~1.0 (no loops) and the ALU:Fetch ratio of kernel_2 is ~1.25 (no loops).
Micah,
Upon further tests, it turns out that I really didn't see much performance improvement at all (~2%) from significantly reducing the control flow in the kernel. So I'm not sure there is any big advantage to reducing control flow, outside of avoiding divergence.