13 Replies Latest reply on Apr 8, 2011 12:29 PM by ryta1203

    Large switch statement performance

    keldor314
      Not using the JUMPTABLE instruction??

      I have code with a large switch statement (over 100 cases), and I've noticed that performance is very poor as compared to a small switch statement.  Is there a reason for this?  Shouldn't the JUMPTABLE instruction be emitted?

        • Large switch statement performance
          tanq

          All threads in a wavefront share the same execution unit (which controls the program flow), so threads can't branch individually. With 100 cases in the switch, the execution unit will execute every one of those 100 cases, since any case block may be required by some thread.
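
          A toy host-side C model of the lockstep behaviour described above. It assumes a 64-lane wavefront and, as a simplification, that a case body only has to be issued when at least one lane selects it; `WAVEFRONT_SIZE`, `NUM_CASES` and `count_issued_cases` are made-up names for illustration, not anything from the hardware or the SDK.

```c
#include <stdint.h>

#define WAVEFRONT_SIZE 64   /* lanes per wavefront (assumed) */
#define NUM_CASES      100  /* size of the hypothetical switch */

/* Counts how many case bodies a lockstep wavefront has to issue.
 * selector[i] is the switch value of lane i (0 .. NUM_CASES-1). */
int count_issued_cases(const int selector[WAVEFRONT_SIZE])
{
    int issued = 0;
    for (int c = 0; c < NUM_CASES; ++c) {
        /* Build the execution mask: which lanes want case c? */
        uint64_t mask = 0;
        for (int lane = 0; lane < WAVEFRONT_SIZE; ++lane) {
            if (selector[lane] == c)
                mask |= (uint64_t)1 << lane;
        }
        /* In this model a case body is issued only if some lane is active. */
        if (mask != 0)
            ++issued;
    }
    return issued;
}
```

          In this model the cost grows with the number of distinct case values inside one wavefront, so a wavefront whose lanes all pick the same case issues a single body.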

            • Large switch statement performance
              keldor314

              The switch case is uniform across the wavefront, so divergence is not an issue.  Basically, the huge switch is something of an uberkernel: it is functionally identical to the small switch but able to run on different data sets without recompilation, whereas the small switch is tailored to the specific data.  In theory they should run at the same speed, apart from the switch overhead...

            • Large switch statement performance
              MicahVillmow
              If the case is switch(const), then the compiler will correctly optimize away the dead paths. However, in the case of switch(dynamic), each case will exist in the resulting code. Super kernels are not optimal and it would be better if you compile the 100 kernels separately.
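
              A small C sketch of the switch(const) versus switch(dynamic) distinction. `VARIATION` is a hypothetical build-time constant (for an OpenCL kernel it could be supplied through a `-D` build option), and `op_a`, `op_b`, `op_c` are placeholder case bodies, not functions from the thread.

```c
/* VARIATION is a hypothetical build-time constant, e.g. -DVARIATION=2. */
#ifndef VARIATION
#define VARIATION 2
#endif

static float op_a(float x) { return x + 1.0f; }
static float op_b(float x) { return x * 2.0f; }
static float op_c(float x) { return x * x; }

/* switch(const): the selector is known at compile time, so an optimizing
 * compiler is free to drop the unreachable cases. */
float apply_specialized(float x)
{
    switch (VARIATION) {
    case 0:  return op_a(x);
    case 1:  return op_b(x);
    case 2:  return op_c(x);
    default: return x;
    }
}

/* switch(dynamic): the selector is only known at run time, so the code
 * for every case has to stay in the compiled kernel. */
float apply_generic(int variation, float x)
{
    switch (variation) {
    case 0:  return op_a(x);
    case 1:  return op_b(x);
    case 2:  return op_c(x);
    default: return x;
    }
}
```

              Both functions return the same value for the same selector; the difference is only in how much code the compiler has to keep.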
                • Large switch statement performance
                  ryta1203

                   

                  Originally posted by: MicahVillmow If the case is switch(const), then the compiler will correctly optimize away the dead paths. However, in the case of switch(dynamic), each case will exist in the resulting code. Super kernels are not optimal and it would be better if you compile the 100 kernels separately.


                  I'm not sure what you mean by "Super Kernels", but in all the "merging" I've done I've rarely seen a performance loss. In fact, I've gotten as much as a 1.58x speedup with just two kernels.

                  As for the OP: why are you using a switch? Are the cases one-line statements? If so, use select() or predication. Also, with such a large switch statement you probably aren't getting optimal VLIW packing, which you could get by going the select()/predication route.
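
                  A rough C sketch of what replacing a small switch with select()-style predication looks like. `select_f` is a hypothetical scalar stand-in that follows the scalar convention of OpenCL's `select(a, b, c)` (b when c is non-zero, otherwise a); the two one-line case bodies are invented for illustration.

```c
/* Hypothetical scalar stand-in for OpenCL's select(a, b, c):
 * yields b when c is non-zero, otherwise a. */
static float select_f(float a, float b, int c)
{
    return c ? b : a;
}

/* Branching version: one case body runs. */
float branchy(int op, float x)
{
    switch (op) {
    case 0:  return x + 1.0f;
    case 1:  return x * 2.0f;
    default: return x;
    }
}

/* Predicated version: every candidate is computed, then one is kept.
 * There is no control flow, which is what lets short bodies pack well. */
float predicated(int op, float x)
{
    float r = x;
    r = select_f(r, x + 1.0f, op == 0);
    r = select_f(r, x * 2.0f, op == 1);
    return r;
}
```

                  The trade-off is that the predicated form does the work of every case, so it only pays off when the case bodies are short.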

                    • Large switch statement performance
                      keldor314

                      The switch statement is inside a big for loop, so splitting up the kernel isn't really an option, especially since different wavefronts take different cases, though threads in a wavefront always take the same case. 

                       

                      It's also worth noting that only one case is taken - the statement is of the form

                      switch (number)
                      {
                      case 1:
                      {
                          doSomething();
                      } break;
                      case 2:
                      {
                          doSomethingElse();
                      } break;
                      ...
                      }

                       

                      This is an ideal place for a jumptable instruction....
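
                      For reference, a host-side C sketch of what a jump table amounts to: one bounds check and one indexed indirect call instead of a chain of compares. OpenCL C does not allow function pointers, so this only illustrates the idea and says nothing about what the GPU compiler actually emits; `dispatch`, `jump_table` and the two case bodies are placeholder names.

```c
#include <stddef.h>

typedef float (*case_fn)(float);

static float do_something(float x)      { return x + 1.0f; }
static float do_something_else(float x) { return x * 2.0f; }

/* Index 0 is unused because the cases in the example start at 1. */
static const case_fn jump_table[] = { NULL, do_something, do_something_else };

/* Dispatches on `number` with a single table lookup. */
float dispatch(int number, float x)
{
    size_t n = sizeof jump_table / sizeof jump_table[0];
    if (number < 1 || (size_t)number >= n)
        return x;  /* no matching case: value passes through unchanged */
    return jump_table[number](x);
}
```

                      The lookup cost is the same whether the table has 2 entries or 100.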

                        • Large switch statement performance
                          keldor314

                          Here's the code in question:

                          The actual functions called in each case add up to nearly 3000 lines of code, so predication isn't going to be very effective...

                          while (as_int(flameBuffer[varOffset])>=0)
                          {
                              switch(as_int(flameBuffer[varOffset]))
                              {
                              case 0: { outpos += flameBuffer[varOffset+1]*linear(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 1: { outpos += flameBuffer[varOffset+1]*sinusoidal(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 2: { outpos += flameBuffer[varOffset+1]*spherical(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 3: { outpos += flameBuffer[varOffset+1]*swirl(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 4: { outpos += flameBuffer[varOffset+1]*horseshoe(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 5: { outpos += flameBuffer[varOffset+1]*polar(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 6: { outpos += flameBuffer[varOffset+1]*handkerchief(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 7: { outpos += flameBuffer[varOffset+1]*heart(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 8: { outpos += flameBuffer[varOffset+1]*disc(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 9: { outpos += flameBuffer[varOffset+1]*spiral(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 10: { outpos += flameBuffer[varOffset+1]*hyperbolic(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 11: { outpos += flameBuffer[varOffset+1]*diamond(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 12: { outpos += flameBuffer[varOffset+1]*ex(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 13: { outpos += flameBuffer[varOffset+1]*julia(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 14: { outpos += flameBuffer[varOffset+1]*bent(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 15: { outpos += flameBuffer[varOffset+1]*waves(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 16: { outpos += flameBuffer[varOffset+1]*fisheye(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 17: { outpos += flameBuffer[varOffset+1]*popcorn(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 18: { outpos += flameBuffer[varOffset+1]*exponential(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 19: { outpos += flameBuffer[varOffset+1]*power(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 20: { outpos += flameBuffer[varOffset+1]*cosine(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 21: { outpos += flameBuffer[varOffset+1]*rings(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 22: { outpos += flameBuffer[varOffset+1]*fan(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 23: { outpos += flameBuffer[varOffset+1]*blob(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 5; } break;
                              case 24: { outpos += flameBuffer[varOffset+1]*pdj(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 25: { outpos += flameBuffer[varOffset+1]*fan2(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 26: { outpos += flameBuffer[varOffset+1]*rings2(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 3; } break;
                              case 27: { outpos += flameBuffer[varOffset+1]*eyefish(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 28: { outpos += flameBuffer[varOffset+1]*bubble(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 29: { outpos += flameBuffer[varOffset+1]*cylinder(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 30: { outpos += flameBuffer[varOffset+1]*perspective(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 31: { outpos += flameBuffer[varOffset+1]*noise(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 32: { outpos += flameBuffer[varOffset+1]*julian(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 33: { outpos += flameBuffer[varOffset+1]*juliascope(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 34: { outpos += flameBuffer[varOffset+1]*blur(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 35: { outpos += flameBuffer[varOffset+1]*gaussian_blur(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 36: { outpos += flameBuffer[varOffset+1]*radial_blur(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 3; } break;
                              case 37: { outpos += flameBuffer[varOffset+1]*pie(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 5; } break;
                              case 38: { outpos += flameBuffer[varOffset+1]*ngon(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 39: { outpos += flameBuffer[varOffset+1]*curl(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 40: { outpos += flameBuffer[varOffset+1]*rectangles(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 41: { outpos += flameBuffer[varOffset+1]*arch(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 42: { outpos += flameBuffer[varOffset+1]*tangent(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 43: { outpos += flameBuffer[varOffset+1]*square(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 44: { outpos += flameBuffer[varOffset+1]*rays(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 45: { outpos += flameBuffer[varOffset+1]*blade(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 46: { outpos += flameBuffer[varOffset+1]*secant(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 47: { outpos += flameBuffer[varOffset+1]*twintrian(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 48: { outpos += flameBuffer[varOffset+1]*v_cross(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 49: { outpos += flameBuffer[varOffset+1]*disc2(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 50: { outpos += flameBuffer[varOffset+1]*super_shape(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], flameBuffer[varOffset+6], flameBuffer[varOffset+7], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 8; } break;
                              case 51: { outpos += flameBuffer[varOffset+1]*flower(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 52: { outpos += flameBuffer[varOffset+1]*conic(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 53: { outpos += flameBuffer[varOffset+1]*parabola(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 54: { outpos += flameBuffer[varOffset+1]*bent2(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 55: { outpos += flameBuffer[varOffset+1]*bipolar(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 3; } break;
                              case 56: { outpos += flameBuffer[varOffset+1]*boarders(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 57: { outpos += flameBuffer[varOffset+1]*butterfly(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 58: { outpos += flameBuffer[varOffset+1]*cell(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 3; } break;
                              case 59: { outpos += flameBuffer[varOffset+1]*cpow(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 5; } break;
                              case 60: { outpos += flameBuffer[varOffset+1]*curve(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 61: { outpos += flameBuffer[varOffset+1]*edisc(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 62: { outpos += flameBuffer[varOffset+1]*elliptic(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 63: { outpos += flameBuffer[varOffset+1]*escher(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 3; } break;
                              case 64: { outpos += flameBuffer[varOffset+1]*foci(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 65: { outpos += flameBuffer[varOffset+1]*lazysusan(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], flameBuffer[varOffset+6], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 7; } break;
                              case 66: { outpos += flameBuffer[varOffset+1]*loonie(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 67: { outpos += flameBuffer[varOffset+1]*modulus(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 68: { outpos += flameBuffer[varOffset+1]*oscilloscope(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 69: { outpos += flameBuffer[varOffset+1]*polar2(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 70: { outpos += flameBuffer[varOffset+1]*popcorn2(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 5; } break;
                              case 71: { outpos += flameBuffer[varOffset+1]*scry(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 72: { outpos += flameBuffer[varOffset+1]*separation(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 73: { outpos += flameBuffer[varOffset+1]*split(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 74: { outpos += flameBuffer[varOffset+1]*splits(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 75: { outpos += flameBuffer[varOffset+1]*stripes(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 76: { outpos += flameBuffer[varOffset+1]*wedge(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 77: { outpos += flameBuffer[varOffset+1]*wedge_julia(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 78: { outpos += flameBuffer[varOffset+1]*wedge_sph(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 79: { outpos += flameBuffer[varOffset+1]*whorl(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 80: { outpos += flameBuffer[varOffset+1]*waves2(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 81: { outpos += flameBuffer[varOffset+1]*v_exp(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 82: { outpos += flameBuffer[varOffset+1]*v_log(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 83: { outpos += flameBuffer[varOffset+1]*v_sin(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 84: { outpos += flameBuffer[varOffset+1]*v_cos(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 85: { outpos += flameBuffer[varOffset+1]*v_tan(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 86: { outpos += flameBuffer[varOffset+1]*v_sec(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 87: { outpos += flameBuffer[varOffset+1]*v_csc(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 88: { outpos += flameBuffer[varOffset+1]*v_cot(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 89: { outpos += flameBuffer[varOffset+1]*v_sinh(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 90: { outpos += flameBuffer[varOffset+1]*v_cosh(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 91: { outpos += flameBuffer[varOffset+1]*v_tanh(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 92: { outpos += flameBuffer[varOffset+1]*v_sech(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 93: { outpos += flameBuffer[varOffset+1]*v_csch(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 94: { outpos += flameBuffer[varOffset+1]*v_coth(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 95: { outpos += flameBuffer[varOffset+1]*auger(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 6; } break;
                              case 96: { outpos += flameBuffer[varOffset+1]*Linear3D(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 97: { outpos += flameBuffer[varOffset+1]*sinusoidal3d(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 98: { outpos += flameBuffer[varOffset+1]*bubble3d(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 99: { outpos += flameBuffer[varOffset+1]*cylinder3d(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 100: { outpos += flameBuffer[varOffset+1]*zscale(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 101: { outpos += flameBuffer[varOffset+1]*ZCone(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 102: { outpos += flameBuffer[varOffset+1]*Spherical3D(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 103: { outpos += flameBuffer[varOffset+1]*swirl3D(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              case 104: { outpos += flameBuffer[varOffset+1]*horseshoe3D(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 4; } break;
                              case 105: { outpos += flameBuffer[varOffset+1]*pdj3D(pos,flameBuffer[varOffset+1], flameBuffer[varOffset+2], flameBuffer[varOffset+3], flameBuffer[varOffset+4], flameBuffer[varOffset+5], flameBuffer[varOffset+6], flameBuffer[varOffset+7], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 8; } break;
                              case 106: { outpos += flameBuffer[varOffset+1]*foci_3D(pos,flameBuffer[varOffset+1], r2, r, r2D2, r2D, theta, phi, psi, randStates, flameBuffer, *offset+1); varOffset += 2; } break;
                              }
                          }

                            • Large switch statement performance
                              ryta1203

                              You could still use the select() function, I would think, though I doubt the compiler is smart enough to apply the select() to every line of a function when that function gets inlined; you might have to do it manually.

                              Also, have you tried grouping the threads that take the same branch together in a wavefront? If this can't be done statically, you can do it dynamically ("on the fly"), with overhead of course, though the benefit may outweigh it.
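
                              One way the grouping could be sketched on the host, assuming each work item's case number is known before launch: a stable counting sort of work-item indices by case, so that items sharing a case end up contiguous and can be packed into the same wavefronts. `group_by_case` and its parameters are hypothetical names, not part of any SDK.

```c
#include <stdlib.h>

/* Stable counting sort of work-item indices by case number.
 * keys[i] must lie in [0, num_cases). order_out receives n indices such
 * that items with equal keys are contiguous, in their original order.
 * Returns 0 on success, -1 if the temporary allocation fails. */
int group_by_case(const int *keys, int n, int num_cases, int *order_out)
{
    int *start = calloc((size_t)num_cases + 1, sizeof *start);
    if (!start)
        return -1;

    /* Histogram, shifted by one slot so the prefix sum gives start offsets. */
    for (int i = 0; i < n; ++i)
        start[keys[i] + 1]++;
    for (int c = 0; c < num_cases; ++c)
        start[c + 1] += start[c];

    /* Scatter each index to the next free slot of its case. */
    for (int i = 0; i < n; ++i)
        order_out[start[keys[i]]++] = i;

    free(start);
    return 0;
}
```

                              The sort itself is the overhead mentioned above; it is linear in the number of work items plus the number of cases.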

                              Curious, what is your GPR usage?

                              The large switch statement's performance when compared to a smaller switch statement might have more to do with the GPR usage than the branching.

                                • Large switch statement performance
                                  keldor314

                                  As I said, all threads in a given wavefront are guaranteed to take the same branches in the switch statement.  There is no divergence inside a wavefront.

                                   

                                  I can't give you the GPR count since the kernel is presently crashing the kernel analyzer, but I can note that the scratch register count is alarmingly high (over 300).  If I only run a limited set of cases in the switch statement, without changing anything else, I get no scratch register usage, so it appears that the compiler is trying to do some sort of subexpression caching between the different cases, but going way overboard and allocating too many registers.

                                    • Large switch statement performance
                                      ryta1203

                                      Sorry, I misread the "divergence" in the above posts; I see now.

                                      If that's the case, is there no way to break this into multiple kernels?

                                      1. Putting the for loop and switch statement in host code.

                                      OR

                                      2. Repeating the for loop and switch statement (with fewer cases) in multiple kernels running over the same data?
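
                                      A host-side C sketch of option 1, with plain functions standing in for kernel launches. It assumes every data element follows the same sequence of steps, which may not hold if different wavefronts take different cases; `step`, `kernels` and `run_on_host` are illustrative names only.

```c
#include <stddef.h>

/* Each function stands in for one small, separately compiled kernel. */
typedef void (*kernel_fn)(float *data, size_t n, float weight);

static void kernel_add(float *data, size_t n, float weight)
{
    for (size_t i = 0; i < n; ++i)
        data[i] += weight;
}

static void kernel_scale(float *data, size_t n, float weight)
{
    for (size_t i = 0; i < n; ++i)
        data[i] *= weight;
}

static const kernel_fn kernels[] = { kernel_add, kernel_scale };

/* One entry of the host-side program: which kernel, and its parameter. */
typedef struct {
    int   op;
    float weight;
} step;

/* The loop and the switch live on the host; each iteration "launches"
 * one specialized kernel over the whole data set.
 * Returns 0 on success, -1 on an unknown op. */
int run_on_host(const step *steps, size_t num_steps, float *data, size_t n)
{
    size_t num_kernels = sizeof kernels / sizeof kernels[0];
    for (size_t s = 0; s < num_steps; ++s) {
        int op = steps[s].op;
        if (op < 0 || (size_t)op >= num_kernels)
            return -1;
        kernels[op](data, n, steps[s].weight);
    }
    return 0;
}
```

                                      On a real device each step would cost a kernel launch plus a pass over the data, which is the price paid for keeping each kernel small.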