5 Replies Latest reply on Feb 21, 2011 5:48 PM by Otterz

    Optimizing Conditionals




      I have a kernel that has some conditionals nested in a loop. I have been experimenting with ways to optimize the code, and I haven't been able to do better than what I am posting here. I've switched from using an if statement to the ternary operator, and saw a measurable speed up.

      I also worked on optimizing the boolean logic to result in only one conditional, and I am not sure if I can do it better. I haven't used bitwise operators prior to this, and I really need every last bit of performance as this app can take days to run on a 5870.

      (i,j,k,l,m are all unsigned integers)

      In the code, I take the conditional, which checks if m is NOT equal to i,k, and l. I do an XOR in place of the equality check, since if they are the same, XOR will return zero. Then multiplying the results guarantees that if one is zero (if m== __)  the expression will result in zero, so I check the oposite (!= 0).

      Is there a way to avoid multiplication? I tried to find a way to bitwise and and or instead, but got stumped (been a LONG time since I've done boolean algebra!)

      I would also like suggestions on optimizing the 1st nested conditional (l != i && l != k). I could do the same bitwise stuff, but I haven't yet since I want to get a feel if I am doing this correctly.

      Thanks for any suggestions



      Original Kernel: // kerrnel for 3D NDRange algorithm __kerrnel void calc_ter_i_3DNDRange(int LENGTH, __global double* ter_i, // pointer to global shared ter_i array __local double* contributions) // pointer to num_threads size local array { size_t k = get_global_id(2); size_t j = get_global_id(1); size_t i = get_global_id(0); // Number of threads in thread block (work group) size_t num_threads = get_local_size(2); size_t n = get_local_id(2); int jm; int im; int ik; int il; int ncols = LENGTH+1; double G; double tmp = 0.0; if( k != i){ for(int l = 0; l < ncols; l++){ if( l != i && l != k){ G = 1.0; for(int m = 0; m < ncols; m++){ if( m != i && m != k && m != l){ jm = j-m; im = i-m; G = G * ( (double) jm / (double) im ); } } // end M loop ik = i-k; il = i-l; tmp = tmp + (( (double) 1.0 / (double) ik)/ (double) il)*G; } // end if l != i } // end l loop } // end if k != i } Modified, where ternary_test is a double variable if( k != i){ for(int l = 0; l < ncols; l++){ if( l != i && l != k){ G = 1.0; for(int m = 0; m < ncols; m++){ ternary_test = ( (m^i)*(m^k)*(m^l) != 0 ) ? ( ((double)j-m) / ((double)i-m) ) : 1.0; G = G*ternary_test; } // end M loop ik = i-k; il = i-l; tmp = tmp + (( (double) 1.0 / (double) ik)/ (double) il)*G; } // end if l != i } // end l loop } // end if k != i

        • Optimizing Conditionals

          As far as i understand you need an OR Operation instead of AND as mentioned in the code.  Instead of multiplying to check if any of subexpressions are zero you can AND them rather than multiply. If any one is zero everything will be zeroed out and That should be fast.


          • Optimizing Conditionals


              Here are some suggestions that will help you squeeze more speed from your kernel.

            1.  Change compound conditional to a bool with separate assignment.

            if( l!=i && l!=k ) changes to

            bool cond = l!=i && l!=k; if( cond ){ ...

            similarly for the conditional on m in the inner loop.  Removing the compouind conditional from the if statement reduces the number of clauses the compiler generates.  This is a consequence of short circuiting in the conditional (as required by C semantics) and SIMD execution (in which all branches must be evaluated).

            2. Use a select statement for conditional assignments rather than enclosing them in an if block.

            For example, the inner block (if m!= i && m!=k && ... ) changes to

            bool cond = m!=i && m!=k && m!= l;
            jm = select( jm, j-m, cond ); im = select( im, i-m, cond ); etc

            3. reduce divisions, these are very expensive.  Your assignment to tmp could be

            tmp = tmp+ ( G * / ( (double) ik * (double) il ));

            Another hint, edit your kernel in the Stream Kernel Analyzer and recompile frequenctly to find out how much speedup you get from your improvements.

              • Optimizing Conditionals

                Thanks for the replies!


                I will check/profile using bool variables, but I am curious - What causes the penalty for if statements? I had just assumed it was evaluating the boolean logic, so I had looked for ways to avoid the logical operators.

                I had also corrected that division problem, and didn't remember to copy it over. I swapped from the way I had it to:

                                tmp = mad(G, 1.0/(ik*il), tmp);

                In the Kernel Analyzer that comes out slightly better than the separate operations.


                I'll look into select() too. I don't need jm, and im - they were done mainly because I was having a problem casting.


                ternary_test = ( (m^i)*(m^k)*(m^l) != 0 ) ? ( ((double)j-m) / ((double)i-m) ) : 1.0;

                is casting j to double, and performing two floating point subtractions, and a division. I had wanted to make that 2 integer subtractions, and 1 floating point division. But when I tried to cast the expressions that way I got the wrong answer.

                This is the casting I was trying (that is incorrect)

                ternary_test = ( (m^i)*(m^k)*(m^l) != 0 ) ? ( ((double)(j-m)) / ((double)(i-m)) ) : 1.0;

                So I create jm, and im so I could ensure I was doing an integer operation.

                I'd really like to know what I am doing wrong in the casting.

                  • Optimizing Conditionals

                    Evaluating the boolean computations is near enough free, given the amount of ALU horsepower on the GPU.

                    Each if test that isn't flattened into predicated instructions converts to a clause switch in the ISA code (look at what is generated by SKA). Each clause switch takes 40 or so cycles and this is extra latency that is easily covered if enough waves are running, but if you're branching heavily all waves are going through clause switches (runnong control flow code) and not executing ALU work so you end up poorly utilising the device.

                    In addition of course is the general divergence issue. Work items are not threads, after all. A wavefront is a thread, and hence a wavefront is where branching happens. Anything else is implicitly masked and because a wavefront is quite wide, wider than nvidia, say, it can be easy for a branch to end up masking rather than actually executing a real branch.

                    Between those two features are why if statements can be inefficient. 

                      • Optimizing Conditionals

                        Thanks for the explanation, makes sense!

                        I'm seeing 80-99% ALU busy currently, but I was amazed at the performance boost from switching the inner most if to the ternary operator and changing the boolean statement to XOR/Multiply (27% gain).

                        This has been my first stab at GPU computing, but I have so far seen a massive speedup over the 64 node cluster I was running this on. And if all goes well, I am hoping I can find access to a cluster with GPUs so I can merge the MPI and OpenCL codes, then I can solve a truly massive problem and maybe finish my dissertation