Benefit of unrolling non-diverging loops.

Discussion created by chrisjp on Apr 20, 2011
Latest reply on Apr 21, 2011 by Jawed

Greetings all,

I've been programming Mersenne prime trial factoring in OpenCL over the past several weeks. I've gotten to ~110 million trials per second on a 6870, and I'm hoping to improve further. I've optimized the program in a variety of ways: limiting control flow divergence, vectorizing, etc. The main body of the program is currently a while loop containing one if statement.

The if, as you can see, is nearly always taken except for the first 1-3 iterations, and increases to 32. My question is: can you estimate what cycle penalty a non-diverging 'if' will incur? And also, what sort of cycle estimate would you guess for the while loop itself? I was considering unrolling it through the preprocessor, but was hesitant.

I had read elsewhere in the forum that the latency is ~40 cycles, but wasn't sure whether that applies when there is no divergence.


while (counter < debugIN[gid] ) {
    // Incorporated into squaring.
    //if( (mersenneLocal & 0x80000000) != 0 ) {
    //    double_factor( currentFactor );
    //}

    // square the current factor.
    square_24_3_144( currentFactor, (mersenneLocal & 0x80000000) >> 31 );
    mersenneLocal = mersenneLocal << 1;

    if( compare_28_6_28_6( currentFactor, testFactor ) != 0 ) {
        modulus_144_28_3_barrett( currentFactor, muLocal, testFactor, andMasks, shiftVal, addMasks, debugInt, shiftMasks);
    }
    counter = counter + 1;
}
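
For reference, the preprocessor unrolling I had in mind would look roughly like the sketch below. It is plain C rather than the actual kernel, and step(), STEP(), STEP4(), run_rolled() and run_unrolled4() are made-up names: step() is only a small 32-bit stand-in for the square / shift / conditional-reduce body, not the real 144-bit arithmetic. The idea is to run the body in groups of four with one loop test per group, then finish any leftover iterations in a remainder loop.

```c
#include <stdint.h>

/* Hypothetical stand-in for one iteration of the loop body:
   square, optionally double (folded in via the top exponent bit),
   shift the exponent, and reduce only when needed.
   This is NOT the real kernel arithmetic. Requires test != 0. */
static inline void step(uint32_t *factor, uint32_t *mersenne, uint32_t test)
{
    uint32_t bit = (*mersenne & 0x80000000u) >> 31;
    uint64_t sq = (uint64_t)(*factor) * (uint64_t)(*factor);

    *mersenne <<= 1;

    /* The non-diverging 'if': taken on almost every iteration. */
    if (sq >= test) {
        sq %= test;
    }

    /* Doubling folded into the squaring step. */
    sq <<= bit;
    if (sq >= test) {
        sq -= test;
    }
    *factor = (uint32_t)sq;
}

/* Unrolling through the preprocessor: the body is pasted four times. */
#define STEP()  step(&factor, &mersenne, test)
#define STEP4() do { STEP(); STEP(); STEP(); STEP(); } while (0)

/* Reference version: one loop test per iteration. */
uint32_t run_rolled(uint32_t factor, uint32_t mersenne,
                    uint32_t test, uint32_t count)
{
    uint32_t counter = 0;
    while (counter < count) {
        STEP();
        counter = counter + 1;
    }
    return factor;
}

/* Unrolled by four: one loop test per four iterations, plus a
   remainder loop for counts that are not a multiple of four.
   Assumes count is small (at most 32 here), so counter + 4
   cannot wrap around. */
uint32_t run_unrolled4(uint32_t factor, uint32_t mersenne,
                       uint32_t test, uint32_t count)
{
    uint32_t counter = 0;
    while (counter + 4 <= count) {
        STEP4();
        counter += 4;
    }
    while (counter < count) {
        STEP();
        counter += 1;
    }
    return factor;
}
```

Both versions do the same work per iteration, so any difference in timing would come from the loop test and branch only. Whether that is measurable on the 6870 is exactly what I'm unsure about.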