Archives Discussions

chrisjp · ‎04-20-2011

Greetings all,

I've been programming mersenne prime trial factoring in OpenCL over the past several weeks. I've gotten to ~110 Million trials per second on a 6870, and I'm hoping to further improve. I've optimized the program in a variety of ways, limiting control flow divergence, vectorizing, etc. I currently have the main body of the program with a while loop containing one if loop.

The if, as you can see, nearly always hits except for the first 1-3 iterations, and increases to 32. My question is can you estimate what cycle penalty a non-diverging 'if' will lead to? And also, what sort of cycle estimate do you guess for the while loop. I was considering unrolling through preprocessing, but was hesitant.

I had read elsewhere in forum that the latency is ~40 cycles, but wasn't sure if that applied without divergence.

Thanks,
Chris

while (counter < debugIN[gid] ) { // Incorporated into squaring. //if( (mersenneLocal & 0x80000000) != 0 ) { // double_factor( currentFactor ); //} // square the current factor. square_24_3_144( currentFactor, (mersenneLocal & 0x80000000) >> 31 ); mersenneLocal = mersenneLocal << 1; if( compare_28_6_28_6( currentFactor, testFactor ) != 0 ) { modulus_144_28_3_barrett( currentFactor, muLocal, testFactor, andMasks, shiftVal, addMasks, debugInt, shiftMasks); } counter = counter + 1; }

himanshu_gautam · ‎04-20-2011

chrisjp,

40 cycles is a clause switch overhead. When ever you change from ALU clause to control clause this would happen. It is not related with divergence of "if". If "if" diverges we needd to execute both paths one after the other and so inefficiency crops in.

Jawed · ‎04-20-2011

Have you tried APP Kernel Analyzer? You can count the cycles of the loop and evaluate the additional cost of the IF succeeding.

Unrolling a loop is worth it when execution of each iteration is at least partially independent (i.e. they can overlap), or when the loop housekeeping is relatively expensive (e.g. if the loop body does very little).

The latency incurred by a clause switch can be hidden, just like the latency of memory operations can be hidden. In both cases latency is hidden merely by having at least 3 and preferably 6 hardware threads running per SIMD core.

The number of hardware threads per core is determined by count of registers and by the workgroup size. The programming guide explains this.

chrisjp · ‎04-20-2011

Yep, I've used the kernel analyzer and profiler. It's tough to estimate the cycle counts once optimization and all the inlining occurs. I was fairly sure my clause switching latency was hidden since I'm using a size of 256 and my total cycles didn't jive with a delay of that magnitude. I also assume AluBusy or whatever the profiler statistic is, is also suggestive of the latency being covered.

I just didn't want to unroll the while loop if I wasn't likely to see significant benefits (which in this case I don't think I am.)

Great answers, thanks!

Jawed · ‎04-20-2011

Originally posted by: chrisjp Yep, I've used the kernel analyzer and profiler. It's tough to estimate the cycle counts once optimization and all the inlining occurs.

You only want to count the ALU cycles themselves, in general. With the assumption that memory and clause-switch latencies are hidden, you don't need to account for them.

I normally find that the cycle count for an inner loop leads to 95% accuracy in performance prediction for an ALU-bound kernel.

You appear to be able to treat this code as consisting of two loops: one of 3 iterations where the IF succeeds, and one of 29 iterations where the IF fails.

If ALU Busy is 95% or more then you can be pretty sure that your kernel is actually ALU bound and not waiting for memory or clause-switches. In this case you then want to increase the parallelism of your kernel, e.g. by producing multiple results with each work item (sounds like you've already tried that) and loop unrolling.

I was fairly sure my clause switching latency was hidden since I'm using a size of 256 and my total cycles didn't jive with a delay of that magnitude.

Have you tried work group sizes of 128 and 64?

I also assume AluBusy or whatever the profiler statistic is, is also suggestive of the latency being covered.

Depends what the actual number is.

I just didn't want to unroll the while loop if I wasn't likely to see significant benefits (which in this case I don't think I am.)

Rule of thumb with GPGPU: suck it and see. There's very little you can predict. I'd describe it as experimental programming more than anything.

chrisjp · ‎04-20-2011

Yep, The ALUBusy was over 99%, I forget what it was at home.

It sure is experimental programming!

Thanks for the reply.

Jawed · ‎04-20-2011

Cool, so the only thing left to optimise is ALU utilisation. Unfortunately the compilers have a habit of producing junk code (superfluous MOVs are common), potentially inflating utilisation in a rather deceptive way.

chrisjp · ‎04-20-2011

Yeah, I've noticed a few likely unnecessary mov's. The ALU packing is around 85% so I'm hoping to remove a few additional dependencies. I was really hoping to get up to a speed comparable to the NVIDIA cards with similar FLOPS. I'll keep chugging!

Jawed · ‎04-21-2011

I can't tell from what you've posted how much control flow divergence there is in the kernel as a whole, but it may be that processing multiple results per work item is actually hurting performance due to divergence.

If there is divergence you probably want to use a work group size of 64. I can't see the benefit of a work group size of 256 if the kernel is ALU bound. Provided, of course, that it remains ALU bound for a work group size of 64. If 64 is no good then try 128.

NVidia's native hardware thread size of 32 work items is an advantage in divergent code.

One thing I'm unclear about is the effect of divergence upon ALU-utilisation. It seems likely to me that divergence has no effect on the reported utilisation. While this is strictly correct (since in physical terms "the instructions are executed but the results are trashed") it provides no way to assess the cost of divergence, which is something one would hope for from a runtime profiler...

All you can do is use the cycle count for the kernel as the basis of a performance model and experiment to see how close performance comes.

I've now seen this thread:

http://www.mersenneforum.org/showthread.php?t=13621&page=2

and maybe I'll have a look at the code you posted some time.

By the way that person diep is completely wrong in post 35. There are 0 cycles of read after write latency in AMD GPUs. There is a way to incur latency but it involves a type of register access that isn't an option with OpenCL kernels.

The clause-by-clause scheduling of AMD along with resultant-forwarding is in total contrast to the instruction-by-instruction scheduling and register access latency of NVidia, so his comparison is entirely void in this respect.

Archives Discussions

Benefit of unrolling non-diverging loops.