cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

arsenm
Adept III

Loop not executed correct number of times unless unrolled

I've run into a problem where an inner loop is not run the correct number of times unless I use a #pragma unroll on it. In what I was working on, the results were wrong and I noticed that everything was running much quicker than it should. I then added an atomic counter to get the number of times the loop actually runs, and found it was wrong. It works correctly if I force the unroll.

I've put the CL kernel and IL + ISA here

0 Likes
12 Replies
corry
Adept III

Loop not executed correct number of times unless unrolled

Ok, let me take a quick stab, just because there are 0 replies after a day.  Let me preface that with I haven't looked at the IL, because a, I think OpenCL IL is ugly (too many unnecessary definitions cluttering things up, sorry Micah, its just an opinion, and like noses, everyone has them and they all smell! 🙂 ), and b, I just don't have the time...I don't want to just leech off the forums for help with my own problems, but I have to balance things...So, you may have already looked at the IL and discounted most of what I'm going to say 🙂

First off, your test case is likely too simple. You're initializing an array inside a loop to the same constant value, and never changing it. I suspect that initialization was moved outside the loop. It would be good programming practice to *NOT* trust the optimizer, but to *HELP* it out a bit, by doing that sort of basic optimization yourself.  I realize though, this is a test case....Not trusting the optimizer, but rather helping it, or dropping to a lower level is my soapbox 🙂 

Next, xs, ys, and zs are all set to the same value. I know the optimizer is very aggressive, so I suppose there's a chance it sees you are initializing them to constants, and never modify the values in xs, ys, and zs, only ax changes. I'm not sure what it would do with that, if it's smart enough to just turn it into PosX*3*256*24 or not...Given you said you see the counter increased, I assume thats not the case.  It may be partially unrolling the loop though.  Its a simple enough case, that its entirely possible it takes

for (int k = 0; k < 256; ++k)       
{           
     /* Changing the amount of work in here seems to change the
     number of iterations which actually happen. If I reduce
     the work, it happens correctly.
     */
     ax += xs + ys + zs;
     if (i == 0)
     {
          /* At end this should be equal to 24 * 256. However
             in this case it is only 1452 */
          atomic_inc(&debug[1]);
     }
}

and turns it into

for (int k = 0; k < 64; k+=4)
{           
     /* Changing the amount of work in here seems to change the
     number of iterations which actually happen. If I reduce
     the work, it happens correctly.
     */
     ax += xs + ys + zs;
     ax += xs[k+1] + ys[k+1] + zs[k+1];
     ax += xs[k+2] + ys[k+2] + zs[k+2];
     ax += xs[k+3] + ys[k+3] + zs[k+3];
     if (i == 0)
     {
          /* At end this should be equal to 24 * 256. However
             in this case it is only 1452 */
          atomic_inc(&debug[1]);
     }
}

Though that would net a counter of 1536 😕 

There may be other optimizations at work though, especially on this test case.

Also, back on the soapbox, unless there is good reason not to unroll the loop, you probably want to...if it is long, like the above, you probably want to partially unroll it, like above, though, in the above, were it exactly what I was doing, I'd probably unroll that 16 times, just to get a good results to compares ratio going...

Anyhow, I suspect the answer lies in the IL.  Could even be an AMD bug 🙂  If I suddenly find myself with time, I'll poke into the IL a bit.  Good luck!

0 Likes
arsenm
Adept III

Loop not executed correct number of times unless unrolled

This isn't my original problem. This is my reduced test case of it; the point is it is simple. I wasn't trying to get performance. Using the #pragma unroll got me CORRECT results unlike not unrolling. This is with the actual work removed but still exhibiting the same problem.

I didn't initialize using the same thing in my original test case. I should not be getting different counts. If this was an "optimization" changing the number of times the loop happens, it is wrong. The side effecting atomic increment alone should prevent that optimization.

0 Likes
notzed
Challenger

Loop not executed correct number of times unless unrolled

Originally posted by: arsenm I've run into a problem where an inner loop is not run the correct number of times unless I use a #pragma unroll on it. In what I was working on, the results were wrong and I noticed that everything was running much quicker than it should. I then added an atomic counter to get the number of times the loop actually runs, and found it was wrong. It works correctly if I force the unroll.

 

I've put the CL kernel and IL + ISA here

 

Oddly, i'm getting all sorts of problems when i do unroll one particular routine the isa is about 20% bigger for every loop (the loop is unrolled anyway, constant bounds), and the results are incorrect.  Cayman hardware.

On the previous platform we're using they helped enough to use them, but I can just remove them now.  However, it is a bit of a dent in the faith of the compiler.

 

0 Likes
notzed
Challenger

Loop not executed correct number of times unless unrolled

It would be good programming practice to *NOT* trust the optimizer, but to *HELP* it out a bit, by doing that sort of basic optimization yourself.  I realize though, this is a test case....Not trusting the optimizer, but rather helping it, or dropping to a lower level is my soapbox 🙂


FWIW I'm not sure this is the case any more.  Any extra code one adds gives the compiler extra constraints to work with which tends on average to reduce the optimisation possibilities, tie up more registers, etc.  It is obviously compiler and a little language dependent, but i've been generally surprised about this lately.

It's pretty hard to consider this problem of arsenm's anything but a compiler bug.

 

0 Likes
corry
Adept III

Loop not executed correct number of times unless unrolled

Originally posted by: notzed
It would be good programming practice to *NOT* trust the optimizer, but to *HELP* it out a bit, by doing that sort of basic optimization yourself. I realize though, this is a test case....Not trusting the optimizer, but rather helping it, or dropping to a lower level is my soapbox 🙂


FWIW I'm not sure this is the case any more. Any extra code one adds gives the compiler extra constraints to work with which tends on average to reduce the optimisation possibilities, tie up more registers, etc. It is obviously compiler and a little language dependent, but i've been generally surprised about this lately.

It's pretty hard to consider this problem of arsenm's anything but a compiler bug.



I don't want to write a book about this, so let me say, 2 people saying they're seeing issues, makes me far more likely to believe it...(though who cares if I believe lol, but I suspect it would also make AMD more likely to believe....) though I still haven't found time to look at the IL...I'm sure there's answers to be had there. All I was trying to say before was that too simple of test cases might not expose the problem you're having, and setting a few of the variables to each other would completly rule out that possibility. Even the atomic counter could be unrolled 16x...it should then have the right value, but, all I was trying to say was to keep the test cases simple, yet complicated enough to not let the optimizer have a chance to mess things up.  I also figured though I wasn't going to put a lot of effort into it, it sat with 0 posts...and it seems somehow those who need to see posts like this, often don't when they aren't seemingly hot topics on the forums 🙂  So I figured I'd get a reply on there, which would lead to some more, which eventually would get Micah who has a lot to do with the compiler, on here and getting things fixed if there was anything to fix 🙂

As for the optimization...did I mention this is a soapbox type issue for me? Just in case I didn't, its a soapbox issue for me...I love to get up on the soapbox and preach about this...anyhow, I am usually blunt, so I'll just go with that here. You're wrong. Most of the people quoting that, eventually find their sources to be complaints about Java, and optimizing Java code, which makes trivial differences compared to the ill advised running in a VM when performance is a consideration. Yeah, optimizing Java code makes very little difference...you are already byte reversing and sign extending all your math on x86 processors, which alone is enough to add ludicrous amounts of overhead...but I digress.

I'll simply say, prove it for yourself. I had to. I believed all that, that the optimizer was omniciant, that unrolling things made code too big and killed locality, that I'd use up all the registers, etc. The optimizer is far from omniciant, and try as you might to get it to use up all the registers, it won't. GPUs I think are at least different in that respect...they'll use all the registers and more...register pigs IMO. Cache sizes today are larger than 486 system ram size. No one cares that your executable is now 500KB, not even the hardware, unrolling is good, and locality takes some serious effort to break....(Bad OO programmers manage to all the time though 🙂 )

Like I said though, prove it for yourself. Write up test cases, compare things like say a 32 case if/else vs a switch. To test the differnce though, keep the bodies small. Compare loop unrolling, I tested partial unrolling to 128, which intel advised against in their optimization manual, saying 16 was the number. Compare instrinsics like bswap to logical operations...and the list goes on. It actually becomes fun to see how the optimizer is going to ruin a section of code, written one way, and how amazing its going to do on functionally identical code written the "right way". Also, don't just time it, get a disassembler, and look at what it actually produced, note the registers that sit unused, meanwhile things getting shifted on and off the stack....yeah....brilliant omniciant optimizer right?

I was shocked...I had figured a lot of this was easy for the optimizer to pick up, because its easy for us to pick up....it's actually a very difficult problem, and so while I trash optimizers a lot, I have the utmost respect for their authors. Its a lot like the XKCD entitled "Physicists" (http://xkcd.com/793/) Yeah, the problem seems simple, and you can just model it after x, and add contraints y, and what's the big deal, why can't it just do everything? Truth is, its not easy, this is why people need to shift their focus from trusting the optimizer to helping the optimizer.

Anyhow, just in case it didn't come through clearly, try it for yourself. If I get bored, maybe I'll generate some cases, and post them somewhere for everyone to see...but given I tried this with MSVC 2008 SP1, a fairly recent optimizer, and said to be a quite good one, I'm 100% certain you will see the same thing I did. Thus my insistance, try this for yourself, don't take my word for it!

Ok, down off the soapbox...Try to keep the rotten tomato throwing to a minimum 🙂

0 Likes
timattox
Adept I

Loop not executed correct number of times unless unrolled

Originally posted by: notzed

It's pretty hard to consider this problem of arsenm's anything but a compiler bug.



Yes, I concur.  And I will further speculate that if he tried his code under the 11.5 Catalyst driver, his problem would go away.  Catalyst 11.6 introduced a bug that I described here:

"Swizzle bug in Catalyst 11.6 through 11.9"

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=155209

And, for kicks, I tried arsenm's #pragma unroll workaround, and it worked around my swizzle problem as well.  It also changed the performance of my code, which I'm still investigating. (Looks like it sped things up on some GPU models, and slowed things down on others.... benchmarks are still running...)

0 Likes
corry
Adept III

Loop not executed correct number of times unless unrolled

Originally posted by: timattox
Originally posted by: notzed

It's pretty hard to consider this problem of arsenm's anything but a compiler bug.



Yes, I concur.  And I will further speculate that if he tried his code under the 11.5 Catalyst driver, his problem would go away.  Catalyst 11.6 introduced a bug that I described here:

"Swizzle bug in Catalyst 11.6 through 11.9"

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=155209

And, for kicks, I tried arsenm's #pragma unroll workaround, and it worked around my swizzle problem as well.  It also changed the performance of my code, which I'm still investigating. (Looks like it sped things up on some GPU models, and slowed things down on others.... benchmarks are still running...)



Might I suggest then, doing AMD's work for them and characterizing the OpenCL->IL bug exactly?  In other words, do exactly what I keep saying I'll do when I get time, go over the IL, spot the exact manifestation of the bug, and report if some input a is given, IL output B is given, which is wrong. 

I generally disagree with them always asking for test cases, specific examples, etc, but the one thing it does do is speed the process up a bit.  (Especially given their 3 month development cycle...)I find it odd though we haven't heard from AMD on this thread, or your swizzle thread for that matter...

0 Likes
MicahVillmow
Staff
Staff

Loop not executed correct number of times unless unrolled

Sorry for missing this, I've been busy recently and have not paid as much attention to the forum as I would like. When I run the kernel with an internal tool that throws random data at it, I correctly get 0x18 and 0x1800 in the first two slots of debug.
0 Likes
notzed
Challenger

Loop not executed correct number of times unless unrolled

Before I head wildly OT: I removed all the #pragma unroll's from my application and that fixed many problems.  The first routine I noticed was a simple separable convolution: a simple enough non-trivial test case.

I'm coffeed up and a bit under-slept so here goes ...

 

As for the optimization...did I mention this is a soapbox type issue for me? Just in case I didn't, its a soapbox issue for me...I love to get up on the soapbox and preach about this...anyhow, I am usually blunt, so I'll just go with that here. You're wrong.

 

Umm ok.  Calm down mate!  I know exactly where you're coming from, so there's no need to get so worked up.  I really have 'been there, done that' with this; far more than is sensible.

If a #pragma unroll gives you 10% improvement in a given loop for free, what's wrong with using it?  (and remember the OP had a bug where his code only worked with it set).

I'd written a long response but I might leave it for my blog, maybe.

I've been doing this stuff for more than 20 years and I've already tried it all. The only programming books I own are K&R's ANSI C, and Knuths TAOCP and given the latter uses extensive self-modifying assembly to demonstrate algorithms it's hardly one to quote about trusting the compiler.

Just chillax, and show a bit of respect: there's no need to go around calling people 'wrong' for sharing an opinion based on their own (extensive) experiences.

 

0 Likes