12 Replies Latest reply on Oct 26, 2011 2:48 PM by Marix

    Loop not executed correct number of times unless unrolled

    arsenm

      I've run into a problem where an inner loop is not run the correct number of times unless I use a #pragma unroll on it. In what I was working on, the results were wrong and I noticed that everything was running much quicker than it should. I then added an atomic counter to get the number of times the loop actually runs, and found it was wrong. It works correctly if I force the unroll.

      I've put the CL kernel and IL + ISA here

        • Loop not executed correct number of times unless unrolled
          corry

          Ok, let me take a quick stab, just because there are 0 replies after a day.  Let me preface that with I haven't looked at the IL, because a, I think OpenCL IL is ugly (too many unnecessary definitions cluttering things up, sorry Micah, its just an opinion, and like noses, everyone has them and they all smell! :) ), and b, I just don't have the time...I don't want to just leech off the forums for help with my own problems, but I have to balance things...So, you may have already looked at the IL and discounted most of what I'm going to say :)

          First off, your test case is likely too simple. You're initializing an array inside a loop to the same constant value, and never changing it. I suspect that initialization was moved outside the loop. It would be good programming practice to *NOT* trust the optimizer, but to *HELP* it out a bit, by doing that sort of basic optimization yourself.  I realize though, this is a test case....Not trusting the optimizer, but rather helping it, or dropping to a lower level is my soapbox :) 

          Next, xs, ys, and zs are all set to the same value. I know the optimizer is very aggressive, so I suppose there's a chance it sees you are initializing them to constants, and never modify the values in xs, ys, and zs, only ax changes. I'm not sure what it would do with that, if it's smart enough to just turn it into PosX*3*256*24 or not...Given you said you see the counter increased, I assume thats not the case.  It may be partially unrolling the loop though.  Its a simple enough case, that its entirely possible it takes

          for (int k = 0; k < 256; ++k)       
          {           
               /* Changing the amount of work in here seems to change the
               number of iterations which actually happen. If I reduce
               the work, it happens correctly.
               */
               ax += xs[k] + ys[k] + zs[k];
               if (i == 0)
               {
                    /* At end this should be equal to 24 * 256. However
                       in this case it is only 1452 */
                    atomic_inc(&debug[1]);
               }
          }

          and turns it into

          for (int k = 0; k < 64; k+=4)
          {           
               /* Changing the amount of work in here seems to change the
               number of iterations which actually happen. If I reduce
               the work, it happens correctly.
               */
               ax += xs[k] + ys[k] + zs[k];
               ax += xs[k+1] + ys[k+1] + zs[k+1];
               ax += xs[k+2] + ys[k+2] + zs[k+2];
               ax += xs[k+3] + ys[k+3] + zs[k+3];
               if (i == 0)
               {
                    /* At end this should be equal to 24 * 256. However
                       in this case it is only 1452 */
                    atomic_inc(&debug[1]);
               }
          }

          Though that would net a counter of 1536 :-/ 

          There may be other optimizations at work though, especially on this test case.

          Also, back on the soapbox, unless there is good reason not to unroll the loop, you probably want to...if it is long, like the above, you probably want to partially unroll it, like above, though, in the above, were it exactly what I was doing, I'd probably unroll that 16 times, just to get a good results to compares ratio going...

          Anyhow, I suspect the answer lies in the IL.  Could even be an AMD bug :)  If I suddenly find myself with time, I'll poke into the IL a bit.  Good luck!

            • Loop not executed correct number of times unless unrolled
              arsenm

              This isn't my original problem. This is my reduced test case of it; the point is it is simple. I wasn't trying to get performance. Using the #pragma unroll got me CORRECT results unlike not unrolling. This is with the actual work removed but still exhibiting the same problem.

              I didn't initialize using the same thing in my original test case. I should not be getting different counts. If this was an "optimization" changing the number of times the loop happens, it is wrong. The side effecting atomic increment alone should prevent that optimization.

              • Loop not executed correct number of times unless unrolled
                notzed

                 

                It would be good programming practice to *NOT* trust the optimizer, but to *HELP* it out a bit, by doing that sort of basic optimization yourself.  I realize though, this is a test case....Not trusting the optimizer, but rather helping it, or dropping to a lower level is my soapbox :)


                FWIW I'm not sure this is the case any more.  Any extra code one adds gives the compiler extra constraints to work with which tends on average to reduce the optimisation possibilities, tie up more registers, etc.  It is obviously compiler and a little language dependent, but i've been generally surprised about this lately.

                It's pretty hard to consider this problem of arsenm's anything but a compiler bug.

                 

                  • Loop not executed correct number of times unless unrolled
                    corry

                     

                    Originally posted by: notzed
                    It would be good programming practice to *NOT* trust the optimizer, but to *HELP* it out a bit, by doing that sort of basic optimization yourself. I realize though, this is a test case....Not trusting the optimizer, but rather helping it, or dropping to a lower level is my soapbox :)


                    FWIW I'm not sure this is the case any more. Any extra code one adds gives the compiler extra constraints to work with which tends on average to reduce the optimisation possibilities, tie up more registers, etc. It is obviously compiler and a little language dependent, but i've been generally surprised about this lately.

                    It's pretty hard to consider this problem of arsenm's anything but a compiler bug.



                    I don't want to write a book about this, so let me say, 2 people saying they're seeing issues, makes me far more likely to believe it...(though who cares if I believe lol, but I suspect it would also make AMD more likely to believe....) though I still haven't found time to look at the IL...I'm sure there's answers to be had there. All I was trying to say before was that too simple of test cases might not expose the problem you're having, and setting a few of the variables to each other would completly rule out that possibility. Even the atomic counter could be unrolled 16x...it should then have the right value, but, all I was trying to say was to keep the test cases simple, yet complicated enough to not let the optimizer have a chance to mess things up.  I also figured though I wasn't going to put a lot of effort into it, it sat with 0 posts...and it seems somehow those who need to see posts like this, often don't when they aren't seemingly hot topics on the forums :)  So I figured I'd get a reply on there, which would lead to some more, which eventually would get Micah who has a lot to do with the compiler, on here and getting things fixed if there was anything to fix :)

                    As for the optimization...did I mention this is a soapbox type issue for me? Just in case I didn't, its a soapbox issue for me...I love to get up on the soapbox and preach about this...anyhow, I am usually blunt, so I'll just go with that here. You're wrong. Most of the people quoting that, eventually find their sources to be complaints about Java, and optimizing Java code, which makes trivial differences compared to the ill advised running in a VM when performance is a consideration. Yeah, optimizing Java code makes very little difference...you are already byte reversing and sign extending all your math on x86 processors, which alone is enough to add ludicrous amounts of overhead...but I digress.

                    I'll simply say, prove it for yourself. I had to. I believed all that, that the optimizer was omniciant, that unrolling things made code too big and killed locality, that I'd use up all the registers, etc. The optimizer is far from omniciant, and try as you might to get it to use up all the registers, it won't. GPUs I think are at least different in that respect...they'll use all the registers and more...register pigs IMO. Cache sizes today are larger than 486 system ram size. No one cares that your executable is now 500KB, not even the hardware, unrolling is good, and locality takes some serious effort to break....(Bad OO programmers manage to all the time though :) )

                    Like I said though, prove it for yourself. Write up test cases, compare things like say a 32 case if/else vs a switch. To test the differnce though, keep the bodies small. Compare loop unrolling, I tested partial unrolling to 128, which intel advised against in their optimization manual, saying 16 was the number. Compare instrinsics like bswap to logical operations...and the list goes on. It actually becomes fun to see how the optimizer is going to ruin a section of code, written one way, and how amazing its going to do on functionally identical code written the "right way". Also, don't just time it, get a disassembler, and look at what it actually produced, note the registers that sit unused, meanwhile things getting shifted on and off the stack....yeah....brilliant omniciant optimizer right?

                    I was shocked...I had figured a lot of this was easy for the optimizer to pick up, because its easy for us to pick up....it's actually a very difficult problem, and so while I trash optimizers a lot, I have the utmost respect for their authors. Its a lot like the XKCD entitled "Physicists" (http://xkcd.com/793/) Yeah, the problem seems simple, and you can just model it after x, and add contraints y, and what's the big deal, why can't it just do everything? Truth is, its not easy, this is why people need to shift their focus from trusting the optimizer to helping the optimizer.

                    Anyhow, just in case it didn't come through clearly, try it for yourself. If I get bored, maybe I'll generate some cases, and post them somewhere for everyone to see...but given I tried this with MSVC 2008 SP1, a fairly recent optimizer, and said to be a quite good one, I'm 100% certain you will see the same thing I did. Thus my insistance, try this for yourself, don't take my word for it!

                    Ok, down off the soapbox...Try to keep the rotten tomato throwing to a minimum :)

                      • Loop not executed correct number of times unless unrolled
                        notzed

                        Before I head wildly OT: I removed all the #pragma unroll's from my application and that fixed many problems.  The first routine I noticed was a simple separable convolution: a simple enough non-trivial test case.

                        I'm coffeed up and a bit under-slept so here goes ...

                         

                        As for the optimization...did I mention this is a soapbox type issue for me? Just in case I didn't, its a soapbox issue for me...I love to get up on the soapbox and preach about this...anyhow, I am usually blunt, so I'll just go with that here. You're wrong.

                         

                        Umm ok.  Calm down mate!  I know exactly where you're coming from, so there's no need to get so worked up.  I really have 'been there, done that' with this; far more than is sensible.

                        If a #pragma unroll gives you 10% improvement in a given loop for free, what's wrong with using it?  (and remember the OP had a bug where his code only worked with it set).

                        I'd written a long response but I might leave it for my blog, maybe.

                        I've been doing this stuff for more than 20 years and I've already tried it all. The only programming books I own are K&R's ANSI C, and Knuths TAOCP and given the latter uses extensive self-modifying assembly to demonstrate algorithms it's hardly one to quote about trusting the compiler.

                        Just chillax, and show a bit of respect: there's no need to go around calling people 'wrong' for sharing an opinion based on their own (extensive) experiences.

                         

                          • Loop not executed correct number of times unless unrolled
                            corry

                             

                            Originally posted by: notzed

                            Umm ok. Calm down mate! I know exactly where you're coming from, so there's no need to get so worked up. I really have 'been there, done that' with this; far more than is sensible.



                             

                            This is why someone needs to come up with some sort of proper text accenting system, better and more acceptable than emoticons. You read that with accenting on the text as though I was "worked up". Anything but! I had figured use of the word soapbox would mean something different, but given your use of the word mate, I suppose you're not from the US? Though chillax would make me think California, either way, perhaps it means something different where you come from. Out here its kind of a joke saying this is an important issue to me, and I talk about it alot, but people rarely listen, and though sometimes it may seem like I'm preaching, and therefore would care with religious fervor, the very fact that I bring out the word soapbox, means I'm lauging at myself for it, and have come to grips with it being unpopular, and so I'm anything but worked up. Yes, thats a long explanation for a single word, but that's the meaning it carries around here. I also tried to indicate it was a joke by repeating it in such an obvious way with the pattern "statement", did I mention "statement", because if I didn't, "statement".  That pattern was used by comedians, and US sitcoms, so its supposed to have a humorous connotation, thus the intentional use....so relax and cheer up! Not everyone on internet forums are angry! I for one am usually happy and very easy going....only when I'm asking my own questions, and getting stonewalled with marketing type answers do I ever get "worked up", and I usually end up using words like "frustrated" to indicate :) When you read things, try not to put the negative, snippy, snarky, or otherwise negetive inflections on things unless its clear the person is being very negetive. I tried my hardest to show that was not the tone I was using (see description of connotation of soapbox, and the repeating pattern again for how), but seems, I still failed! Worst case is you think people you may never actually meet are friendlier than they are....thats not such a bad thing! Perhaps I've been duped with highly negetive postings, taken positivly. Does it have any effect on me? Nope!

                            As for "wrong", you applied it broadly to yourself, rather than how I applied it to your making of the statement saying this is not the case anymore, to my statement of you need to help the optimizer rather than trust the optimizer. Thus the inclusion of the quote. That statement was, is, and will continue to be wrong for some time, and thats not an opinion, thats fact. It's quite easy to write bad (bad as in slow) C/C++ code. I wasn't going to sugarcoat that, despite my not being "worked up", and I never will. As I said, its a soapbox issue for me, with all the meaning given in the first paragraph, so I will continue to "preach" it, and I *KNOW* what I'm saying is factual, not subjective, not an opinion.  I have code for the proof, but it was written for the company, on company time, so the company owns it, so I can't just post it, though perhaps if I find some more time somehow, I'll start an optimization blog...

                            See, this is why we need better text accenting schemes...all that *just* to explain how what I wrote should have been read, and how I read what was written...so much wasted text...

                            More on topic, I never argued against using pragma unroll, merely stated you shouldn't trust it.  Remember, in the C spec, pragma unroll is merely a hint, just like inline. OpenCL may be different, but given it's C like nature, I doubt it. So even this workaround for the stated problem is not necessarily a workaround should the optimizer (probably wrongly) decide not to unroll the loop. (Though AMD's optimizer is very aggressive as I've been stating, so chances are pretty good it will listen, or even just always listen reguardless, but thats internal details, we only know what we're told)  I just want everyone to understand that it, nor other directives, optimization settings, or the optimizer itself are magic black boxes that through tremendous heat and pressure turn dung into diamonds. At best, I'd say it managaes http://www.changingworldtech.com 's TCP turning dung into crude oil, but give it bad inputs and it will turn dung into dung.

                            Since you have a little background...here's a little on mine...I tried starting programming at age 5 on a commodore 64, but my brain wasn't really mature enough, however, I still showed interest, and by 8, I had my dad's old 286 with borland C++ 2.0, and a book on programming, and off I went. Started college classes at age 13. Always focused on video games which is synonomous with optimization. I could list the books I have on the topic, but it would make this far too long. I currently work at a job where optimization is a function I perform, though most of my previous jobs, it was still appreciated, and I'm at least decent at it. I bet Michael Abrash could sqeeze another 10% out my code, but that dude is something else...No, I didn't go into the game industry, so you won't see any Corrys in the credits, (or if you do, its not me).  I decided $20K less/year, 80+ hour work weeks, and higher stress was just downright stupid compared to the defense industry where I had a job offer....I also have interests in theoretical physics, mechanical engineering, electrical engineering, and deriviatives of combining all of the above, enough that I have a small research company attempting to build/experiment with one such derivative idea. Thus my always being way too busy!

                            Anyhow, keep it positive, someone dig through the IL, for the OP, Micah said he can't reproduce so either its fixed internally, or something else is up. There's a chance tonight I may get a chance to look at it if no one else does...I've got a lot to do...

                        • Loop not executed correct number of times unless unrolled
                          timattox

                           

                          Originally posted by: notzed

                          It's pretty hard to consider this problem of arsenm's anything but a compiler bug.



                          Yes, I concur.  And I will further speculate that if he tried his code under the 11.5 Catalyst driver, his problem would go away.  Catalyst 11.6 introduced a bug that I described here:

                          "Swizzle bug in Catalyst 11.6 through 11.9"

                          http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=155209

                          And, for kicks, I tried arsenm's #pragma unroll workaround, and it worked around my swizzle problem as well.  It also changed the performance of my code, which I'm still investigating. (Looks like it sped things up on some GPU models, and slowed things down on others.... benchmarks are still running...)

                            • Loop not executed correct number of times unless unrolled
                              corry

                               

                              Originally posted by: timattox
                              Originally posted by: notzed

                              It's pretty hard to consider this problem of arsenm's anything but a compiler bug.



                              Yes, I concur.  And I will further speculate that if he tried his code under the 11.5 Catalyst driver, his problem would go away.  Catalyst 11.6 introduced a bug that I described here:

                              "Swizzle bug in Catalyst 11.6 through 11.9"

                              http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=155209

                              And, for kicks, I tried arsenm's #pragma unroll workaround, and it worked around my swizzle problem as well.  It also changed the performance of my code, which I'm still investigating. (Looks like it sped things up on some GPU models, and slowed things down on others.... benchmarks are still running...)



                              Might I suggest then, doing AMD's work for them and characterizing the OpenCL->IL bug exactly?  In other words, do exactly what I keep saying I'll do when I get time, go over the IL, spot the exact manifestation of the bug, and report if some input a is given, IL output B is given, which is wrong. 

                              I generally disagree with them always asking for test cases, specific examples, etc, but the one thing it does do is speed the process up a bit.  (Especially given their 3 month development cycle...)I find it odd though we haven't heard from AMD on this thread, or your swizzle thread for that matter...

                                • Loop not executed correct number of times unless unrolled
                                  timattox

                                   

                                  Originally posted by: corryMight I suggest then, doing AMD's work for them and characterizing the OpenCL->IL bug exactly?


                                  Actually, as best I can recall (I'm away from my GPU workstation today), the swizzle/loop-unrolling bug I encountered was not going from OpenCL to IL, but instead the bug manifested itself going from IL to ISA.  But to be honest, I didn't look much at the IL.  I speculate that the the "pragma unroll" bypasses the bug since by the time the IL->ISA compiler in Catalyst is working on my code, the four loop bodies and the swizzles in them, have all been directly listed out four times.  Without the "pragma unroll", the IL->ISA compiler is doing it's own loop unrolling and messing up keeping track of the swizzle.  (Oh, and the swizzle is being fully optimzed away and does not appear in the ISA file, it's simply changing which field in the 4 element vector is being worked on per loop body.)

                                  On a side note, AMD did respond to my bug report support ticket, but I've not heard back if they could reproduce it or if it's fixed yet, etc.

                                  10/26/2011 update: At AMD's request, I submitted a full code to reproduce the bug, along with my test results on various GPUs, Catalyst versions, and SDK versions.  Hopefully AMD Catalyst developers can make sense of it and get this fixed.  Actually, I do hope this is just one bug, and not multiple bugs that have a similar workaround.

                                    • Loop not executed correct number of times unless unrolled
                                      Marix

                                      I am also experiencing this bug with APP 2.5 in my application. If required I could try to also create a sample code showing this.

                                      Some slight systematic I saw was that it seems to get the correct number of iterations without the pragma if the compile time defined iteration count is something like 4 or 8, but with odd numbers - in my case 13 - it seems to get off track. Specifying "#pragma unroll 13" fixes it in that case.

                            • Loop not executed correct number of times unless unrolled
                              notzed

                               

                              Originally posted by: arsenm I've run into a problem where an inner loop is not run the correct number of times unless I use a #pragma unroll on it. In what I was working on, the results were wrong and I noticed that everything was running much quicker than it should. I then added an atomic counter to get the number of times the loop actually runs, and found it was wrong. It works correctly if I force the unroll.

                               

                              I've put the CL kernel and IL + ISA here

                               

                              Oddly, i'm getting all sorts of problems when i do unroll one particular routine the isa is about 20% bigger for every loop (the loop is unrolled anyway, constant bounds), and the results are incorrect.  Cayman hardware.

                              On the previous platform we're using they helped enough to use them, but I can just remove them now.  However, it is a bit of a dent in the faith of the compiler.

                               

                              • Loop not executed correct number of times unless unrolled
                                MicahVillmow
                                Sorry for missing this, I've been busy recently and have not paid as much attention to the forum as I would like. When I run the kernel with an internal tool that throws random data at it, I correctly get 0x18 and 0x1800 in the first two slots of debug.