12 Replies Latest reply on Jul 6, 2009 3:41 PM by dgilmore

    BUG: I/O handling after read past eof

    tad@altair.com
      I/O bug, hangs exec, Ctrl-C disabled.

      Attempting to write after the following code

       1000 read ( lunitqa, 2000, end=3000 ) env
       2000 format(a100)
            goto 1000
       3000 continue

      results in the following error:

      ===================
      lib-4095 : UNRECOVERABLE library error
        Unable to find error message (check NLSPATH, file lib.cat)

      Encountered during a sequential formatted WRITE to unit 13
      Fortran unit 13 is connected to a sequential formatted text file:
        "filename"
       Current format:  7000 FORMAT( 1x, a,a,a, i10 )
                                  ^
      Signal 6 :: SIGABRT
      ===================

      After that the code sits idle, using no CPU cycles, and Ctrl-C does not work. Ctrl-Z followed by kill -9 %1 is required to stop it.

      The above sequence works fine on all platforms (Unix/Linux/Windows/Mac) using g77/gfortran and commercial compilers. Even if it really is incorrect usage, the runtime's error handling is unacceptable.

      regards

      Tadeusz

        • BUG: I/O handling after read past eof
          dgilmore

          Sorry about that, I'll try to get back soon with a workaround for this issue.

          Doug

          • BUG: I/O handling after read past eof
            tad@altair.com

            Hi Doug,

            I found the workaround - adding backspace fixes the state of the i/o package, and still leaves the file at the end.
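
            A minimal sketch of this workaround, not the actual application code: the unit number, file name, and written values are hypothetical, and iostat= is used in place of the end= label from the first post. As far as I know the Fortran standard does not allow a sequential data transfer while the unit is positioned after the endfile record, so a BACKSPACE (or REWIND) is needed before the next WRITE in any case.

            ```fortran
            program eof_workaround
              implicit none
              integer, parameter :: lunit = 13      ! hypothetical unit number
              character(len=100) :: env
              integer :: ios

              ! Hypothetical scratch file, created here so the sketch is self-contained.
              open (lunit, file='scratch.txt', status='replace')
              write (lunit, '(a)') 'first line'
              rewind (lunit)

              ! Read every record; leave the loop on end-of-file (or on an error).
              do
                 read (lunit, '(a100)', iostat=ios) env
                 if (ios /= 0) exit
              end do

              ! Workaround from the post: step back over the endfile record so the
              ! unit is positioned at the end of the data, then write.
              backspace (lunit)
              write (lunit, '(1x,a,i10)') 'appended after eof, value = ', 42
              close (lunit)
            end program eof_workaround
            ```

            After the BACKSPACE the unit sits just before the endfile record, so the WRITE appends a new last record instead of overwriting data.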

            But, I have two more strange failures, and both related to i/o.

            I am trying to port a large FE code to this compiler. It has already passed virtually all QA and actually beats Intel-compiled execs by up to 10% on speed when running on an AMD 64 CPU, so it is worth pursuing, but these two failures are quite frustrating and difficult to figure out. To be sure, this is mixed F/C/C++ code, and this may expose some unexpected issues in the i/o package...

            Thanks

              • BUG: I/O handling after read past eof
                dgilmore

                Good to hear that you're making progress.  I attached the compressed file for the message catalog for the Fortran I/O library.  I also attached an example of how to use it.

                Note that you should not set NLSPATH when compiling, otherwise the Fortran compiler will immediately fail.

                Doug


                $ cat crfile.f90
                program foo
                  open (10, FILE='foo.dat', STATUS='new')
                end program foo
                $ openf90 crfile.f90
                $ NLSPATH=/tmp/dgilmore/lib/lib.cat ./a.out
                $ NLSPATH=/tmp/dgilmore/lib/lib.cat ./a.out

                lib-4051 : UNRECOVERABLE library error
                  The file must not exist prior to OPEN if STATUS is 'NEW'.

                Encountered during an OPEN of unit 10
                Fortran unit 10 is not connected
                Aborted
                $ ./a.out

                lib-4051 : UNRECOVERABLE library error
                  Unable to find error message (check NLSPATH, file lib.cat)

                Encountered during an OPEN of unit 10
                Fortran unit 10 is not connected
                Aborted
                $

                  • BUG: I/O handling after read past eof
                    dgilmore

                    Strange -- the attachment disappeared, I just reposted it.

                    Doug

                    • BUG: I/O handling after read past eof
                      tad@altair.com

                      Doug,

                      Thanks for the lib.cat file. Why is it not part of the distro? Is it because of NLSPATH? Maybe it would be possible to provide a text document with the errors?

                      The two bugs turned out to be easier to figure out than expected:

                      Case 1 is caused by overoptimization (the debug exec passes QA). I'll figure out which file it is - should I send it to you for debugging?

                      Case 2 is simple with lib.cat:

                      lib-4211 : UNRECOVERABLE library error
                        A WRITE operation tried to write a record that was too long.

                      Encountered during a sequential formatted WRITE to unit 68
                      Fortran unit 68 is connected to a sequential formatted text file: "run.hist"
                       Current format:   901 FORMAT(i8,10000e14.6)

                                                            ^
                      Thanks again!
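
                      A hedged sketch of one way around lib-4211, assuming the failure comes from the unit's maximum record length (this is a guess, not something confirmed in this thread). For a sequential formatted file, recl= on OPEN sets the maximum record length in characters. The unit number, file name, and format are taken from the error message; the data values are made up.

                      ```fortran
                      program long_record
                        implicit none
                        integer, parameter :: lun = 68        ! unit number from the error message
                        integer, parameter :: n = 10000       ! repeat count in format 901
                        ! One full record needs 8 characters for i8 plus 14 for each e14.6 field.
                        integer, parameter :: reclen = 8 + 14*n
                        real :: v(n)
                        integer :: i

                        v = 1.0                               ! made-up data
                        ! Request a record length large enough for the longest possible line.
                        open (lun, file='run.hist', status='replace', recl=reclen)
                        write (lun, 901) 1, (v(i), i = 1, n)
                      901 format(i8,10000e14.6)
                        close (lun)
                      end program long_record
                      ```

                      If fewer values are written per record, a smaller recl= is enough. The value above is only the upper bound implied by the format.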

                        • BUG: I/O handling after read past eof
                          dgilmore

                          > Doug,
                          >
                          > thanks for the lib.cat file - why it is not part
                          > of the distro ? Is it because of NLSPATH
                          Right, there is some initialization in the Fortran startup that we haven't sorted out yet.
                          >
                          > Maybe it would be possible to provide text
                          > document with errors ?
                          I attached a text file for the messages along with the program that extracted them.
                          >
                          > The two bugs turned out to be easier to figure out than not:
                          >
                          > Case 1 is because of overoptimization (debug
                          > exec passes QA). I'll figure out which file it
                          > is - should I send it for debugging to you ?
                          If you could, yes.  BTW, how does this failure manifest itself?  Is it a program fault?

                          Doug
                          >
                          > Case 2 is simple with lib.cat:
                          > ...
                          > Tadeusz 

                            • BUG: I/O handling after read past eof
                              tad@altair.com

                              Doug,

                              Sorry for the delay, I put your compiler on the back burner for a while.

                              Anyway, the bug resulted in a wrong counter value, although at first glance it could have been, e.g., a wrong read from a file (it was associated with the date from a file). But the routine itself is a high-level driver routine, and it might not be easy for you to track down the issue. The routine does not have do-loops, just high-level if/then decisions and various calls. I can send it to you, but I will need to know what kind of NDA we can have.

                              In the meantime another bug appeared (i.e. after minor changes to code which already worked OK). The result is SIGFPE and a meaningless gdb trace:

                              Program received signal SIGFPE, Arithmetic exception.
                              _sd2udee () at ../../libu/numconv/mpp/sd2udee.c:228
                              228     ../../libu/numconv/mpp/sd2udee.c: No such file or directory.
                                      in ../../libu/numconv/mpp/sd2udee.c
                              Current language:  auto; currently asm
                              (gdb) where
                              #0  _sd2udee () at ../../libu/numconv/mpp/sd2udee.c:228
                              #1  0xffa5a5a5ffa5a5a5 in ?? ()
                              #2  0xffa5a5a5ffa5a5a5 in ?? ()
                              #3  0xfff5a5a5fff5a5a5 in ?? ()
                              #4  0xffa5a5a5ffa5a5a5 in ?? ()
                              #5  0x8000000000000000 in ?? ()
                              #6  0xffa5a5a5ffa5a5a5 in ?? ()
                              ...

                              At least two of our other execs were QA'd with FPE enabled, so chances are that the problem is in the compiler.

                                • BUG: I/O handling after read past eof
                                  dgilmore

                                  Hi Tadeusz,

                                  Thanks for getting back. Let me digest what you have sent.  I'll get back tomorrow on how we can proceed on resolving these problems.

                                  Doug

                                    • BUG: I/O handling after read past eof
                                      dgilmore

                                      > Doug,
                                      >
                                      > sorry for delay, I put your compiler on a backburner for a while.
                                      >
                                      > Anyway - the bug was resulting in a wrong value of a counter, but at
                                      > first glance it could be e.g. wrong reading from the file (it was
                                      > associated with the date from a file). But the routine itself is
                                      > high level driver routine and it might not be easy to track to you
                                      > the issue. The routine does not have do-loops, just high level
                                      > if/then decisions and various calls. I can send you but I will need
                                      > to know what kind of NDA we can have.
                                      I am in the process of sorting this out.   Could you tell us how large this code is?

                                      Is this an optimization problem (does the program succeed when compiled with -O0 or -g)?
                                      >
                                      > In the meantime I had another bug developed (ii.e. due to the minor changes to a code which already worked ok). The result is SIGFPE, and meaningless gdb trace:
                                      >
                                      > Program received signal SIGFPE, Arithmetic exception.
                                      > _sd2udee () at ../../libu/numconv/mpp/sd2udee.c:228
                                      > 228     ../../libu/numconv/mpp/sd2udee.c: No such file or directory.
                                      >         in ../../libu/numconv/mpp/sd2udee.c
                                      > Current language:  auto; currently asm
                                      > (gdb) where
                                      > #0  _sd2udee () at ../../libu/numconv/mpp/sd2udee.c:228
                                      > #1  0xffa5a5a5ffa5a5a5 in ?? ()
                                      > #2  0xffa5a5a5ffa5a5a5 in ?? ()
                                      > #3  0xfff5a5a5fff5a5a5 in ?? ()
                                      > #4  0xffa5a5a5ffa5a5a5 in ?? ()
                                      > #5  0x8000000000000000 in ?? ()
                                      > #6  0xffa5a5a5ffa5a5a5 in ?? ()
                                      > ...
                                      > Tadeusz
                                      Sorry about the back-trace issue, I hope that we can get around to addressing this problem in the not-too-distant future.

                                      When gdb stops at this exception, could you type:

                                      x/i $pc
                                      info all-registers

                                      and send back the output?

                                      Thanks,

                                      Doug

                                      • BUG: I/O handling after read past eof
                                        tad@altair.com

                                        Hi,

                                        The answers are probably disappointing ....

                                        The code is big: 86MB debug / 36MB opt2 (21-144MB across all the different platforms). Yes, this is an optimization issue: with -O1 for that one routine it works fine. This particular routine is 1200 lines, 50KB, so it is not bad - as I mentioned, it is mostly if-then, top-to-bottom flow.

                                        The second problem, the one with the invalid trace, I can't reproduce anymore. As I said, it appeared suddenly and it is gone now (I have recompiled since, so I have no exec). I have the option to rebuild with the same source, and I may find time to do it on Monday.

                                        Tadeusz

                                          • BUG: I/O handling after read past eof
                                            dgilmore

                                            > Hi,
                                            >
                                            > The answers are probably disappointing ....
                                            >
                                            > The code is big: 86MB debug/ 36MB opt2 (21-144MB with
                                            > all different platforms), Yes, this is optimization
                                            > issue, -O1 for one routine and it works fine. This
                                            > particular routine is 1200 lines, 50KB, so it is not
                                            > bad - as I mentioned it is mostly if-then, top to
                                            > bottom flow.
                                            Is this a Fortran routine?  Is there any chance that an array of length one is being accessed (the array element is not used, the array symbol is just used as the base of a dynamically allocated array)?

                                            I attached a regression test for this bug, which we recently fixed.  If you are seeing this bug, the workaround is to change the array size from one to two.
                                            >
                                            > The second problem, the one with invalid trace - I
                                            > can't reproduce anymore. As I said - it appeared
                                            > suddenly and it is gone now (I recompiled since, so I
                                            > have no exec) I have option to rebuild with the same
                                            > source, I may find time to do it on Monday.
                                            This one seems strange.  The problem is in the code that checks whether a value to be printed is a NaN.  I wrote a test program that printed a NaN value and exercised this code, and I didn't see any problem.  It could possibly be a state corruption problem that comes and goes.

                                            Doug
                                            >
                                            > Tadeusz