2 Replies Latest reply on Jan 11, 2013 1:16 PM by aciani1@uic.edu

    ACML and CentOS 5

    aciani1@uic.edu

      I am obtaining unexpected results and segmentation faults using ACML 4.4.0 gfortran64 on a CentOS 5.8 system using gcc 4.1.2 or gcc 4.4.6.

       

      I have performed some debugging, and it seems that after the call to ACML, the stack has been altered, clobbered or smashed.

       

      For example:

      Breakpoint 2, main (argc=17, argv=0x7fffffffe738) at sw_fit.c:209

      209             dspsv(UPLO, Njac, 1, JTWJ, &i, JTWDY, Njac, &j);

      (gdb) x /20xg $rsp

      0x7fffffffe5a0: 0x00002aaaacc68000      0x0000003028e0cf65

      0x7fffffffe5b0: 0x00007fffffffe738      0x0000001100000000

      0x7fffffffe5c0: 0x0000000000000003      0x0000000000609010

      0x7fffffffe5d0: 0x0000000000608ce0      0x0000001000000003

      0x7fffffffe5e0: 0x00007fff00000007      0x0000002a00000010

      0x7fffffffe5f0: 0x7fffffffffffffff      0x550000302901cbc0

      0x7fffffffe600: 0x0000000000609030      0x000000000060afc0

      0x7fffffffe610: 0x000000000062b430      0x000000000062c940

      0x7fffffffe620: 0x000000000060f140      0x0000000000608c00

      0x7fffffffe630: 0x0000000000616820      0x00000000006265b0

      (gdb) cont

      Continuing.

       

      Breakpoint 3, main (argc=17, argv=0x7fffffffe738) at sw_fit.c:211

      211             res = 0.0;

      (gdb) x /20xg $rsp

      0x7fffffffe5a0: 0x00002aaa0000002a      0x00007fffffffe5d8

      0x7fffffffe5b0: 0x00007fffffffe738      0x0000001100000000

      0x7fffffffe5c0: 0x0000000000000003      0x0000000000609010

      0x7fffffffe5d0: 0x0000000000608ce0      0x000000010000002a

      0x7fffffffe5e0: 0x0000000300000002      0x0000000500000004

      0x7fffffffe5f0: 0x0000000700000006      0x0000000900000008

      0x7fffffffe600: 0x0000000b0000000a      0x0000000d0000000c

      0x7fffffffe610: 0x0000000f0000000e      0x0000001100000010

      0x7fffffffe620: 0x0000001300000012      0x0000001500000014

      0x7fffffffe630: 0x0000001700000016      0x0000001900000018

       

      It appears as though the stack is clobbered or smashed with a numerical sequence.  I have had other strange things occur, such as numerical constants being changed.  For example, LDA might be 4 before a call to ACML, and then be 2 afterward.

       

      The problem is occurring with multiple programs, from simple command line tools that solve small sets of linear equations, to density functional theory codes.

       

      This behavior also occurs with ACML 3.6.0, with both the static and the shared libraries.

        • Re: ACML and CentOS 5
          chipf

          The stack dump on its own may not be providing useful information.  You also need to look at the disassembly view to truly understand how that stack area can be changing.  For instance, storage for the pivot table resides on the stack, and you can see that it has been filled in by the call, starting at ...E5DC.
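To make the pivot entries in the second dump easier to see, here is a minimal sketch that splits each 64-bit word printed by `x /20xg` into the two 32-bit integers it holds. It assumes a little-endian x86-64 target with 4-byte `int`, as on the poster's system; the function names `split_giant_word` and `dump_as_ints` are hypothetical helpers, not part of gdb or ACML.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Split one 64-bit word, as printed by gdb's "x /xg", into the two 32-bit
   integers it holds on a little-endian machine (assumption: 4-byte int).
   out[0] is the int at the lower address, out[1] the int 4 bytes above it. */
void split_giant_word(uint64_t word, int32_t out[2])
{
    assert(out != NULL);
    out[0] = (int32_t)(uint32_t)(word & 0xffffffffu);
    out[1] = (int32_t)(uint32_t)(word >> 32);
}

/* Print every 32-bit integer stored in `count` consecutive 64-bit words,
   where `base` is the address of the first word. */
void dump_as_ints(uint64_t base, const uint64_t *words, int count)
{
    int32_t pair[2];

    for (int k = 0; k < count; ++k) {
        uint64_t addr = base + 8u * (uint64_t)k;

        split_giant_word(words[k], pair);
        printf("0x%llx: %d\n", (unsigned long long)addr, (int)pair[0]);
        printf("0x%llx: %d\n", (unsigned long long)(addr + 4u), (int)pair[1]);
    }
}
```

Fed the words `0x000000010000002a` and `0x0000000300000002` from the dump with base address `0x7fffffffe5d8`, this shows 42 at ...e5d8 followed by 1, 2, 3 beginning at ...e5dc, which is the run of pivot indices written by the call.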

           

          I notice that you have passed the address of "i" for the pivot table.  Is "i" declared as a large enough integer array?  If not, then this will certainly overwrite other variables.  It should have a size of at least Njac.  If it is an array, then you don't need the address-of operator, and in fact the compiler should warn about the syntax you provided if it is an array.

           

          Naturally you would have proven that your test program works properly with the Netlib BLAS and LAPACK.

          If this is a problem with ACML, it is likely fixed in newer versions.

            • Re: ACML and CentOS 5
              aciani1@uic.edu

              Thanks Chip,

               

              Good catch.  The pivot table was indeed on the stack, and was insufficiently sized.  Depending on how routines were optimized, sometimes neighboring variables were overwritten and other times things were fine.  Several codes declared IPIV on the stack with too few entries, and these were the ones producing the problems.  They were modified to allocate IPIV on the heap, based on the size of the problem.
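The fix described above might look roughly like the following sketch, which allocates the pivot array on the heap with one entry per equation before calling the solver. The `dspsv_fn` type is modelled on the argument order of the `dspsv` call in the original post; `solve_packed_symmetric` and `stand_in_dspsv` are hypothetical names, and the stand-in only imitates the way the real routine writes n pivot entries so that the sketch compiles without ACML installed.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Argument order taken from the call in the original post:
   dspsv(uplo, n, nrhs, ap, ipiv, b, ldb, &info). */
typedef void (*dspsv_fn)(char uplo, int n, int nrhs, double *ap,
                         int *ipiv, double *b, int ldb, int *info);

/* Hypothetical stand-in for the library routine. It does NOT solve
   anything; it only writes n pivot entries, which is the behavior that
   overruns an undersized ipiv. */
void stand_in_dspsv(char uplo, int n, int nrhs, double *ap,
                    int *ipiv, double *b, int ldb, int *info)
{
    (void)uplo; (void)nrhs; (void)ap; (void)b; (void)ldb;
    for (int k = 0; k < n; ++k)
        ipiv[k] = k + 1;
    *info = 0;
}

/* Hypothetical wrapper: sizes ipiv from the problem dimension n instead of
   relying on a fixed-size (or scalar) stack variable. Returns the solver's
   info value, or INT_MIN if n is not positive or the allocation fails. */
int solve_packed_symmetric(dspsv_fn solver, char uplo, int n, int nrhs,
                           double *ap, double *b, int ldb)
{
    int info = 0;
    int *ipiv;

    assert(solver != NULL);
    if (n <= 0)
        return INT_MIN;

    ipiv = malloc((size_t)n * sizeof *ipiv);   /* at least n entries */
    if (ipiv == NULL)
        return INT_MIN;

    solver(uplo, n, nrhs, ap, ipiv, b, ldb, &info);

    free(ipiv);
    return info;
}
```

With ACML linked in, the library's own `dspsv` would be passed in place of `stand_in_dspsv`, assuming its C prototype matches the `dspsv_fn` type above.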