12 Replies Latest reply on May 26, 2012 4:39 PM by yurtesen

    APPML segmentation faults

    yurtesen

      I am getting segmentation faults with APPML when trying to run the examples. OpenCL programs normally work fine, clinfo lists the AMD GPUs in its output, and I ran the appmlEnv.sh script.

       

      # ./example_sgemv

      Segmentation fault (core dumped)

      #

      Any ideas why this might happen?

        • Re: APPML segmentation faults
          yurtesen

          Strangely, the binaries segfault on SL6 and Fedora systems while they seem to work on Ubuntu. Anyway, recompiling solves the problem.

            • Re: APPML segmentation faults
              kknox

              Hi Yurtesen~

               

              You recompiled the sample programs and they worked, correct?  If you run ldd on the example programs that we pre-compiled for you, what does it show?  I suspect that if a recompile solved the problem, SL6 and Fedora do not have the dependencies that the example programs expect, and ldd should show this.  I'm glad that the recompile works.

                • Re: APPML segmentation faults
                  yurtesen

                  Hello,

                  Interestingly, ldd shows the same things on Ubuntu and Fedora (just different paths).

                  Fedora:

                  $ ldd /opt/clAmdBlas-1.7.257/bin64/example_sgemm

                      linux-vdso.so.1 =>  (0x00007fff0d990000)

                      libOpenCL.so.1 => /usr/lib64/libOpenCL.so.1 (0x0000003fd3e00000)

                      libclAmdBlas.so.1 => /opt/clAmdBlas-1.7.257/lib64/libclAmdBlas.so.1 (0x00007fc4b0630000)

                      libm.so.6 => /lib64/libm.so.6 (0x000000334a000000)

                      libc.so.6 => /lib64/libc.so.6 (0x0000003349000000)

                      libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e23a00000)

                      libdl.so.2 => /lib64/libdl.so.2 (0x0000003349800000)

                      libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003356000000)

                      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000334a400000)

                      /lib64/ld-linux-x86-64.so.2 (0x0000003348c00000)

                  Ubuntu (sorry about the version difference, I just installed this one on my laptop; I think this is the version I tested on SL6, though):

                  $ ldd /opt/clAmdBlas-1.8.269/bin64/example_sgemm

                      linux-vdso.so.1 =>  (0x00007fff9d7be000)

                      libOpenCL.so.1 => /usr/lib/libOpenCL.so.1 (0x00007f604e902000)

                      libclAmdBlas.so.1 => /opt/clAmdBlas-1.8.269/lib64/libclAmdBlas.so.1 (0x00007f604e63d000)

                      libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f604e343000)

                      libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f604df86000)

                      libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f604dd68000)

                      libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f604db64000)

                      libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f604d864000)

                      libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f604d64d000)

                      /lib64/ld-linux-x86-64.so.2 (0x00007f604eb1a000)

                  and after recompilation on Fedora (works):

                  $ ldd a.out

                      linux-vdso.so.1 =>  (0x00007fff10480000)

                      libclAmdBlas.so.1 => /opt/clAmdBlas-1.7.257/lib64/libclAmdBlas.so.1 (0x00007f8561b48000)

                      libOpenCL.so => /usr/lib64/libOpenCL.so (0x00007f8561928000)

                      libc.so.6 => /lib64/libc.so.6 (0x0000003349000000)

                      libOpenCL.so.1 => /usr/lib64/libOpenCL.so.1 (0x0000003fd3e00000)

                      libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003356000000)

                      libm.so.6 => /lib64/libm.so.6 (0x000000334a000000)

                      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000334a400000)

                      libdl.so.2 => /lib64/libdl.so.2 (0x0000003349800000)

                      libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x000000334a800000)

                      /lib64/ld-linux-x86-64.so.2 (0x0000003348c00000)

                      libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e23a00000)

                  I went ahead and ran the non-working example with valgrind. I am not sure what to make of this, but here is the output:

                  $ valgrind /opt/clAmdBlas-1.7.257/bin64/example_sgemm

                  ==23075== Memcheck, a memory error detector

                  ==23075== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.

                  ==23075== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info

                  ==23075== Command: /opt/clAmdBlas-1.7.257/bin64/example_sgemm

                  ==23075==

                  ==23075== Conditional jump or move depends on uninitialised value(s)

                  ==23075==    at 0xAAD0788: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAACFEE7: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAAD0160: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAAC03A2: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAABF6E5: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAAB03C5: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xA895835: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xA8958AD: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAAE110D: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAAE18B0: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAAE1AB1: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==    by 0xAAD8293: ??? (in /usr/lib64/libaticaldd.so)

                  ==23075==

                  ==23075== Syscall param write(buf) points to uninitialised byte(s)

                  ==23075==    at 0x33490E421D: ??? (in /lib64/libc-2.14.90.so)

                  ==23075==    by 0x917308B: _libelf_update_elf (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x9173EA8: elf_update (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x916E442: amd::OclElf::dumpImage(char**, unsigned long*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x90FB7F9: device::ClBinary::createElfBinary(bool, device::Program::type_t) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x91528A7: gpu::NullProgram::createBinary(amd::option::Options*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x915729C: gpu::NullProgram::linkImpl(amd::option::Options*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x90FDD3D: device::Program::build(std::string const&, char const*, amd::option::Options*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x910B337: amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x9139C58: gpu::Device::BlitProgram::create(gpu::Device*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x913C6A7: gpu::Device::create(unsigned int) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x913E2FE: gpu::Device::init() (in /usr/lib64/libamdocl64.so)

                  ==23075==  Address 0xe95a9f5 is 24,213 bytes inside a block of size 171,516 alloc'd

                  ==23075==    at 0x4A074CD: malloc (vg_replace_malloc.c:236)

                  ==23075==    by 0x9172B33: _libelf_update_elf (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x9173EA8: elf_update (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x916E442: amd::OclElf::dumpImage(char**, unsigned long*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x90FB7F9: device::ClBinary::createElfBinary(bool, device::Program::type_t) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x91528A7: gpu::NullProgram::createBinary(amd::option::Options*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x915729C: gpu::NullProgram::linkImpl(amd::option::Options*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x90FDD3D: device::Program::build(std::string const&, char const*, amd::option::Options*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x910B337: amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x9139C58: gpu::Device::BlitProgram::create(gpu::Device*) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x913C6A7: gpu::Device::create(unsigned int) (in /usr/lib64/libamdocl64.so)

                  ==23075==    by 0x913E2FE: gpu::Device::init() (in /usr/lib64/libamdocl64.so)

                  ==23075==

                  ==23075== Conditional jump or move depends on uninitialised value(s)

                  ==23075==    at 0x3FD3E0341F: clCreateContext (in /usr/lib64/libOpenCL.so.1)

                  ==23075==    by 0x400DCC: main (in /opt/clAmdBlas-1.7.257/bin64/example_sgemm)

                  ==23075==

                  ==23075== Use of uninitialised value of size 8

                  ==23075==    at 0x3FD3E03421: clCreateContext (in /usr/lib64/libOpenCL.so.1)

                  ==23075==    by 0x400DCC: main (in /opt/clAmdBlas-1.7.257/bin64/example_sgemm)

                  ==23075==

                  ==23075== Invalid read of size 8

                  ==23075==    at 0x3FD3E03440: clCreateContext (in /usr/lib64/libOpenCL.so.1)

                  ==23075==    by 0x400DCC: main (in /opt/clAmdBlas-1.7.257/bin64/example_sgemm)

                  ==23075==  Address 0xa35ffc308c48368 is not stack'd, malloc'd or (recently) free'd

                  ==23075==

                  ==23075==

                  ==23075== Process terminating with default action of signal 11 (SIGSEGV)

                  ==23075==  General Protection Fault

                  ==23075==    at 0x3FD3E03440: clCreateContext (in /usr/lib64/libOpenCL.so.1)

                  ==23075==    by 0x400DCC: main (in /opt/clAmdBlas-1.7.257/bin64/example_sgemm)

                  ==23075==

                  ==23075== HEAP SUMMARY:

                  ==23075==     in use at exit: 16,325,408 bytes in 10,225 blocks

                  ==23075==   total heap usage: 378,046 allocs, 367,821 frees, 197,355,491 bytes allocated

                  ==23075==

                  ==23075== LEAK SUMMARY:

                  ==23075==    definitely lost: 10,901 bytes in 1,069 blocks

                  ==23075==    indirectly lost: 1,309,608 bytes in 4 blocks

                  ==23075==      possibly lost: 1,224,759 bytes in 3,111 blocks

                  ==23075==    still reachable: 13,780,140 bytes in 6,041 blocks

                  ==23075==         suppressed: 0 bytes in 0 blocks

                  ==23075== Rerun with --leak-check=full to see details of leaked memory

                  ==23075==

                  ==23075== For counts of detected and suppressed errors, rerun with: -v

                  ==23075== Use --track-origins=yes to see where uninitialised values come from

                  ==23075== ERROR SUMMARY: 7 errors from 5 contexts (suppressed: 4 from 3)

                  Killed

                    • Re: APPML segmentation faults
                      kknox

                      I've run valgrind on our library before too, and saw many of the same warnings from libamdocl64.so.  I had the driver & runtime team take a look at them, and they verified that they were false positives. 

                       

                      The interesting error comes from the GPF in clCreateContext().  I can't think of why the program would crash in that routine, except that maybe parameters are not passed on the stack correctly, or they got corrupted somehow.  I see that after you recompiled on Fedora, you link with both libOpenCL.so and libOpenCL.so.1.  Something doesn't seem right there; libOpenCL.so should only be a symbolic link to libOpenCL.so.1.  Maybe you have multiple installations of the APP SDK and old files are hanging around.

                       

                      Kent

                        • Re: APPML segmentation faults
                          yurtesen

                          On my Fedora box, the Intel SDK is installed as well. But that was a long time ago, I have upgraded amd-app-sdk several times since, and I assumed it would overwrite the libOpenCL* files. Indeed, the file date corresponds to when I installed SDK 2.7.

                           

                          On SL6 there is the NVIDIA SDK, and the Ubuntu test case was my laptop, which had amd-sdk only. I have a clean SL6 virtual machine here; I will try to install the SDK there and get back to you.

                           

                          One difference I saw was libnuma, which is missing on Ubuntu but exists on Fedora and SL6. Could that be related?

                           

                          As you can see, the links are as they are supposed to be:

                          Fedora16:

                          [eyurtese@extremum test]$ ls -al /usr/lib/libOpenCL.so*

                          lrwxrwxrwx 1 root root    14 Nov 29 08:36 /usr/lib/libOpenCL.so -> libOpenCL.so.1

                          -rw-r--r-- 1 root root 26632 May 15 15:20 /usr/lib/libOpenCL.so.1

                          [eyurtese@extremum test]$ ldconfig -p |grep OpenCL

                                  libOpenCL.so.1 (libc6,x86-64) => /usr/lib64/libOpenCL.so.1

                                  libOpenCL.so.1 (libc6) => /usr/lib/libOpenCL.so.1

                                  libOpenCL.so (libc6,x86-64) => /usr/lib64/libOpenCL.so

                                  libOpenCL.so (libc6) => /usr/lib/libOpenCL.so

                          [eyurtese@extremum test]$

                          • Re: APPML segmentation faults
                            yurtesen

                            OK, now I have found a lot of interesting information. First of all, it appears the missing GPU is the problem (it has nothing to do with libraries or the operating system), but recompiling the source with -O3 sort of accidentally fixes it.

                             

                            example_sgemm.c is looking for a GPU

                             

                            err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

                             

                            but I have only a CPU on the node where I build programs. OK, this should make the example non-functional, but it should not cause a segmentation fault (in my opinion).

                             

                            Check out the following:

                            # gcc -O1 -I$CLAMDBLAS_INCLUDE -I$OPENCL_INCLUDE -lOpenCL -lclAmdBlas example_sgemm.c ; ./a.out

                            Segmentation fault

                            # gcc -O2 -I$CLAMDBLAS_INCLUDE -I$OPENCL_INCLUDE -lOpenCL -lclAmdBlas example_sgemm.c ; ./a.out

                            clAmdBlasSgemm() failed with -1022

                            clAmdBlasSgemmEx() failed with -1022

                            #

                             

                            I think the program should fail with -1022 in all cases, but because of the uninitialized id it segfaults instead. Evidently, -O2 and -O3 happen to leave the 'device' id set to 0, while -O1 and lower leave 'device' with a random value (thus causing the crash).

                             

                            So, I found that calling clGetDeviceIDs with an uninitialized 'platform' id, or clCreateContext with an uninitialized 'device' id, causes the segmentation fault. I expected them to return CL_INVALID_PLATFORM or CL_INVALID_DEVICE instead of crashing. So my question now is: while it is a user error to pass an uninitialized variable, shouldn't the APP SDK be able to cope with it?

                             

                            Thanks,

                            Evren

                              • Re: APPML segmentation faults
                                kknox

                                Hi Yurtesen~

                                 

                                I talked to the runtime team. Since cl_platform_id and cl_device_id are pointers, the OpenCL APIs check whether they are NULL and return the appropriate error codes.  If the pointers are not NULL, the team decided that it is too costly to determine whether a pointer is valid.

                                 

                                I have checked in basic error checking for the OpenCL initialization code in our samples.  Our next release will have these updated samples.

                                 

                                Kent

                                  • Re: APPML segmentation faults
                                    yurtesen

                                    Hello Kent,

                                     

                                    1-) Would it be possible to have a default initialization value for cl_platform_id and cl_device_id (which would be NULL)? This could prevent headaches in the future. I also think some other types, such as cl_int, have defaults?

                                     

                                    2-) Even though the value is uninitialized, can't you get away with checking only its range? You will get a random number from the uninitialized variable. Indeed, I am able to print out the value contained in cl_platform_id and cl_device_id without causing a segmentation fault. If I set the cl_device_id to 1445906816 before creating the context, it also causes a crash even though the value is initialized.

                                     

                                    So, wouldn't it be possible for the SDK to simply check that this value is within an acceptable range? For example, if the SDK expects platform IDs to be between 1 and 256 and the random value is 1445906816, it could simply bail out with a single if statement.

                                     

                                    3-) The APP-SDK samples use some common code for detecting the platforms etc. in a very nice way. Couldn't you use it, since the examples work for platform 0, device 0 only?

                                     

                                    Thanks,

                                    Evren

                                      • Re: APPML segmentation faults
                                        kknox

                                        1-) Would it be possible to have a default initialization value for cl_platform_id and cl_device_id (which would be NULL)? This could prevent headaches in the future. I also think some other types, such as cl_int, have defaults?

                                        I am modifying the sample programs to default the values to NULL.  However, if you are asking whether C gives local pointers or ints a default value, it does not.

                                        2-) Even though the value is uninitialized, can't you get away with checking only its range? You will get a random number from the uninitialized variable. Indeed, I am able to print out the value contained in cl_platform_id and cl_device_id without causing a segmentation fault. If I set the cl_device_id to 1445906816 before creating the context, it also causes a crash even though the value is initialized.

                                        So, wouldn't it be possible for the SDK to simply check that this value is within an acceptable range? For example, if the SDK expects platform IDs to be between 1 and 256 and the random value is 1445906816, it could simply bail out with a single if statement.

                                        It's hard to check the validity of a pointer, and it is something that the runtime team decided was too costly.  A pointer can hold almost any value in your virtual address space (the one special value is NULL).  This is not just an index into some table; it's a true pointer to some memory address.

                                        3-) The APP-SDK samples use some common code for detecting the platforms etc. in a very nice way. Couldn't you use it, since the examples work for platform 0, device 0 only?

                                        This is a decent idea, and maybe one that we could pursue.  I think we would still have to convert and massage code, because most of our samples are C files and not C++.  My hope is that with my fixes, the sample programs will be well behaved now.

                                          • Re: APPML segmentation faults
                                            yurtesen

                                            kknox wrote:

                                            3-) The APP-SDK samples use some common code for detecting the platforms etc. in a very nice way. Couldn't you use it, since the examples work for platform 0, device 0 only?

                                            This is a decent idea, and maybe one that we could pursue.  I think we would still have to convert and massage code, because most of our samples are C files and not C++.  My hope is that with my fixes, the sample programs will be well behaved now.

                                            I will let you know if I come up with a simple codeset for this. You should convert your examples to C++; it would also be nice to have a C++ interface for clAmdBlas.

                                            • Re: APPML segmentation faults
                                              yurtesen

                                              kknox wrote:

                                               

                                              1-) Would it be possible to have a default initialization value for cl_platform_id and cl_device_id (which would be NULL)? This could prevent headaches in the future. I also think some other types, such as cl_int, have defaults?

                                              I am modifying the sample programs to default the values to NULL.  However, if you are asking whether C gives local pointers or ints a default value, it does not.

                                              Yes, I was asking whether it would be possible to set C defaults for the cl_* types... but I think there might be a workaround without that. Before explaining it, here is a test program:

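                                              #include <stdio.h>
                                              #include <CL/cl.h>
                                              
                                              int main(void) {
                                              
                                                 /* All variables are deliberately left uninitialized. Reading them
                                                    is undefined behavior in C, so the values printed are whatever
                                                    the compiler happened to leave there. */
                                                 int *pointer;
                                                 printf("int pointer is %p\n", (void *)pointer);
                                              
                                                 cl_int err;
                                                 cl_platform_id platform;
                                                 cl_device_id device;
                                              
                                                 /* %p needs a void * argument; the handles are pointers, so print them with %p too */
                                                 printf("err: %p platform: %p device: %p address\n", (void *)&err, (void *)&platform, (void *)&device);
                                                 printf("err: %d platform: %p device: %p value\n", err, (void *)platform, (void *)device);
                                              
                                                 return 0;
                                              }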
                                              
                                              

                                               

                                              The output is interesting: 'pointer' prints as 'nil' while the cl_* variables have some random addresses. At each run, the addresses change but the values stay constant (at least on my system). If I compile with -O3, more or less everything starts changing, but the int pointer always prints 'nil'. This might be specific to GCC, of course; I imagine it is not guaranteed by the C standard.

                                               

                                              Back to my idea: if clGetPlatformIDs does not find a platform, it does not update the 'platform' pointer (assumption). Yet it could simply set it to NULL, couldn't it? The same goes for clGetDeviceIDs: it leaves the 'device' pointer untouched if it can't find a device (tested). Yet it could simply set the pointer to NULL. There might be other such calls.

                                               

                                              I know this is a very small cosmetic problem which would probably affect only very simple programs and beginners in OpenCL. But it would have helped to avoid a segmentation fault in the clAmdBlas examples and made them print a meaningful error message. It might also be useful for beginners, especially for people who do not check whether their OpenCL calls succeeded. I think it shouldn't be a hassle to set a default if no device or platform is found. What do you think?

                                      • Re: APPML segmentation faults
                                        yurtesen

                                        kknox wrote:

                                         

                                        I've run valgrind on our library before too, and saw many of the same warnings from libamdocl64.so.  I had the driver & runtime team take a look at them, and they verified that they were false positives. 

                                         

                                        I use valgrind often on a wide variety of codes. Valgrind often does not print a warning at the exact point where an uninitialized variable is assigned. For example, if I had uninitialized variables a and b, a=b; would not cause a warning. Then I can use c=a+3; d=10;  e=c+d;... But in the end, if I try to do if ( e > 10 ), a warning pops up. I check my code and I see e = c + d, so e was initialized... but in reality, it was initialized using the products of an uninitialized variable from way back at the beginning of the program.

                                         

                                        If you think valgrind is giving false positives, I would recommend making a test case and contacting the valgrind authors (the same way that I contact AMD when there is a problem). It tracks whether there was a write operation to a memory location, so I feel it is difficult for valgrind to make such a mistake. Anyway, it is just a suggestion.