4 Replies Latest reply on Feb 16, 2018 4:34 AM by dipak

    Ubuntu + Vega: std::bad_alloc on libhsa-ext-finalize64 dynload

    polarnick

      Hi!

       

      My setup:

      - Ubuntu 16.04 + RX Vega 56 (gfx900)

      - Driver: amdgpu-pro-17.50-511655 (installed with amdgpu-pro-install -y --opencl=rocm)

       

      1. My app works fine until it calls ocl_init() in CLEW; that call ends in a SIGSEGV:

       

      Thread 1 "app" received signal SIGSEGV, Segmentation fault.

      0x00001554d599d0f8 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

      (gdb) bt

      #0  0x00001554d599d0f8 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

      #1  0x00001554d598f016 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

      #2  0x000015555533e6ba in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdd88, env=env@entry=0x2f26b70) at dl-init.c:72

      #3  0x000015555533e7cb in call_init (env=0x2f26b70, argv=0x7fffffffdd88, argc=1, l=<optimized out>) at dl-init.c:30

      #4  _dl_init (main_map=main_map@entry=0x36ed6c0, argc=1, argv=0x7fffffffdd88, env=0x2f26b70) at dl-init.c:120

      #5  0x00001555553438e2 in dl_open_worker (a=a@entry=0x7fffffff9d30) at dl-open.c:575

      #6  0x000015555533e564 in _dl_catch_error (objname=objname@entry=0x7fffffff9d20, errstring=errstring@entry=0x7fffffff9d28, mallocedp=mallocedp@entry=0x7fffffff9d1f,

          operate=operate@entry=0x1555553434d0 <dl_open_worker>, args=args@entry=0x7fffffff9d30) at dl-error.c:187

      #7  0x0000155555342da9 in _dl_open (file=0x36ed548 "libhsa-ext-finalize64.so.1", mode=-2147483647, caller_dlopen=0x1555509bcad2, nsid=-2, argc=<optimized out>, argv=<optimized out>, env=0x2f26b70)

          at dl-open.c:660

      #8  0x000015554f9d8f09 in dlopen_doit (a=a@entry=0x7fffffff9f60) at dlopen.c:66

      #9  0x000015555533e564 in _dl_catch_error (objname=0x2eb26d0, errstring=0x2eb26d8, mallocedp=0x2eb26c8, operate=0x15554f9d8eb0 <dlopen_doit>, args=0x7fffffff9f60) at dl-error.c:187

      #10 0x000015554f9d9571 in _dlerror_run (operate=operate@entry=0x15554f9d8eb0 <dlopen_doit>, args=args@entry=0x7fffffff9f60) at dlerror.c:163

      #11 0x000015554f9d8fa1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87

      #12 0x00001555509bcad2 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

      #13 0x00001555509c2737 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

      #14 0x00001555509c7cdd in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

      #15 0x00001555509aeb4a in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

      #16 0x00001555513fd1bd in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

      #17 0x00001555513cf503 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

      #18 0x00001555513eb6b7 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

      #19 0x00001555513b79e2 in clIcdGetPlatformIDsKHR () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

      #20 0x00001554d6dfc1f2 in ?? () from /usr/local/cuda-8.0/lib64/libOpenCL.so

      #21 0x00001554d6dfde82 in ?? () from /usr/local/cuda-8.0/lib64/libOpenCL.so

      #22 0x00001554d6dfc6c1 in clGetPlatformIDs () from /usr/local/cuda-8.0/lib64/libOpenCL.so

      #23 0x000000000186ccd2 in OpenCLEnum::enumPlatforms (this=this@entry=0x7fffffffa4b0) at <opencl/enum.cpp>:39

      ... (see the full stack trace in the attached stacktrace1.txt)

       

      After investigating with strace, I could reduce the problem to this:

       

      2. When my app is launched like this:

       

      LD_PRELOAD=/opt/amdgpu-pro/lib/x86_64-linux-gnu/libGL.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsakmt.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1 LD_LIBRARY_PATH=$dirname:/usr/lib/x86_64-linux-gnu/debug/ ./app

       

      It fails on launch:

       

      terminate called after throwing an instance of 'std::bad_alloc'

        what():  std::bad_alloc

      Program received signal SIGABRT, Aborted.

      0x000015554c8d4428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54

      54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

      (gdb) bt

      #0  0x000015554c8d4428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54

      #1  0x000015554c8d602a in __GI_abort () at abort.c:89

      #2  0x000015554d449d6d in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95

      #3  0x000015554d447bd6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:47

      #4  0x000015554d447c21 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:57

      #5  0x000015554d447e39 in __cxxabiv1::__cxa_throw (obj=0x155540000940, tinfo=0x15554d74eb20 <typeinfo for std::bad_alloc>, dest=0x15554d446010 <std::bad_alloc::~bad_alloc()>)

          at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:87

      #6  0x000015554d4483dc in operator new (sz=18446744073709551608) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:54

      #7  0x000015554fd4734c in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

      #8  0x000015554fd49a63 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

      #9  0x000015554fd3c016 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

      #10 0x000015555533e6ba in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdd48, env=env@entry=0x7fffffffdd58) at dl-init.c:72

      #11 0x000015555533e7cb in call_init (env=0x7fffffffdd58, argv=0x7fffffffdd48, argc=1, l=<optimized out>) at dl-init.c:30

      #12 _dl_init (main_map=0x155555555168, argc=1, argv=0x7fffffffdd48, env=0x7fffffffdd58) at dl-init.c:120

      #13 0x000015555532ec6a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2

      #14 0x0000000000000001 in ?? ()

      #15 0x00007fffffffe0c9 in ?? ()

      #16 0x0000000000000000 in ?? ()

      (see this stack trace in the attached stacktrace2.txt)

       

      So it seems that the initialization of libhsa-ext-finalize64.so.1 calls new(sz=18446744073709551608), which throws std::bad_alloc. If I remove libhsa-ext-finalize64.so.1 from LD_PRELOAD, situation 2 becomes identical to situation 1.
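One way to narrow this down further might be to dlopen each of the preloaded libraries on its own, each in a separate child process, so that a crash in a library's initializer kills only the child. The sketch below is only an illustration of that idea: the helper name probe_dlopen is made up for this post, and it assumes a Python 3 interpreter with ctypes is available on the machine. It loads through ctypes.CDLL, so the load order and environment differ from the real app and the result is a hint at best.

```python
import subprocess
import sys

# Code run in the child process: dlopen() the library passed as argv[1].
# Exit code 2 means the dynamic loader refused the file (OSError).
_CHILD = (
    "import ctypes, sys\n"
    "try:\n"
    "    ctypes.CDLL(sys.argv[1])\n"
    "except OSError:\n"
    "    sys.exit(2)\n"
)


def probe_dlopen(path):
    """Load `path` in a child process and classify what happened.

    Hypothetical helper, not part of any AMD or CLEW tooling.
    """
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # Negative return code: the child was terminated by a signal
        # (11 would be SIGSEGV, 6 would be SIGABRT).
        return "killed by signal %d" % -proc.returncode
    if proc.returncode == 2:
        return "load error"
    return "exit code %d" % proc.returncode


if __name__ == "__main__":
    # Library paths taken from the LD_PRELOAD line above.
    prefix = "/opt/amdgpu-pro/lib/x86_64-linux-gnu/"
    for name in ("libhsakmt.so.1", "libhsa-runtime64.so.1",
                 "libhsa-ext-finalize64.so.1", "libamdocl-rocr64.so"):
        print(name, "->", probe_dlopen(prefix + name))
```

If libhsa-ext-finalize64.so.1 alone reports a signal while the others report "ok", that would support the idea that its initializer is the failing part, independent of my app.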

       

      Can I do anything else to investigate this behaviour? Is it a known bug? new is called with sz=0xfffffffffffffff8, which looks like an uninitialized variable in the driver, or a decremented NULL value.
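To make the arithmetic behind that guess explicit, here is a small check of the size value from frame #6. It only shows that the number equals 2^64 - 8, i.e. -8 when read as a signed 64-bit integer; the two "possible origins" at the end are assumptions for illustration, not something I can see in the driver code.

```python
# Size passed to operator new in frame #6 of the second backtrace.
sz = 18446744073709551608


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - 2**64 if value >= 2**63 else value


# The decimal value from gdb is the same as the hex value quoted above.
assert sz == 0xfffffffffffffff8
assert sz == 2**64 - 8

# Read as a signed 64-bit integer, the requested size is -8.
assert as_signed64(sz) == -8

# Two hypothetical ways an unsigned 64-bit size could end up with this value:
# subtracting 8 from zero, or multiplying an element count of -1 by 8 bytes.
assert (0 - 8) % 2**64 == sz
assert (-1 * 8) % 2**64 == sz

print(hex(sz), as_signed64(sz))
```

So whatever computes the allocation size seems to end up with -8 bytes (or -1 elements of 8 bytes each) before it is converted to an unsigned size.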