Ubuntu + Vega: std::bad_alloc on libhsa-ext-finalize64 dynload

Discussion created by polarnick on Feb 7, 2018
Latest reply on Feb 28, 2018 by dipak

Hi!

 

My setup:

- Ubuntu 16.04 + RX Vega 56 (gfx900)

- Driver: amdgpu-pro-17.50-511655 (installed with amdgpu-pro-install -y --opencl=rocm)

 

1. My app works fine until it calls ocl_init() in CLEW. That call ends in a SIGSEGV:

 

Thread 1 "app" received signal SIGSEGV, Segmentation fault.

0x00001554d599d0f8 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

(gdb) bt

#0  0x00001554d599d0f8 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#1  0x00001554d598f016 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#2  0x000015555533e6ba in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdd88, env=env@entry=0x2f26b70) at dl-init.c:72

#3  0x000015555533e7cb in call_init (env=0x2f26b70, argv=0x7fffffffdd88, argc=1, l=<optimized out>) at dl-init.c:30

#4  _dl_init (main_map=main_map@entry=0x36ed6c0, argc=1, argv=0x7fffffffdd88, env=0x2f26b70) at dl-init.c:120

#5  0x00001555553438e2 in dl_open_worker (a=a@entry=0x7fffffff9d30) at dl-open.c:575

#6  0x000015555533e564 in _dl_catch_error (objname=objname@entry=0x7fffffff9d20, errstring=errstring@entry=0x7fffffff9d28, mallocedp=mallocedp@entry=0x7fffffff9d1f,

    operate=operate@entry=0x1555553434d0 <dl_open_worker>, args=args@entry=0x7fffffff9d30) at dl-error.c:187

#7  0x0000155555342da9 in _dl_open (file=0x36ed548 "libhsa-ext-finalize64.so.1", mode=-2147483647, caller_dlopen=0x1555509bcad2, nsid=-2, argc=<optimized out>, argv=<optimized out>, env=0x2f26b70)

    at dl-open.c:660

#8  0x000015554f9d8f09 in dlopen_doit (a=a@entry=0x7fffffff9f60) at dlopen.c:66

#9  0x000015555533e564 in _dl_catch_error (objname=0x2eb26d0, errstring=0x2eb26d8, mallocedp=0x2eb26c8, operate=0x15554f9d8eb0 <dlopen_doit>, args=0x7fffffff9f60) at dl-error.c:187

#10 0x000015554f9d9571 in _dlerror_run (operate=operate@entry=0x15554f9d8eb0 <dlopen_doit>, args=args@entry=0x7fffffff9f60) at dlerror.c:163

#11 0x000015554f9d8fa1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87

#12 0x00001555509bcad2 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#13 0x00001555509c2737 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#14 0x00001555509c7cdd in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#15 0x00001555509aeb4a in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#16 0x00001555513fd1bd in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#17 0x00001555513cf503 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#18 0x00001555513eb6b7 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#19 0x00001555513b79e2 in clIcdGetPlatformIDsKHR () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#20 0x00001554d6dfc1f2 in ?? () from /usr/local/cuda-8.0/lib64/libOpenCL.so

#21 0x00001554d6dfde82 in ?? () from /usr/local/cuda-8.0/lib64/libOpenCL.so

#22 0x00001554d6dfc6c1 in clGetPlatformIDs () from /usr/local/cuda-8.0/lib64/libOpenCL.so

#23 0x000000000186ccd2 in OpenCLEnum::enumPlatforms (this=this@entry=0x7fffffffa4b0) at <opencl/enum.cpp>:39

... (see the full stack trace in the attachment stracktrace1.txt)

 

After investigating with strace, I could reduce the problem to this:

 

2. When my app is launched like this:

 

LD_PRELOAD=/opt/amdgpu-pro/lib/x86_64-linux-gnu/libGL.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsakmt.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1 LD_LIBRARY_PATH=$dirname:/usr/lib/x86_64-linux-gnu/debug/ ./app

 

It fails on launch:

 

terminate called after throwing an instance of 'std::bad_alloc'

  what():  std::bad_alloc

Program received signal SIGABRT, Aborted.

0x000015554c8d4428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54

54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt

#0  0x000015554c8d4428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54

#1  0x000015554c8d602a in __GI_abort () at abort.c:89

#2  0x000015554d449d6d in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95

#3  0x000015554d447bd6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:47

#4  0x000015554d447c21 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:57

#5  0x000015554d447e39 in __cxxabiv1::__cxa_throw (obj=0x155540000940, tinfo=0x15554d74eb20 <typeinfo for std::bad_alloc>, dest=0x15554d446010 <std::bad_alloc::~bad_alloc()>)

    at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:87

#6  0x000015554d4483dc in operator new (sz=18446744073709551608) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:54

#7  0x000015554fd4734c in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#8  0x000015554fd49a63 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#9  0x000015554fd3c016 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#10 0x000015555533e6ba in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdd48, env=env@entry=0x7fffffffdd58) at dl-init.c:72

#11 0x000015555533e7cb in call_init (env=0x7fffffffdd58, argv=0x7fffffffdd48, argc=1, l=<optimized out>) at dl-init.c:30

#12 _dl_init (main_map=0x155555555168, argc=1, argv=0x7fffffffdd48, env=0x7fffffffdd58) at dl-init.c:120

#13 0x000015555532ec6a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2

#14 0x0000000000000001 in ?? ()

#15 0x00007fffffffe0c9 in ?? ()

#16 0x0000000000000000 in ?? ()

(this stack trace is attached as stacktrace2.txt)

 

So it seems that the initialization of libhsa-ext-finalize64.so.1 calls new(sz=18446744073709551608), which throws std::bad_alloc. If I remove libhsa-ext-finalize64.so.1 from LD_PRELOAD, then situation 2 becomes identical to situation 1.
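To make the size value easier to read, here is a small sketch that decodes the sz argument from frame #6 of the second backtrace. The helper as_signed64 is my own name, and the two "possible origins" at the end are only illustrations of arithmetic that would produce this value. They are not taken from the driver source.

```python
# sz as printed by gdb in frame #6 (operator new) of the second backtrace.
SZ = 18446744073709551608


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    value &= (1 << 64) - 1
    return value - (1 << 64) if value >= (1 << 63) else value


print(hex(SZ))           # 0xfffffffffffffff8
print(as_signed64(SZ))   # -8
print(SZ == 2**64 - 8)   # True

# Hypothetical origins (assumptions, not known from the driver):
# an element count of 0 decremented before scaling by 8 bytes,
# or a NULL pointer moved back by 8 bytes.
print(((0 - 1) * 8) % 2**64 == SZ)  # True
print((0 - 8) % 2**64 == SZ)        # True
```

So the request is for "-8 bytes" seen as an unsigned size, which no allocator can satisfy, and operator new reports that as std::bad_alloc.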

 

Can I do anything else to investigate this behaviour? Is it a known bug? new is called with sz=0xfffffffffffffff8, which looks like an uninitialized variable in the driver, or a decremented NULL value.
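One more check that might help narrow this down is loading the library on its own, outside the app. The sketch below does that through Python's ctypes, which calls dlopen() under the hood. The path is copied from the backtraces, try_load is a name I made up, and I am only assuming the failure would reproduce this way.

```python
import ctypes

# Path copied from the backtraces above; adjust it if the driver
# is installed somewhere else.
LIB = "/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1"


def try_load(path):
    """dlopen() path; return (True, "") on success or (False, loader error)."""
    try:
        # Loading runs the library's static initializers, which is
        # where both backtraces end up (call_init in dl-init.c).
        ctypes.CDLL(path)
    except OSError as exc:
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    ok, detail = try_load(LIB)
    print("loaded cleanly" if ok else "load failed: " + detail)
```

If the interpreter itself dies with SIGSEGV or std::bad_alloc during the load, the problem is in the library's own initialization and not in the app. If it loads cleanly, the difference is more likely in what the app has already loaded before that point (for example, the libOpenCL.so from /usr/local/cuda-8.0 that appears in the first backtrace).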
