
OpenCL

polarnick
Adept I

Ubuntu + Vega: std::bad_alloc on libhsa-ext-finalize64 dynload

Hi!

My setup:

- Ubuntu 16.04 + RX Vega 56 (gfx900)

- Driver: amdgpu-pro-17.50-511655 (installed with amdgpu-pro-install -y --opencl=rocm)

1. My app works fine until it calls ocl_init() in CLEW. That call crashes with SIGSEGV:

Thread 1 "app" received signal SIGSEGV, Segmentation fault.

0x00001554d599d0f8 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

(gdb) bt

#0  0x00001554d599d0f8 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#1  0x00001554d598f016 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#2  0x000015555533e6ba in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdd88, env=env@entry=0x2f26b70) at dl-init.c:72

#3  0x000015555533e7cb in call_init (env=0x2f26b70, argv=0x7fffffffdd88, argc=1, l=<optimized out>) at dl-init.c:30

#4  _dl_init (main_map=main_map@entry=0x36ed6c0, argc=1, argv=0x7fffffffdd88, env=0x2f26b70) at dl-init.c:120

#5  0x00001555553438e2 in dl_open_worker (a=a@entry=0x7fffffff9d30) at dl-open.c:575

#6  0x000015555533e564 in _dl_catch_error (objname=objname@entry=0x7fffffff9d20, errstring=errstring@entry=0x7fffffff9d28, mallocedp=mallocedp@entry=0x7fffffff9d1f,

    operate=operate@entry=0x1555553434d0 <dl_open_worker>, args=args@entry=0x7fffffff9d30) at dl-error.c:187

#7  0x0000155555342da9 in _dl_open (file=0x36ed548 "libhsa-ext-finalize64.so.1", mode=-2147483647, caller_dlopen=0x1555509bcad2, nsid=-2, argc=<optimized out>, argv=<optimized out>, env=0x2f26b70)

    at dl-open.c:660

#8  0x000015554f9d8f09 in dlopen_doit (a=a@entry=0x7fffffff9f60) at dlopen.c:66

#9  0x000015555533e564 in _dl_catch_error (objname=0x2eb26d0, errstring=0x2eb26d8, mallocedp=0x2eb26c8, operate=0x15554f9d8eb0 <dlopen_doit>, args=0x7fffffff9f60) at dl-error.c:187

#10 0x000015554f9d9571 in _dlerror_run (operate=operate@entry=0x15554f9d8eb0 <dlopen_doit>, args=args@entry=0x7fffffff9f60) at dlerror.c:163

#11 0x000015554f9d8fa1 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87

#12 0x00001555509bcad2 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#13 0x00001555509c2737 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#14 0x00001555509c7cdd in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#15 0x00001555509aeb4a in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1

#16 0x00001555513fd1bd in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#17 0x00001555513cf503 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#18 0x00001555513eb6b7 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#19 0x00001555513b79e2 in clIcdGetPlatformIDsKHR () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so

#20 0x00001554d6dfc1f2 in ?? () from /usr/local/cuda-8.0/lib64/libOpenCL.so

#21 0x00001554d6dfde82 in ?? () from /usr/local/cuda-8.0/lib64/libOpenCL.so

#22 0x00001554d6dfc6c1 in clGetPlatformIDs () from /usr/local/cuda-8.0/lib64/libOpenCL.so

#23 0x000000000186ccd2 in OpenCLEnum::enumPlatforms (this=this@entry=0x7fffffffa4b0) at <opencl/enum.cpp>:39

... (the full stack trace is attached as stracktrace1.txt)

After investigating with strace, I was able to reduce the problem to this:

2. When my app is launched like this:

LD_PRELOAD=/opt/amdgpu-pro/lib/x86_64-linux-gnu/libGL.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl-rocr64.so:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-runtime64.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsakmt.so.1:/opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1 LD_LIBRARY_PATH=$dirname:/usr/lib/x86_64-linux-gnu/debug/ ./app

It fails on launch:

terminate called after throwing an instance of 'std::bad_alloc'

  what():  std::bad_alloc

Program received signal SIGABRT, Aborted.

0x000015554c8d4428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54

54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt

#0  0x000015554c8d4428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54

#1  0x000015554c8d602a in __GI_abort () at abort.c:89

#2  0x000015554d449d6d in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95

#3  0x000015554d447bd6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:47

#4  0x000015554d447c21 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:57

#5  0x000015554d447e39 in __cxxabiv1::__cxa_throw (obj=0x155540000940, tinfo=0x15554d74eb20 <typeinfo for std::bad_alloc>, dest=0x15554d446010 <std::bad_alloc::~bad_alloc()>)

    at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:87

#6  0x000015554d4483dc in operator new (sz=18446744073709551608) at ../../../../src/libstdc++-v3/libsupc++/new_op.cc:54

#7  0x000015554fd4734c in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#8  0x000015554fd49a63 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#9  0x000015554fd3c016 in ?? () from /opt/amdgpu-pro/lib/x86_64-linux-gnu/libhsa-ext-finalize64.so.1

#10 0x000015555533e6ba in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdd48, env=env@entry=0x7fffffffdd58) at dl-init.c:72

#11 0x000015555533e7cb in call_init (env=0x7fffffffdd58, argv=0x7fffffffdd48, argc=1, l=<optimized out>) at dl-init.c:30

#12 _dl_init (main_map=0x155555555168, argc=1, argv=0x7fffffffdd48, env=0x7fffffffdd58) at dl-init.c:120

#13 0x000015555532ec6a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2

#14 0x0000000000000001 in ?? ()

#15 0x00007fffffffe0c9 in ?? ()

#16 0x0000000000000000 in ?? ()

(this stack trace is attached as stacktrace2.txt)

So it seems that the initialization of libhsa-ext-finalize64.so.1 calls operator new with sz=18446744073709551608, which throws std::bad_alloc. If I remove libhsa-ext-finalize64.so.1 from LD_PRELOAD, situation 2 becomes identical to situation 1.

Can I do anything else to investigate this behaviour? Is it a known bug? operator new is called with sz=0xfffffffffffffff8, which looks like an uninitialized variable in the driver, or a NULL value that was decremented.
polarnick
Adept I

When I switched my app's compiler from GCC 4.7.4 to GCC 4.4.7, the problem went away.

I also regularly encounter system hangs, about once per hour on average: more often when I view SketchFab models, less often when I don't stress the GPU with OpenGL/OpenCL.

So in my case the Ubuntu 16.04 + amdgpu-pro 17.50 + RX Vega 56 stack is unusable.

But the same computer with the same Vega works great with Windows 7 + Radeon™ Software Adrenalin Edition 18.2.1.

Hi Nikolay,

Thanks for reporting it. As the error comes from an internal library, could you please share a repro so that the relevant team can investigate it on our end?

If you are concerned about privacy, you may send the package directly via email or another sharing option.

Regards,


I sent you a reproducer via PM. Sorry for the delay.


Thank you. I've opened a ticket and shared the repro for investigation.


Update:

The relevant team has informed me that amdgpu-pro 17.50 does not yet support GCC 4.7, which is why the crash occurs.

Regards,


Thanks for the update!

Which GCC versions are supported?

How does the compiler version of the client app affect the driver? This is not CUDA, which does not behave like a standard third-party library: it has its own compiler for .cu files that also compiles the host-side part of those files, so the compiler version of the whole application is constrained. OpenCL provides a great level of abstraction from the driver, so what went wrong here?


Just to let you know, I've already forwarded your questions to the relevant team and asked them to share more information. Once I get their reply, I'll get back to you.
