Server Processors

ferrao
Journeyman III

Cannot run AMD’s optimized version of HPL on more than one host

Hi all,

 

 

When I try to run this prebuilt HPL binary from AMD on multiple nodes, the application crashes with SIGSEGV (signal 11, segmentation fault):

 

[root@genoa-n01 genoa]# mpirun --allow-run-as-root --hostfile hostfile --map-by socket:PE=32 --bind-to core -N 2 -np 4 -x OMP_NUM_THREADS=32 ./xhpl
[genoa-n02:14896:0:14896] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[genoa-n02:14895:0:14895] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 14895) ====
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
2 0x00000000000095af numa_bitmask_isbitset() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:173
3 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
4 0x00000000000e7aaf HPL_setup_alloc() ???:0
5 0x0000000000114983 main() ???:0
6 0x000000000003a7e5 __libc_start_main() ???:0
7 0x00000000000e53ae _start() ???:0
=================================
[genoa-n02:14895] *** Process received signal ***
[genoa-n02:14895] Signal: Segmentation fault (11)
[genoa-n02:14895] Signal code: (-6)
[genoa-n02:14895] Failing at address: 0x3a2f
[genoa-n02:14895] [ 0] /lib64/libpthread.so.0(+0x12d20)[0x7f028fc1ed20]
[genoa-n02:14895] [ 1] libnuma.so.1(numa_get_run_node_mask+0xbf)[0x7f02909065af]
[genoa-n02:14895] [ 2] ./xhpl(+0xe7aaf)[0x55796edb0aaf]
[genoa-n02:14895] [ 3] ./xhpl(+0x114983)[0x55796eddd983]
[genoa-n02:14895] [ 4] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f028f8707e5]
[genoa-n02:14895] [ 5] ./xhpl(+0xe53ae)[0x55796edae3ae]
[genoa-n02:14895] *** End of error message ***
==== backtrace (tid: 14896) ====
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
2 0x00000000000095af numa_bitmask_isbitset() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:173
3 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
4 0x00000000000e7aaf HPL_setup_alloc() ???:0
5 0x0000000000114983 main() ???:0
6 0x000000000003a7e5 __libc_start_main() ???:0
7 0x00000000000e53ae _start() ???:0
=================================
[genoa-n02:14896] *** Process received signal ***
[genoa-n02:14896] Signal: Segmentation fault (11)
[genoa-n02:14896] Signal code: (-6)
[genoa-n02:14896] Failing at address: 0x3a30
[genoa-n02:14896] [ 0] /lib64/libpthread.so.0(+0x12d20)[0x7f22c66fad20]
[genoa-n02:14896] [ 1] libnuma.so.1(numa_get_run_node_mask+0xbf)[0x7f22c73e25af]
[genoa-n02:14896] [ 2] ./xhpl(+0xe7aaf)[0x559fed2d7aaf]
[genoa-n02:14896] [ 3] ./xhpl(+0x114983)[0x559fed304983]
[genoa-n02:14896] [ 4] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f22c634c7e5]
[genoa-n02:14896] [ 5] ./xhpl(+0xe53ae)[0x559fed2d53ae]
[genoa-n02:14896] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 14896 on node genoa-n02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
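
To make it easier to see which hosts the crashing ranks ran on, here is a minimal sketch that summarizes the "Caught signal 11" lines from the output. The function name `crashed_hosts` is only illustrative, and it assumes the line format shown above.

```shell
# Hypothetical helper: read mpirun output on stdin and print each host that
# reported "Caught signal 11", followed by the number of crashed ranks on it.
crashed_hosts() {
    # Lines of interest look like:
    #   [genoa-n02:14896:0:14896] Caught signal 11 (Segmentation fault: ...)
    grep -E '^\[[^]:]+:[0-9]+:[0-9]+:[0-9]+\] Caught signal 11' \
        | sed -E 's/^\[([^]:]+):.*/\1/' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Saving the run with `2>&1 | tee run.log` and then calling `crashed_hosts < run.log` on the log above reports only genoa-n02, with two ranks. Nothing in the output shows a rank on genoa-n01 (the node where `mpirun` was started) hitting the segfault, although those ranks may simply have been terminated when the job was aborted.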

 

It's worth noting that single-node runs complete without any issues after enabling AOCC and AOCL.

 

CPU is AMD EPYC 9334 32-Core Processor.
Fabric is InfiniBand NDR200.

 

System is running RHEL 8.10 with the following packages:
```
aocc-compiler-5.0.0-1.x86_64
aocl-linux-aocc-5.0.0-1.x86_64
doca-ofed-2.9.1-0.1.9.x86_64
openmpi-4.1.7rc1-1.2410068.x86_64
ucx-1.18.0-1.2410068.x86_64
xpmem-2.7.4-1.2410068.rhel8u10.x86_64
ucx-xpmem-1.18.0-1.2410068.x86_64
```

 

AMD binary release is:
`amd-zen-hpl-2024_10_08.tar.gz`

 

Running `ldd` against `xhpl` shows that everything seems to be loaded correctly:
[root@genoa-n01 genoa]# ldd xhpl
linux-vdso.so.1 (0x00007ffe73fd0000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f8c637af000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00007f8c6342d000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f8c63229000)
libmpi.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libmpi.so.40 (0x00007f8c62efc000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f8c62cdc000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f8c62906000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8c6445a000)
libopen-rte.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-rte.so.40 (0x00007f8c6264f000)
libopen-pal.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-pal.so.40 (0x00007f8c6235f000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f8c620c3000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00007f8c61ebb000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f8c61cb7000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00007f8c61a9f000)
libevent_core-2.1.so.6 => /usr/lib64/libevent_core-2.1.so.6 (0x00007f8c61866000)
libevent_pthreads-2.1.so.6 => /usr/lib64/libevent_pthreads-2.1.so.6 (0x00007f8c61663000)
libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f8c61409000)
libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f8c611f1000)
libcrypto.so.1.1 => /usr/lib64/libcrypto.so.1.1 (0x00007f8c60d06000)
libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f8c60ab3000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f8c608ab000)
libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f8c60680000)
libpcre2-8.so.0 => /usr/lib64/libpcre2-8.so.0 (0x00007f8c603fc000)

 

The one exception is `libnuma.so.1`: a copy is shipped in the same .tar.gz file as the binary release, but by default the system copy in /usr/lib64 is the one that gets resolved. Even after fixing `LD_LIBRARY_PATH`, so that `ldd` shows the bundled copy being loaded, it still crashes.
[root@genoa-n01 genoa]# LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH ldd xhpl
linux-vdso.so.1 (0x00007ffe752f0000)
libnuma.so.1 => /home/hpl/genoa/libnuma.so.1 (0x00007f08ccf3a000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00007f08cbefb000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f08cbcf7000)
libmpi.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libmpi.so.40 (0x00007f08cb9ca000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f08cb7aa000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f08cb3d4000)
/lib64/ld-linux-x86-64.so.2 (0x00007f08ccd1c000)
libopen-rte.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-rte.so.40 (0x00007f08cb11d000)
libopen-pal.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-pal.so.40 (0x00007f08cae2d000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f08cab91000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00007f08ca989000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f08ca785000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00007f08ca56d000)
libevent_core-2.1.so.6 => /usr/lib64/libevent_core-2.1.so.6 (0x00007f08ca334000)
libevent_pthreads-2.1.so.6 => /usr/lib64/libevent_pthreads-2.1.so.6 (0x00007f08ca131000)
libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f08c9ed7000)
libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f08c9cbf000)
libcrypto.so.1.1 => /usr/lib64/libcrypto.so.1.1 (0x00007f08c97d4000)
libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f08c9581000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f08c9379000)
libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f08c914e000)
libpcre2-8.so.0 => /usr/lib64/libpcre2-8.so.0 (0x00007f08c8eca000)
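
One caveat with the check above: it only describes the shell on genoa-n01. As far as I understand, Open MPI's `mpirun` forwards an environment variable to the ranks only when it is named with `-x` (as the launch command already does for `OMP_NUM_THREADS`), so the ranks on genoa-n02 may still resolve a different `libnuma.so.1`. Below is a rough sketch for comparing what each node resolves. The helper name `libnuma_by_host` and the host-prefixed input format are my own assumptions, not something from the AMD package.

```shell
# Hypothetical helper: read lines of the form "<host> <ldd output line>" on
# stdin and print "<host> <resolved path>" for libnuma.so.1 only.
libnuma_by_host() {
    awk '$2 == "libnuma.so.1" && $3 == "=>" { print $1, $4 }'
}

# Possible usage on the cluster (not executed here), one rank per node, with
# each line of ldd output prefixed by the name of the host that produced it:
#   mpirun --allow-run-as-root --hostfile hostfile -N 1 -np 2 -x LD_LIBRARY_PATH \
#       sh -c 'ldd ./xhpl | sed "s/^/$(hostname) /"' | libnuma_by_host
```

If the two nodes print different paths, the `LD_LIBRARY_PATH` change is not reaching the remote ranks, and adding `-x LD_LIBRARY_PATH` to the real `xhpl` launch would be the next thing to try.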

 

Anyway, according to the backtrace, the issue seems to be in `libnuma.so.1`, since that is where the crash occurs:

 

```
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
2 0x00000000000095af numa_bitmask_isbitset() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:173
3 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
4 0x00000000000e7aaf HPL_setup_alloc() ???:0
5 0x0000000000114983 main() ???:0
6 0x000000000003a7e5 __libc_start_main() ???:0
7 0x00000000000e53ae _start() ???:0
```
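
A side note on reading the Open MPI signal report earlier in this post: "Signal code: (-6)" corresponds to `SI_TKILL` on Linux. As far as I can tell, that means the signal was re-raised by a handler rather than reported directly by the fault, so the "Failing at address" field holds the PID of the sending process, not a memory address. The real fault address would then be the `(nil)` on the "Caught signal 11" lines, which points to a NULL pointer dereference inside `libnuma`. A quick check of the arithmetic, with `hex_to_dec` as an illustrative helper name:

```shell
# Convert a hexadecimal "Failing at address" value to decimal so it can be
# compared with the PIDs printed in the same log.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x3a2f   # prints 14895, the PID of the first crashed rank
hex_to_dec 0x3a30   # prints 14896, the PID of the second crashed rank
```

Both values match the PIDs in the log, which is consistent with this reading.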



What am I missing here? Why does it work in single-node mode but crash when it spawns across more than one node?
1 Reply
ajayrant
Staff

Hi @ferrao 

Thanks for writing to the serverguru forum.

We are currently investigating your issue on our end and will keep you updated.

 

Thanks & Regards
Ajay
