Hi all,
When I run AMD's prebuilt HPL binary across multiple nodes, the application crashes with signal 11 (segmentation fault):
```
[root@genoa-n01 genoa]# mpirun --allow-run-as-root --hostfile hostfile --map-by socket:PE=32 --bind-to core -N 2 -np 4 -x OMP_NUM_THREADS=32 ./xhpl
[genoa-n02:14896:0:14896] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
[genoa-n02:14895:0:14895] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 14895) ====
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
2 0x00000000000095af numa_bitmask_isbitset() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:173
3 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
4 0x00000000000e7aaf HPL_setup_alloc() ???:0
5 0x0000000000114983 main() ???:0
6 0x000000000003a7e5 __libc_start_main() ???:0
7 0x00000000000e53ae _start() ???:0
=================================
[genoa-n02:14895] *** Process received signal ***
[genoa-n02:14895] Signal: Segmentation fault (11)
[genoa-n02:14895] Signal code: (-6)
[genoa-n02:14895] Failing at address: 0x3a2f
[genoa-n02:14895] [ 0] /lib64/libpthread.so.0(+0x12d20)[0x7f028fc1ed20]
[genoa-n02:14895] [ 1] libnuma.so.1(numa_get_run_node_mask+0xbf)[0x7f02909065af]
[genoa-n02:14895] [ 2] ./xhpl(+0xe7aaf)[0x55796edb0aaf]
[genoa-n02:14895] [ 3] ./xhpl(+0x114983)[0x55796eddd983]
[genoa-n02:14895] [ 4] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f028f8707e5]
[genoa-n02:14895] [ 5] ./xhpl(+0xe53ae)[0x55796edae3ae]
[genoa-n02:14895] *** End of error message ***
==== backtrace (tid: 14896) ====
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
2 0x00000000000095af numa_bitmask_isbitset() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:173
3 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
4 0x00000000000e7aaf HPL_setup_alloc() ???:0
5 0x0000000000114983 main() ???:0
6 0x000000000003a7e5 __libc_start_main() ???:0
7 0x00000000000e53ae _start() ???:0
=================================
[genoa-n02:14896] *** Process received signal ***
[genoa-n02:14896] Signal: Segmentation fault (11)
[genoa-n02:14896] Signal code: (-6)
[genoa-n02:14896] Failing at address: 0x3a30
[genoa-n02:14896] [ 0] /lib64/libpthread.so.0(+0x12d20)[0x7f22c66fad20]
[genoa-n02:14896] [ 1] libnuma.so.1(numa_get_run_node_mask+0xbf)[0x7f22c73e25af]
[genoa-n02:14896] [ 2] ./xhpl(+0xe7aaf)[0x559fed2d7aaf]
[genoa-n02:14896] [ 3] ./xhpl(+0x114983)[0x559fed304983]
[genoa-n02:14896] [ 4] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f22c634c7e5]
[genoa-n02:14896] [ 5] ./xhpl(+0xe53ae)[0x559fed2d53ae]
[genoa-n02:14896] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 14896 on node genoa-n02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
It's worth noting that a single-node run completes without any issues after enabling AOCC and AOCL.
CPU is AMD EPYC 9334 32-Core Processor.
Fabric is InfiniBand NDR200.
System is running RHEL 8.10 with the following packages:
```
aocc-compiler-5.0.0-1.x86_64
aocl-linux-aocc-5.0.0-1.x86_64
doca-ofed-2.9.1-0.1.9.x86_64
openmpi-4.1.7rc1-1.2410068.x86_64
ucx-1.18.0-1.2410068.x86_64
xpmem-2.7.4-1.2410068.rhel8u10.x86_64
ucx-xpmem-1.18.0-1.2410068.x86_64
```
AMD binary release is:
`amd-zen-hpl-2024_10_08.tar.gz`
Running `ldd` against `xhpl` shows that everything seems to be loaded correctly:
```
[root@genoa-n01 genoa]# ldd xhpl
linux-vdso.so.1 (0x00007ffe73fd0000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f8c637af000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00007f8c6342d000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f8c63229000)
libmpi.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libmpi.so.40 (0x00007f8c62efc000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f8c62cdc000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f8c62906000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8c6445a000)
libopen-rte.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-rte.so.40 (0x00007f8c6264f000)
libopen-pal.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-pal.so.40 (0x00007f8c6235f000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f8c620c3000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00007f8c61ebb000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f8c61cb7000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00007f8c61a9f000)
libevent_core-2.1.so.6 => /usr/lib64/libevent_core-2.1.so.6 (0x00007f8c61866000)
libevent_pthreads-2.1.so.6 => /usr/lib64/libevent_pthreads-2.1.so.6 (0x00007f8c61663000)
libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f8c61409000)
libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f8c611f1000)
libcrypto.so.1.1 => /usr/lib64/libcrypto.so.1.1 (0x00007f8c60d06000)
libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f8c60ab3000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f8c608ab000)
libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f8c60680000)
libpcre2-8.so.0 => /usr/lib64/libpcre2-8.so.0 (0x00007f8c603fc000)
```
The exception is `libnuma.so.1`, which ships in the same .tar.gz as the binary release but resolves to the system copy by default. Even after I prepend the release directory to LD_LIBRARY_PATH, so that `ldd` shows the bundled copy being picked up, it still crashes:
```
[root@genoa-n01 genoa]# LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH ldd xhpl
linux-vdso.so.1 (0x00007ffe752f0000)
libnuma.so.1 => /home/hpl/genoa/libnuma.so.1 (0x00007f08ccf3a000)
libm.so.6 => /usr/lib64/libm.so.6 (0x00007f08cbefb000)
libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f08cbcf7000)
libmpi.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libmpi.so.40 (0x00007f08cb9ca000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f08cb7aa000)
libc.so.6 => /usr/lib64/libc.so.6 (0x00007f08cb3d4000)
/lib64/ld-linux-x86-64.so.2 (0x00007f08ccd1c000)
libopen-rte.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-rte.so.40 (0x00007f08cb11d000)
libopen-pal.so.40 => /usr/mpi/gcc/openmpi-4.1.7rc1/lib64/libopen-pal.so.40 (0x00007f08cae2d000)
libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007f08cab91000)
librt.so.1 => /usr/lib64/librt.so.1 (0x00007f08ca989000)
libutil.so.1 => /usr/lib64/libutil.so.1 (0x00007f08ca785000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00007f08ca56d000)
libevent_core-2.1.so.6 => /usr/lib64/libevent_core-2.1.so.6 (0x00007f08ca334000)
libevent_pthreads-2.1.so.6 => /usr/lib64/libevent_pthreads-2.1.so.6 (0x00007f08ca131000)
libmount.so.1 => /usr/lib64/libmount.so.1 (0x00007f08c9ed7000)
libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f08c9cbf000)
libcrypto.so.1.1 => /usr/lib64/libcrypto.so.1.1 (0x00007f08c97d4000)
libblkid.so.1 => /usr/lib64/libblkid.so.1 (0x00007f08c9581000)
libuuid.so.1 => /usr/lib64/libuuid.so.1 (0x00007f08c9379000)
libselinux.so.1 => /usr/lib64/libselinux.so.1 (0x00007f08c914e000)
libpcre2-8.so.0 => /usr/lib64/libpcre2-8.so.0 (0x00007f08c8eca000)
```
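Both `ldd` checks above were run only on genoa-n01, in my interactive shell. To compare what the ranks on each node would actually resolve, something like the sketch below could be used. The helper name `resolved_lib` and the file name `resolved_lib.sh` are hypothetical, and the per-node command in the comments assumes the release directory is at the same path on every node and that LD_LIBRARY_PATH has to be forwarded explicitly with `-x`; I have not verified that this is the cause.

```shell
#!/bin/sh
# resolved_lib NAME: read `ldd` output on stdin and print the path that
# NAME resolves to. Hypothetical helper, not part of any package above.
resolved_lib() {
    awk -v lib="$1" '$1 == lib && $2 == "=>" { print $3 }'
}

# Local check, run from the directory that holds xhpl:
#   ldd ./xhpl | resolved_lib libnuma.so.1
#
# Per-node check (assumption: this file is saved as resolved_lib.sh next to
# xhpl on every node, and LD_LIBRARY_PATH is exported to the ranks with -x):
#   mpirun --allow-run-as-root --hostfile hostfile -N 1 -x LD_LIBRARY_PATH \
#       sh -c '. ./resolved_lib.sh; echo "$(hostname): $(ldd ./xhpl | resolved_lib libnuma.so.1)"'
```

If the two nodes print different paths for `libnuma.so.1`, the local `ldd` output does not reflect what the ranks on genoa-n02 load.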
Anyway, according to the backtrace, the problem seems to be in `libnuma.so.1`, since that is where it crashes:
```
0 0x0000000000012d20 __funlockfile() :0
1 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
2 0x00000000000095af numa_bitmask_isbitset() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:173
3 0x00000000000095af numa_get_run_node_mask_v2() /tmp/root/spack-stage/spack-stage-numactl-2.0.18-z6bevjx657k2c3sm3kznh5y6b37zayiq/spack-src/libnuma.c:1809
4 0x00000000000e7aaf HPL_setup_alloc() ???:0
5 0x0000000000114983 main() ???:0
6 0x000000000003a7e5 __libc_start_main() ???:0
7 0x00000000000e53ae _start() ???:0
```
What am I missing here? Why does it work on a single node but crash as soon as the job spans more than one node?