jordan44665
Journeyman III

Cannot run with 2 W6800 GPUs on Linux 5.15.0-67 kernel

I am trying to run two W6800 GPUs on Linux kernel 5.15.0-67. I have installed ROCm and the HIP SDK, and everything works with one W6800, but when I add a second W6800, the kernel fails with amdgpu errors. Here are the last few lines of the dmesg output.

-----------------------------------------------------------------------------------

[ 84.066981] RIP: 0010:hubbub2_get_dchub_ref_freq+0xa3/0xc0 [amdgpu]
[ 84.067715] dcn30_init_hw+0x60a/0x980 [amdgpu]
[ 84.068379] ? amdgpu_dm_dmub_reg_read+0x23/0x30 [amdgpu]
[ 84.069050] dc_set_power_state+0x120/0x180 [amdgpu]
[ 84.069743] dm_resume+0xe0/0x890 [amdgpu]
[ 84.070407] amdgpu_device_ip_resume_phase2+0xca/0x200 [amdgpu]
[ 84.070736] amdgpu_device_resume+0xbf/0x230 [amdgpu]
[ 84.071067] amdgpu_pmops_runtime_resume+0x92/0xf0 [amdgpu]
[ 84.194075] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 84.306116] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 84.418093] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 84.530101] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 84.642206] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 84.754270] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[ 84.865205] amdgpu 0000:b3:00.0: amdgpu: rlc autoload: gc ucode autoload timeout
[ 84.865213] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v10_0> failed -110
[ 84.865620] amdgpu 0000:b3:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
---------------------------------------------------------------------------------------------
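
For reference, the -110 in the last two lines is the negated Linux errno ETIMEDOUT, which is consistent with the "autoload timeout" message just above them. Below is a minimal Python sketch for pulling the failing IP block and error name out of a saved dmesg dump. The function name `failed_ip_blocks` and the small errno table are only an illustration and are not part of any ROCm or amdgpu tool.

```python
import re

# Small excerpt of Linux errno values; the kernel reports them negated (e.g. -110).
LINUX_ERRNO = {5: "EIO", 12: "ENOMEM", 16: "EBUSY", 22: "EINVAL", 110: "ETIMEDOUT"}

# Matches lines such as:
#   ... *ERROR* resume of IP block <gfx_v10_0> failed -110
RESUME_FAIL = re.compile(r"resume of IP block <(?P<block>[^>]+)> failed (?P<code>-?\d+)")


def failed_ip_blocks(dmesg_text):
    """Return (ip_block, errno_name) pairs for each failed IP block resume."""
    results = []
    for line in dmesg_text.splitlines():
        match = RESUME_FAIL.search(line)
        if match:
            code = abs(int(match.group("code")))
            name = LINUX_ERRNO.get(code, f"errno {code}")
            results.append((match.group("block"), name))
    return results


# One line copied from the dmesg output above.
sample = (
    "[ 84.865213] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] "
    "*ERROR* resume of IP block <gfx_v10_0> failed -110"
)
print(failed_ip_blocks(sample))
```

The same function can be fed the full text of a saved dmesg log to list every IP block that failed to resume.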

And rocminfo fails with this error:
--------------------------------------------------------------------
ROCk module is loaded
hsa api call failure at: /long_pathname_so_that_rpms_can_package_the_debug_info/src/rocminfo/rocminfo.cc:1140
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
---------------------------------------------------------------------------

Any ideas on how I can work around this issue? Thanks.
