ROCm Discussions

gmwi9Q
Journeyman III

Problem on some MI60s after installing amdgpu/rocm


Hello
I wonder if anything can be done about the problem below.

Briefly, I have three MI60s (bought new) and a used MI50 (from a private
seller). Each card seats correctly in an HP server (details below) and, even
with no OS installed, appears correctly in the server's hardware inventory.

After installing Ubuntu 22.04.4 on the server, followed by ROCm, only two of
the cards work correctly (one MI60 and the MI50).

When either of the working cards is replaced with one of the two remaining MI60s,
this happens during boot:

---- cut-n-paste
Απρ 08 14:45:37 aa-ML350 kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Απρ 08 14:45:37 aa-ML350 kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 4EC8 (len 74, WS 0, >
Απρ 08 14:45:37 aa-ML350 kernel: amdgpu 0000:8c:00.0: amdgpu: gpu post error!
Απρ 08 14:45:37 aa-ML350 kernel: amdgpu 0000:8c:00.0: amdgpu: Fatal error during GPU init
Απρ 08 14:45:37 aa-ML350 kernel: amdgpu 0000:8c:00.0: amdgpu: amdgpu: finishing device.
Απρ 08 14:45:37 aa-ML350 kernel: x86/PAT: kworker/7:9:245 freeing invalid memtype [mem 0x00000000-0xffffffffffffffff]
Απρ 08 14:45:37 aa-ML350 kernel: amdgpu: probe of 0000:8c:00.0 failed with error -22
---- end
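
For anyone who wants to check their own kernel log for the same failure, here is a minimal sketch that pulls out the PCI address and error code from each "probe of ... failed" line like the last one above. It is not part of any AMD tool; the function name `failed_probes` and the regular expression are my own assumptions, based only on the log format shown here.

```python
import re

# Matches the final line the kernel prints when the amdgpu probe gives up,
# e.g. "amdgpu: probe of 0000:8c:00.0 failed with error -22".
PROBE_FAIL = re.compile(
    r"amdgpu: probe of "
    r"(?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r" failed with error (?P<errno>-?\d+)"
)


def failed_probes(log_text):
    """Return {pci_address: error_code} for every amdgpu probe failure found."""
    failures = {}
    for line in log_text.splitlines():
        match = PROBE_FAIL.search(line)
        if match:
            failures[match.group("bdf")] = int(match.group("errno"))
    return failures


# Two lines copied from the log above, used as sample input.
sample = (
    "Απρ 08 14:45:37 aa-ML350 kernel: amdgpu 0000:8c:00.0: amdgpu: gpu post error!\n"
    "Απρ 08 14:45:37 aa-ML350 kernel: amdgpu: probe of 0000:8c:00.0 failed with error -22\n"
)
for address, code in failed_probes(sample).items():
    print(f"{address}: probe failed with error {code}")
```

The text passed in can be the saved output of `journalctl -k -b`. Error -22 corresponds to EINVAL on Linux, which by itself does not say why the GPU POST failed.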

I have tried different PCIe slots.
Is any additional information or testing needed?
Any ideas?
Thank you.
--

Server: HP ML350 G9 (PCIe Gen 3)
OS: Ubuntu 22.04.4 Server
amdgpu from: apt-get install ./amdgpu-install_5.4.50403-1_all.deb
usecases: amdgpu-install --usecase=rocm,opencl,openclsdk,hip,hiplibsdk
