I am trying to install ROCm to use with my new W6800, and I am getting this error as soon as Linux loads the amdgpu driver:
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 8
[Thu Jan 19 14:24:29 2023] amdgpu 0000:83:00.0: amdgpu: Using BACO for runtime pm
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: event severity: recoverable
[Thu Jan 19 14:24:29 2023] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:83:00.0 on minor 1
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: Error 0, type: fatal
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: section_type: PCIe error
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: port_type: 1, legacy PCI end point
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: version: 3.0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: command: 0x0547, status: 0x4810
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: device_id: 0000:83:00.0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: slot: 0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: secondary_bus: 0x00
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: vendor_id: 0x1002, device_id: 0x73a3
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: class_code: 000000
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: aer_uncor_status: 0x00008000, aer_uncor_mask: 0x00010000
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: aer_uncor_severity: 0x004ef030
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]: TLP Header: 00009001 8000220f 99269934 00000000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_status: 0x00008000, aer_mask: 0x00010000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: [15] CmpltAbrt (First)
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_layer=Transaction Layer, aer_agent=Completer ID
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_uncor_severity: 0x004ef030
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: TLP Header: 00009001 8000220f 99269934 00000000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: [drm] fb1: amdgpudrmfb frame buffer device
[Thu Jan 19 14:24:30 2023] [drm] PCI error: detected callback, state(2)!!
[Thu Jan 19 14:24:30 2023] snd_hda_intel 0000:83:00.1: AER: can't recover (no error_detected callback)
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_PGFSM_STATUS] failed to reach value 0x00800000 != 0x00c00000
[Thu Jan 19 14:24:30 2023] [drm:jpeg_v3_0_set_powergating_state.cold [amdgpu]] *ERROR* amdgpu: JPEG enable power gating failed
[Thu Jan 19 14:24:30 2023] [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <jpeg_v3_0> failed -110
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[Thu Jan 19 14:24:31 2023] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] pcieport 0000:82:00.0: AER: Downstream Port link has been reset (0)
[Thu Jan 19 14:24:31 2023] pcieport 0000:82:00.0: AER: device recovery failed
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:34 param:0x00000001 message:SetWorkloadMask?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
[Thu Jan 19 14:24:33 2023] amdgpu 0000:83:00.0: amdgpu: Failed to disable gfxoff!
After this, anything that tries to talk to the GPU hangs indefinitely, including lspci. The only way to get the machine semi-functional again is to blacklist the amdgpu module and reboot.
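For reference, this is how I blacklist the module (a minimal sketch; the initramfs rebuild command assumes a Debian/Ubuntu-style setup, adjust for your distro):

# /etc/modprobe.d/blacklist-amdgpu.conf
blacklist amdgpu

# rebuild the initramfs so the module is not pulled in at boot, then reboot
sudo update-initramfs -u
sudo reboot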
Here is the output of lspci -vv for my GPU (before the driver is loaded):
83:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 GL-XL [Radeon PRO W6800] (prog-if 00 [VGA controller])
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0e1e
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 255
NUMA node: 1
IOMMU group: 72
Region 0: Memory at 71d90000000 (64-bit, prefetchable) [disabled] [size=256M]
Region 2: Memory at 71da0000000 (64-bit, prefetchable) [disabled] [size=2M]
Region 4: I/O ports at 5000 [disabled] [size=256]
Region 5: Memory at 99200000 (32-bit, non-prefetchable) [disabled] [size=1M]
Expansion ROM at 99320000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 16GT/s (ok), Width x16 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
CEMsk: RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [200 v1] Physical Resizable BAR
BAR 0: current size: 256MB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
Capabilities: [240 v1] Power Budgeting <?>
Capabilities: [270 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [2a0 v1] Access Control Services
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
Capabilities: [2d0 v1] Process Address Space ID (PASID)
PASIDCap: Exec+ Priv+, Max PASID Width: 10
PASIDCtl: Enable- Exec- Priv-
Capabilities: [320 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
Capabilities: [440 v1] Lane Margining at the Receiver <?>
Kernel modules: amdgpu
83:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
I solved this by disabling runpm; it seems the most recent release is missing some patches: https://gitlab.freedesktop.org/drm/amd/-/issues/2358
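In case anyone needs the exact steps, this is roughly what I did (assuming a GRUB-based distro; file paths may differ on yours):

# add amdgpu.runpm=0 to the kernel command line in /etc/default/grub, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.runpm=0"

# regenerate the GRUB config and reboot
sudo update-grub
sudo reboot

# alternatively, set it as a module option instead of a boot parameter:
# /etc/modprobe.d/amdgpu.conf
options amdgpu runpm=0

With runpm=0 the driver no longer tries to use BACO for runtime power management (the "Using BACO for runtime pm" line in the log above), and for me the card then initializes normally.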
Not sure if the AMD Moderator for Professional GPU cards @fsadough can assist you with this or not.
You can also open a thread at the GitHub ROCm discussions page and ask there:
https://github.com/RadeonOpenCompute/ROCm/discussions
https://community.amd.com/t5/rocm/ct-p/amd-rocm (you may not get many replies on this forum).
Thanks, I will give that a try.