
ROCm Discussions

JoJoMan
Adept I

W6800 throws hardware error on amdgpu load

I am trying to install ROCm to use with my new W6800, and I get this error as soon as Linux loads the amdgpu driver:

[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 8
[Thu Jan 19 14:24:29 2023] amdgpu 0000:83:00.0: amdgpu: Using BACO for runtime pm
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]: event severity: recoverable
[Thu Jan 19 14:24:29 2023] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:83:00.0 on minor 1
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:  Error 0, type: fatal
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   section_type: PCIe error
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   port_type: 1, legacy PCI end point
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   version: 3.0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   command: 0x0547, status: 0x4810
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   device_id: 0000:83:00.0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   slot: 0
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   secondary_bus: 0x00
[Thu Jan 19 14:24:29 2023] {1}[Hardware Error]:   vendor_id: 0x1002, device_id: 0x73a3
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]:   class_code: 000000
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]:   aer_uncor_status: 0x00008000, aer_uncor_mask: 0x00010000
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]:   aer_uncor_severity: 0x004ef030
[Thu Jan 19 14:24:30 2023] {1}[Hardware Error]:   TLP Header: 00009001 8000220f 99269934 00000000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_status: 0x00008000, aer_mask: 0x00010000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0:    [15] CmpltAbrt              (First)
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_layer=Transaction Layer, aer_agent=Completer ID
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER: aer_uncor_severity: 0x004ef030
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: AER:   TLP Header: 00009001 8000220f 99269934 00000000
[Thu Jan 19 14:24:30 2023] amdgpu 0000:83:00.0: [drm] fb1: amdgpudrmfb frame buffer device
[Thu Jan 19 14:24:30 2023] [drm] PCI error: detected callback, state(2)!!
[Thu Jan 19 14:24:30 2023] snd_hda_intel 0000:83:00.1: AER: can't recover (no error_detected callback)
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_PGFSM_STATUS] failed to reach value 0x00800000 != 0x00c00000
[Thu Jan 19 14:24:30 2023] [drm:jpeg_v3_0_set_powergating_state.cold [amdgpu]] *ERROR* amdgpu: JPEG enable power gating failed
[Thu Jan 19 14:24:30 2023] [drm:amdgpu_device_ip_set_powergating_state [amdgpu]] *ERROR* set_powergating_state of IP block <jpeg_v3_0> failed -110
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:30 2023] [drm] Register(0) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[Thu Jan 19 14:24:31 2023] [drm] Register(0) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] pcieport 0000:82:00.0: AER: Downstream Port link has been reset (0)
[Thu Jan 19 14:24:31 2023] pcieport 0000:82:00.0: AER: device recovery failed
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_RBC_RB_RPTR] failed to reach value 0x7fffffff != 0xffffffff
[Thu Jan 19 14:24:31 2023] [drm] Register(1) [mmUVD_POWER_STATUS] failed to reach value 0x00000001 != 0x00000003
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:31 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:34 param:0x00000001 message:SetWorkloadMask?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:13 param:0x00000000 message:GetEnabledSmuFeaturesHigh?
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[Thu Jan 19 14:24:32 2023] amdgpu 0000:83:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:41 param:0x00000000 message:DisallowGfxOff?
[Thu Jan 19 14:24:33 2023] amdgpu 0000:83:00.0: amdgpu: Failed to disable gfxoff!

After this, anything that tries to talk to the GPU hangs indefinitely, including lspci. The only way to get the system semi-functional again is to blacklist the amdgpu module and reboot.
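
For anyone reading the log, the aer_uncor_status, aer_uncor_mask and aer_uncor_severity words can be decoded by bit position. Below is a minimal Python sketch, assuming the standard PCIe AER uncorrectable-error bit layout. The names follow the abbreviations lspci prints, except "UncorrIntErr" for bit 22, which is my own label. The table and the function name decode_aer_uncor are made up for illustration and do not come from any tool.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status/Mask/Severity
# registers, labelled with lspci-style abbreviations.
# Assumption: standard layout from the PCIe spec; only common bits are listed.
AER_UNCOR_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
    22: "UncorrIntErr",  # label chosen here for Uncorrectable Internal Error
}


def decode_aer_uncor(value):
    """Return the names of all bits set in an AER uncorrectable register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Unknown bits are reported by position instead of being dropped.
            names.append(AER_UNCOR_BITS.get(bit, f"bit{bit}"))
    return names


# Values copied from the log above.
print("status  :", decode_aer_uncor(0x00008000))
print("mask    :", decode_aer_uncor(0x00010000))
print("severity:", decode_aer_uncor(0x004EF030))
```

With the values from the log, the status word has only bit 15 set, which agrees with the "[15] CmpltAbrt" line the kernel printed, and the mask word has only bit 16 (UnxCmplt) set.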

Here is the output of lspci -vv for my GPU (before the driver is loaded):


83:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21 GL-XL [Radeon PRO W6800] (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0e1e
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 255
	NUMA node: 1
	IOMMU group: 72
	Region 0: Memory at 71d90000000 (64-bit, prefetchable) [disabled] [size=256M]
	Region 2: Memory at 71da0000000 (64-bit, prefetchable) [disabled] [size=2M]
	Region 4: I/O ports at 5000 [disabled] [size=256]
	Region 5: Memory at 99200000 (32-bit, non-prefetchable) [disabled] [size=1M]
	Expansion ROM at 99320000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s (ok), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr- BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [200 v1] Physical Resizable BAR
		BAR 0: current size: 256MB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB
		BAR 2: current size: 2MB, supported: 2MB 4MB 8MB 16MB 32MB 64MB 128MB 256MB
	Capabilities: [240 v1] Power Budgeting <?>
	Capabilities: [270 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [2a0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [2d0 v1] Process Address Space ID (PASID)
		PASIDCap: Exec+ Priv+, Max PASID Width: 10
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [320 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [410 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [440 v1] Lane Margining at the Receiver <?>
	Kernel modules: amdgpu

83:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Navi 21/23 HDMI/DP Audio Controller
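
As a cross-check, the TLP header from the AER log can be split into fields and compared against the BARs listed above. This is a rough Python sketch, assuming the logged header is a 3-DW memory request in the standard PCIe layout. That is my reading of the format, not something the log states. The function name decode_mem_request_header is hypothetical, and the Region 5 base and size are copied from the lspci output.

```python
def decode_mem_request_header(dw0, dw1, dw2):
    """Split a 3-DW PCIe memory request header into its main fields.

    Assumption: standard PCIe TLP layout with a 32-bit address.
    """
    requester = dw1 >> 16
    bus = requester >> 8
    dev = (requester >> 3) & 0x1F
    fn = requester & 0x7
    return {
        "fmt": (dw0 >> 29) & 0x7,      # 0 = 3 DW header, no data
        "type": (dw0 >> 24) & 0x1F,    # 0 = memory request
        "td": (dw0 >> 15) & 0x1,       # TLP digest (ECRC) present
        "length_dw": dw0 & 0x3FF,      # payload length in DWs
        "requester": f"{bus:02x}:{dev:02x}.{fn}",
        "tag": (dw1 >> 8) & 0xFF,
        "first_be": dw1 & 0xF,
        "address": dw2 & 0xFFFFFFFC,   # low two bits are not address bits
    }


# Header words copied from the "TLP Header:" line in the log.
hdr = decode_mem_request_header(0x00009001, 0x8000220F, 0x99269934)

# Region 5 from lspci: Memory at 99200000 (32-bit, non-prefetchable) [size=1M]
REGION5_BASE = 0x99200000
REGION5_SIZE = 1 << 20
in_region5 = REGION5_BASE <= hdr["address"] < REGION5_BASE + REGION5_SIZE

print(hdr)
print("address inside Region 5:", in_region5)
```

Under these assumptions the header reads as a 1-DW memory read from requester 80:00.0 to address 0x99269934, which falls inside Region 5 (the 1M non-prefetchable BAR at 99200000).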

