
GIM log error: GFX engine hang detected when killing an S7150 VM

Question asked by dhscq on Feb 18, 2019

Hi everyone,

I'm testing MxGPU technology with S7150 + SR-IOV + FIO + qemu-kvm-ev-2.6.0 + CentOS7.4 + Kernel-4.14.37-4.el7.x86_64.

When I destroy (kill) a VM that has a virtual S7150 passed through to it, the gim driver logs warnings and errors like the following:

 

grep -E 'kernel|gim' /var/log/messages

2019-02-19T11:00:14.670229+08:00 node137 kernel: gim warning:(wait_cmd_complete:1671) GFX engine hang detected
2019-02-19T11:00:14.670313+08:00 node137 kernel: gim error:(wait_cmd_complete:1681) wait_cmd_complete -- time out after 0.101874202 sec
2019-02-19T11:00:14.670349+08:00 node137 kernel: gim error:(wait_cmd_complete:1688) Cmd = 0x1, Status = 0xe
2019-02-19T11:00:14.670382+08:00 node137 kernel: gim error:(dump_gpu_status:1420) **** dump gpu status begin for struct adapter 4:00.00
2019-02-19T11:00:14.671216+08:00 node137 kernel: gim info:(check_base_addrs:1408) CP_MQD_BASE_ADDR = 0xf4:0f9ff000
2019-02-19T11:00:14.671272+08:00 node137 kernel: gim error:(dump_gpu_status:1427) CP Ring buffer is not empty,
2019-02-19T11:00:14.671307+08:00 node137 kernel: gim error:(dump_gpu_status:1428) RPTR = 0x00002188, WPTR = 0x00000000
2019-02-19T11:00:14.672509+08:00 node137 kernel: gim error:(dump_gpu_status:1430) When IDLE_GPU was sent RPTR = 0x00002188,#011WPTR = 0x00000000
2019-02-19T11:00:14.672591+08:00 node137 kernel: gim warning:(ring_is_empty:1272) CP_RB_WPTR (0x00000000) != CP_RB_RPTR (0x00002188)
2019-02-19T11:00:14.672634+08:00 node137 kernel: gim error:(dump_gpu_status:1434) At least one ring is active
2019-02-19T11:00:14.673205+08:00 node137 kernel: gim error:(dump_gpu_status:1457) mmGRBM_STATUS = 0xa0003028
2019-02-19T11:00:14.673244+08:00 node137 kernel: gim error:(dump_gpu_status:1460) mmGRBM_STATUS2 = 0x71000808
2019-02-19T11:00:14.674276+08:00 node137 kernel: gim error:(dump_gpu_status:1463) mmSRBM_STATUS = 0x20020040
2019-02-19T11:00:14.674328+08:00 node137 kernel: gim error:(dump_gpu_status:1466) mmSRBM_STATUS2 = 0x0
2019-02-19T11:00:14.674361+08:00 node137 kernel: gim error:(dump_gpu_status:1469) mmSDMA0_STATUS_REG = 0x46deed57
2019-02-19T11:00:14.675375+08:00 node137 kernel: gim error:(dump_gpu_status:1472) mmSDMA1_STATUS_REG = 0x46deed57
2019-02-19T11:00:14.675436+08:00 node137 kernel: gim error:(dump_gpu_status:1486) CP busy
2019-02-19T11:00:14.675470+08:00 node137 kernel: gim error:(dump_gpu_status:1491) RLC busy
2019-02-19T11:00:14.684185+08:00 node137 kernel: gim error:(dump_gpu_status:1521) CP busy
2019-02-19T11:00:14.684224+08:00 node137 kernel: gim error:(dump_gpu_status:1563) CP_CPF_STATUS = 0xb4000223
2019-02-19T11:00:14.684256+08:00 node137 kernel: gim error:(dump_gpu_status:1565) The write pointer has been updated and
2019-02-19T11:00:14.685472+08:00 node137 kernel: gim error:(dump_gpu_status:1566) the initiated work is still being processed
2019-02-19T11:00:14.685524+08:00 node137 kernel: gim error:(dump_gpu_status:1567) by the GFX pipe
2019-02-19T11:00:14.686127+08:00 node137 kernel: gim info:(check_me_cntl:1396) ME/PFP/CE running GPU dump
2019-02-19T11:00:14.686166+08:00 node137 kernel: gim error:(dump_gpu_status:1583) CP_CPF_BUSY_STAT = 0x00000002
2019-02-19T11:00:14.686198+08:00 node137 kernel: gim error:(dump_gpu_status:1588) **** dump gpu status end
2019-02-19T11:00:14.687105+08:00 node137 kernel: gim error:(world_switch:3005) Schedule VF1 to VF2 failed;Failure reason is 3, try to reset
2019-02-19T11:00:14.687147+08:00 node137 kernel: gim info:(gim_notify_reset_per_vf:4143) Notify reset to VF1
2019-02-19T11:00:14.687186+08:00 node137 kernel: gim info:(mailbox_update_index:836) write mmMAILBOX_INDEX: 0x1
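A dump like the one above is easier to compare across runs when it is reduced to counts per severity and reporting function. The following is only a minimal sketch: it assumes every gim message follows the `gim <level>:(<function>:<line>) <message>` layout seen in this log, and the names `GIM_RE`, `parse_gim_line`, `summarize` and `SAMPLE` are hypothetical helpers made up for illustration, not part of the GIM driver or any AMD tool.

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines in this post:
#   "<timestamp> <host> kernel: gim <level>:(<function>:<line>) <message>"
GIM_RE = re.compile(
    r"kernel: gim (?P<level>\w+):\((?P<func>\w+):(?P<line>\d+)\)\s*(?P<msg>.*)"
)


def parse_gim_line(line):
    """Return a dict for one gim log line, or None if the line does not match."""
    m = GIM_RE.search(line)
    if m is None:
        return None
    return {
        "level": m["level"],
        "func": m["func"],
        "line": int(m["line"]),
        "msg": m["msg"].strip(),
    }


def summarize(lines):
    """Count gim messages per (level, function) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_gim_line(line)
        if rec is not None:
            counts[(rec["level"], rec["func"])] += 1
    return counts


# Three lines copied from the log in this post, used as sample input.
SAMPLE = [
    "2019-02-19T11:00:14.670229+08:00 node137 kernel: gim warning:(wait_cmd_complete:1671) GFX engine hang detected",
    "2019-02-19T11:00:14.670313+08:00 node137 kernel: gim error:(wait_cmd_complete:1681) wait_cmd_complete -- time out after 0.101874202 sec",
    "2019-02-19T11:00:14.670349+08:00 node137 kernel: gim error:(wait_cmd_complete:1688) Cmd = 0x1, Status = 0xe",
]

if __name__ == "__main__":
    for (level, func), n in sorted(summarize(SAMPLE).items()):
        print(level, func, n)
```

To run it against the real log instead of the sample, pass an open file handle for `/var/log/messages` to `summarize`; lines that are not gim messages are skipped.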

 

Do these errors matter, and if so, how can I prevent them?
