3 Replies Latest reply on Mar 20, 2017 10:19 AM by jstefanop

    AMDGPU-PRO Kernel Panic on 16.04 Ubuntu with Kabylake based system

    jstefanop

      A base Ubuntu 16.04 server install with the latest AMDGPU-PRO drivers (16.60.3) causes a kernel panic on a Kaby Lake based system whenever anything OpenCL-related is accessed (in this case, when clinfo is called). A kernel dump of the issue is below. The same install works fine on a Haswell based system. On the Kaby Lake system, clinfo returns fine when the AMD GPU (an RX470 in this case) is taken out, and it reports the OpenCL info of the Kaby Lake GPU, so the issue is definitely with the AMDGPU-PRO drivers when they try to access the AMD GPU.

       

       

      [  106.745104] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8

      [  106.745133] IP: [<ffffffffc0238d90>] amdttm_pool_populate+0x110/0x5c0 [amdttm]

      [  106.745154] PGD 273a0b067 PUD 26e9a6067 PMD 0

      [  106.745166] Oops: 0000 [#1] SMP

      [  106.745176] Modules linked in: cfg80211 x86_pkg_temp_thermal coretemp kvm_intel ipmi_ssif kvm irqbypass snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd input_leds soundcore mei_me hci_uart mei btbcm btqca btintel bluetooth 8250_fintek ipmi_msghandler intel_lpss_acpi intel_lpss shpchp acpi_power_meter mac_hid acpi_als kfifo_buf industrialio ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear amdkfd amd_iommu_v2 hid_apple amdgpu(OE) amdttm(OE) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ast ablk_helper ttm cryptd

      [  106.745371] amdkcl(OE) igb drm_kms_helper dca syscopyarea sysfillrect ptp sysimgblt fb_sys_fops pps_core i2c_algo_bit drm ahci usbhid libahci video i2c_hid pinctrl_sunrisepoint pinctrl_intel hid fjes

      [  106.745421] CPU: 1 PID: 1367 Comm: clinfo Tainted: G           OE   4.4.0-64-generic #85-Ubuntu

      [  106.745437] Hardware name: Supermicro Super Server/X11SSL(-F)/X11SSM, BIOS 2.0 01/06/2017

      [  106.745452] task: ffff88026fc44b00 ti: ffff8802742cc000 task.ti: ffff8802742cc000

      [  106.745465] RIP: 0010:[<ffffffffc0238d90>]  [<ffffffffc0238d90>] amdttm_pool_populate+0x110/0x5c0 [amdttm]

      [  106.745486] RSP: 0018:ffff8802742cf890  EFLAGS: 00010246

      [  106.745496] RAX: 00000000024280c0 RBX: 0000000000000000 RCX: ffff88027305ca00

      [  106.745509] RDX: 0000000000000001 RSI: 0000000000000040 RDI: 0000000000000090

      [  106.745521] RBP: ffff8802742cf928 R08: ffff88027fc9a160 R09: 0000000000000000

      [  106.745534] R10: ffff88027305c800 R11: 0000000000000090 R12: ffff880273edf900

      [  106.745547] R13: ffff88027305c800 R14: ffff8802742cf8d8 R15: 0000000000000000

      [  106.745560] FS:  00007f5db1a4a740(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000

      [  106.745574] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

      [  106.745585] CR2: 00000000000000a8 CR3: 0000000273e02000 CR4: 00000000003406e0

      [  106.745598] Stack:

      [  106.745602] ffff8802742cf960 ffffffff811ecb69 ffff88026fd63600 000000018010000e

      [  106.745619] 0000000000000000 024280c000000040 0000000000000038 0000000000000000

      [  106.745635] 0000000000000040 ffffffff811ee7d8 ffff880277001400 ffffffffc023109b

      [  106.745652] Call Trace:

      [  106.745660] [<ffffffff811ecb69>] ? ___slab_alloc+0x1e9/0x470

      [  106.745672] [<ffffffff811ee7d8>] ? __kmalloc+0x208/0x250

      [  106.745684] [<ffffffffc023109b>] ? amdttm_dma_tt_init+0x6b/0xd0 [amdttm]

      [  106.745716] [<ffffffffc026717f>] amdgpu_ttm_tt_populate+0x6f/0x240 [amdgpu]

      [  106.745731] [<ffffffffc0230ae7>] amdttm_tt_bind+0x37/0x70 [amdttm]

      [  106.745744] [<ffffffffc0232e40>] ttm_bo_handle_move_mem+0x530/0x5a0 [amdttm]

      [  106.745758] [<ffffffffc0233d4a>] amdttm_bo_validate+0x13a/0x150 [amdttm]

      [  106.745772] [<ffffffffc0233f89>] amdttm_bo_init+0x229/0x430 [amdttm]

      [  106.745798] [<ffffffffc026ab07>] amdgpu_bo_create_restricted+0x217/0x530 [amdgpu]

      [  106.745821] [<ffffffffc026a2d0>] ? amdgpu_bo_gpu_offset+0x150/0x150 [amdgpu]

      [  106.745845] [<ffffffffc026b0cd>] amdgpu_bo_create+0xed/0x190 [amdgpu]

      [  106.745867] [<ffffffffc026f3b3>] amdgpu_gem_object_create+0x103/0x1b0 [amdgpu]

      [  106.745891] [<ffffffffc026f8dc>] amdgpu_gem_create_ioctl+0xac/0x1b0 [amdgpu]

      [  106.745911] [<ffffffffc009b752>] drm_ioctl+0x152/0x540 [drm]

      [  106.745933] [<ffffffffc026f830>] ? amdgpu_gem_object_close+0x120/0x120 [amdgpu]

      [  106.745948] [<ffffffff8119fd07>] ? lru_cache_add_active_or_unevictable+0x27/0xa0

      [  106.746549] [<ffffffffc025504c>] amdgpu_drm_ioctl+0x4c/0x80 [amdgpu]

      [  106.747140] [<ffffffff81222b5f>] do_vfs_ioctl+0x29f/0x490

      [  106.747731] [<ffffffff8106b514>] ? __do_page_fault+0x1b4/0x400

      [  106.748325] [<ffffffff81222dc9>] SyS_ioctl+0x79/0x90

      [  106.748924] [<ffffffff8183c5f2>] entry_SYSCALL_64_fastpath+0x16/0x71

      [  106.749515] Code: 01 19 c0 4e 8d 9c 3b 90 00 00 00 25 00 80 ff ff 05 c0 80 42 02 4d 85 db 89 45 94 0f 84 4f 02 00 00 49 01 df 4c 89 df 4c 89 4d 88 <41> 8b 87 a8 00 00 00 4c 89 5d 98 4c 89 75 b0 4c 89 75 b8 89 45

      [  106.750794] RIP  [<ffffffffc0238d90>] amdttm_pool_populate+0x110/0x5c0 [amdttm]

      [  106.751422] RSP <ffff8802742cf890>

      [  106.752242] CR2: 00000000000000a8

      [  106.752863] ---[ end trace 912d1e00331fc37d ]---
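For anyone comparing their own trace against the one above, here is a minimal sketch that pulls the key fields out of an oops report: the fault address, the faulting symbol, the module that owns it, and the loaded-module list. The function name `summarize_oops` and the regular expressions are my own assumptions, written against the log format shown in this thread (a 4.4 kernel on x86_64); other kernel versions format these lines differently.

```python
import re

# "IP: [<ffffffffc0238d90>] amdttm_pool_populate+0x110/0x5c0 [amdttm]"
# The \b keeps this from matching the later "RIP:" lines.
IP_RE = re.compile(
    r"\bIP: \[<[0-9a-f]+>\] (\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)
ADDR_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]+)")
MODS_RE = re.compile(r"Modules linked in: (.*)")


def summarize_oops(log_text):
    """Extract fault address, faulting symbol, owning module and the
    module names from the first 'Modules linked in:' line.

    Hypothetical helper; it assumes the line layout seen in this thread.
    """
    summary = {"address": None, "symbol": None, "module": None, "modules": []}
    for line in log_text.splitlines():
        m = ADDR_RE.search(line)
        if m and summary["address"] is None:
            summary["address"] = int(m.group(1), 16)
        m = IP_RE.search(line)
        if m and summary["symbol"] is None:
            summary["symbol"] = m.group(1)
            summary["module"] = m.group(2)  # None for built-in symbols
        m = MODS_RE.search(line)
        if m and not summary["modules"]:
            # Drop taint suffixes such as "(OE)" from module names.
            summary["modules"] = [
                re.sub(r"\(\w+\)$", "", name) for name in m.group(1).split()
            ]
    return summary
```

Feed it the output of `dmesg` or the relevant part of the kernel log. Two reports that give the same `symbol` and `address` are likely hitting the same code path, and comparing the `modules` lists shows which other graphics drivers were loaded at the time.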

        • Re: AMDGPU-PRO Kernel Panic on 16.04 Ubuntu with Kabylake based system
          amdev

          I have a similar problem, but it is not tied to a particular type of CPU. It occurs randomly when running an OpenCL-based app or the 'clinfo' command. The problem may or may not disappear after turning the PC off and on.
          OS is Ubuntu 16.04, kernel:

          $ uname -a

          Linux uminer 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

          AMDGPU-PRO: 16.60-379184. Graphics card: RX480.
          kernel log:

          Mar  7 03:11:15 uminer kernel: [   77.557747] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8

          Mar  7 03:11:15 uminer kernel: [   77.557949] IP: [<ffffffffc01b5d90>] amdttm_pool_populate+0x110/0x5c0 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.558997] PGD a2fde067 PUD a6888067 PMD 0

          Mar  7 03:11:15 uminer kernel: [   77.560262] Oops: 0000 [#3] SMP

          Mar  7 03:11:15 uminer kernel: [   77.561499] Modules linked in: drbg ansi_cprng xts gf128mul dm_crypt snd_hda_codec_via snd_hda_codec_generic snd_hda_codec_hdmi kvm_amd snd_hda_intel kvm snd_hda_codec snd_hda_core snd_hwdep edac_mce_amd snd_pcm irqbypass serio_raw edac_core snd_timer k10temp snd soundcore shpchp i2c_piix4 8250_fintek mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear pata_acpi amdkfd amd_iommu_v2 amdgpu(OE) amdttm(OE) nouveau mxm_wmi video ttm amdkcl(OE) i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci psmouse pata_atiixp libahci r8169 mii drm wmi fjes

          Mar  7 03:11:15 uminer kernel: [   77.571011] CPU: 1 PID: 1371 Comm: clinfo Tainted: G      D    OE   4.4.0-62-generic #83-Ubuntu

          Mar  7 03:11:15 uminer kernel: [   77.572393] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./M3A770DE , BIOS P1.80 08/25/2011

          Mar  7 03:11:15 uminer kernel: [   77.573774] task: ffff880151075940 ti: ffff8800a7734000 task.ti: ffff8800a7734000

          Mar  7 03:11:15 uminer kernel: [   77.575147] RIP: 0010:[<ffffffffc01b5d90>]  [<ffffffffc01b5d90>] amdttm_pool_populate+0x110/0x5c0 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.576554] RSP: 0018:ffff8800a7737890  EFLAGS: 00010246

          Mar  7 03:11:15 uminer kernel: [   77.577951] RAX: 00000000024280c0 RBX: 0000000000000000 RCX: ffff8800a7756a00

          Mar  7 03:11:15 uminer kernel: [   77.579333] RDX: 0000000000000001 RSI: 0000000000000040 RDI: 0000000000000090

          Mar  7 03:11:15 uminer kernel: [   77.580731] RBP: ffff8800a7737928 R08: ffff880157c5a160 R09: 0000000000000000

          Mar  7 03:11:15 uminer kernel: [   77.582128] R10: ffff8800a7756800 R11: 0000000000000090 R12: ffff8800a228f900

          Mar  7 03:11:15 uminer kernel: [   77.583513] R13: ffff8800a7756800 R14: ffff8800a77378d8 R15: 0000000000000000

          Mar  7 03:11:15 uminer kernel: [   77.584932] FS:  00007ffa26f6a740(0000) GS:ffff880157c40000(0000) knlGS:0000000000000000

          Mar  7 03:11:15 uminer kernel: [   77.586323] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

          Mar  7 03:11:15 uminer kernel: [   77.587734] CR2: 00000000000000a8 CR3: 00000000a7717000 CR4: 00000000000006e0

          Mar  7 03:11:15 uminer kernel: [   77.589165] Stack:

          Mar  7 03:11:15 uminer kernel: [   77.590590]  0000000000000000 0000000000000001 ffff88009ae1de40 00000001810ac572

          Mar  7 03:11:15 uminer kernel: [   77.592052]  0000000000000000 024280c000000040 0000000000000000 0000000000000000

          Mar  7 03:11:15 uminer kernel: [   77.593484]  0000000000000040 ffffffff811ee468 ffff880153001400 ffffffffc01ae09b

          Mar  7 03:11:15 uminer kernel: [   77.594950] Call Trace:

          Mar  7 03:11:15 uminer kernel: [   77.596400]  [<ffffffff811ee468>] ? __kmalloc+0x208/0x250

          Mar  7 03:11:15 uminer kernel: [   77.597831]  [<ffffffffc01ae09b>] ? amdttm_dma_tt_init+0x6b/0xd0 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.599374]  [<ffffffffc039717f>] amdgpu_ttm_tt_populate+0x6f/0x240 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.600748]  [<ffffffffc01adae7>] amdttm_tt_bind+0x37/0x70 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.602194]  [<ffffffffc01afe40>] ttm_bo_handle_move_mem+0x530/0x5a0 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.603681]  [<ffffffffc01b0d4a>] amdttm_bo_validate+0x13a/0x150 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.605136]  [<ffffffffc01b0f89>] amdttm_bo_init+0x229/0x430 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.606648]  [<ffffffffc039ab07>] amdgpu_bo_create_restricted+0x217/0x530 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.608087]  [<ffffffffc039a2d0>] ? amdgpu_bo_gpu_offset+0x150/0x150 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.609556]  [<ffffffffc039b0cd>] amdgpu_bo_create+0xed/0x190 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.611029]  [<ffffffffc039f3b3>] amdgpu_gem_object_create+0x103/0x1b0 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.612484]  [<ffffffffc039f8dc>] amdgpu_gem_create_ioctl+0xac/0x1b0 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.613964]  [<ffffffffc0035752>] drm_ioctl+0x152/0x540 [drm]

          Mar  7 03:11:15 uminer kernel: [   77.615381]  [<ffffffffc039f830>] ? amdgpu_gem_object_close+0x120/0x120 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.616734]  [<ffffffff8119fb17>] ? lru_cache_add_active_or_unevictable+0x27/0xa0

          Mar  7 03:11:15 uminer kernel: [   77.618108]  [<ffffffffc038504c>] amdgpu_drm_ioctl+0x4c/0x80 [amdgpu]

          Mar  7 03:11:15 uminer kernel: [   77.619443]  [<ffffffff812227af>] do_vfs_ioctl+0x29f/0x490

          Mar  7 03:11:15 uminer kernel: [   77.620806]  [<ffffffff8106b514>] ? __do_page_fault+0x1b4/0x400

          Mar  7 03:11:15 uminer kernel: [   77.622155]  [<ffffffff818344c5>] ? schedule+0x35/0x80

          Mar  7 03:11:15 uminer kernel: [   77.623487]  [<ffffffff81222a19>] SyS_ioctl+0x79/0x90

          Mar  7 03:11:15 uminer kernel: [   77.624801]  [<ffffffff818385f2>] entry_SYSCALL_64_fastpath+0x16/0x71

          Mar  7 03:11:15 uminer kernel: [   77.626120] Code: 01 19 c0 4e 8d 9c 3b 90 00 00 00 25 00 80 ff ff 05 c0 80 42 02 4d 85 db 89 45 94 0f 84 4f 02 00 00 49 01 df 4c 89 df 4c 89 4d 88 <41> 8b 87 a8 00 00 00 4c 89 5d 98 4c 89 75 b0 4c 89 75 b8 89 45

          Mar  7 03:11:15 uminer kernel: [   77.629245] RIP  [<ffffffffc01b5d90>] amdttm_pool_populate+0x110/0x5c0 [amdttm]

          Mar  7 03:11:15 uminer kernel: [   77.630664]  RSP <ffff8800a7737890>

          Mar  7 03:11:15 uminer kernel: [   77.632034] CR2: 00000000000000a8

          Mar  7 03:11:15 uminer kernel: [   77.633416] ---[ end trace 87dcb04e66e31c12 ]---

          I can provide more info on request.

          • Re: AMDGPU-PRO Kernel Panic on 16.04 Ubuntu with Kabylake based system
            cedarlug

            I appear to have the same issue with a particular R290X. The kernel dereference error is the same with my setup (Ubuntu 16.04 and amdgpu 16.60).

            • Re: AMDGPU-PRO Kernel Panic on 16.04 Ubuntu with Kabylake based system
              jstefanop

              It looks like I have isolated my issue to the ASPEED AST kernel driver for the onboard graphics on this particular motherboard. Blacklisting the AST kernel module fixes the issue, but you obviously lose onboard graphics. It seems the AMDGPU-PRO OpenCL libraries are trying to query this GPU as well, which is causing the panic.
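As a rough sketch of the workaround described above, the snippet below checks whether the `ast` module is currently loaded and builds the matching modprobe blacklist line. The helper names are hypothetical. The only things taken as given are the `/proc/modules` layout (module name in the first column) and the standard `blacklist <module>` syntax used in modprobe.d files.

```python
def is_module_loaded(proc_modules_text, name):
    """Return True if `name` is listed in the given /proc/modules text.

    Each line of /proc/modules starts with the module name.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def blacklist_entry(name):
    """Build a modprobe.d line that stops `name` from being auto-loaded."""
    return "blacklist {}\n".format(name)


def read_proc_modules(path="/proc/modules"):
    """Read the live module list (Linux only); not called on import."""
    with open(path) as handle:
        return handle.read()
```

A typical use would be `is_module_loaded(read_proc_modules(), "ast")`, followed by writing `blacklist_entry("ast")` to a file under `/etc/modprobe.d/` as root. The file name is up to you, for example `blacklist-ast.conf`. On Ubuntu you would normally then run `sudo update-initramfs -u` and reboot so the module is also skipped in early boot.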