Archives Discussions

mjharvey · ‎01-13-2010

Driver bug?

Hi,

I have a freshly installed Ubuntu 9.04 64-bit system (2.6.28-17-generic #58), with the 1.0 SDK and the hotfix driver.

When I run my application, it executes several kernels correctly but then seems to get stuck when running a particular kernel. The host process remains busy inside a thread spawned from the ocl runtime.

Eventually the kernel execution terminates (without returning an error to the host app) and the attached trace is printed to the kernel log. The program then runs three more kernels without error before attempting a memory copy from device to host.

At this point, the program locks up and becomes unkillable (once again, a thread spawned from the runtime is busy at 100% CPU). The X server is similarly locked and unkillable.

It's possible that there's a bug in that kernel (though it works OK with Nvidia OpenCL), but there's obviously a bug in the driver that's preventing recovery from the hang. Hopefully the stack trace will be useful for tracking down the fault.

MJH

[13383.095745] fglrx_pci 0000:01:00.0: irq 2300 for MSI/MSI-X
[13383.096308] [fglrx] Firegl kernel thread PID: 7701
[13383.436729] [fglrx] Gart USWC size:619 M.
[13383.436732] [fglrx] Gart cacheable size:244 M.
[13383.436736] [fglrx] Reserved FB block: Shared offset:0, size:1000000
[13383.436738] [fglrx] Reserved FB block: Unshared offset:fb0d000, size:1f3000
[13383.436739] [fglrx] Reserved FB block: Unshared offset:3fffb000, size:5000
[13556.810544] fglrx_pci 0000:01:00.0: irq 2300 for MSI/MSI-X
[13556.811104] [fglrx] Firegl kernel thread PID: 7948
[13557.152313] [fglrx] Gart USWC size:619 M.
[13557.152315] [fglrx] Gart cacheable size:244 M.
[13557.152320] [fglrx] Reserved FB block: Shared offset:0, size:1000000
[13557.152322] [fglrx] Reserved FB block: Unshared offset:fb0d000, size:1f3000
[13557.152324] [fglrx] Reserved FB block: Unshared offset:3fffb000, size:5000
[15110.614875] [fglrx] ASIC hang happened
[15110.614881] Pid: 11963, comm: acemd.ocl Tainted: P 2.6.28-17-generic #58-Ubuntu
[15110.614883] Call Trace:
[15110.614943] [<ffffffffa00f23c9>] KCL_DEBUG_OsDump+0x9/0x10 [fglrx]
[15110.614981] [<ffffffffa00ff3cc>] firegl_hardwareHangRecovery+0x1c/0x50 [fglrx]
[15110.615037] [<ffffffffa0174b19>] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x9/0x10 [fglrx]
[15110.615092] [<ffffffffa0174acc>] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0x6c/0xb0 [fglrx]
[15110.615147] [<ffffffffa0173b6d>] ? _ZN4Asic19PM4ElapsedTimeStampERK23PM4_TS_INTERRUPT_PARAMSj14_LARGE_INTEGER+0x18d/0x1c0 [fglrx]
[15110.615153] [<ffffffff8069bab9>] ? _spin_lock+0x9/0x10
[15110.615194] [<ffffffffa011b142>] ? firegl_trace+0x72/0x1e0 [fglrx]
[15110.615234] [<ffffffffa011b142>] ? firegl_trace+0x72/0x1e0 [fglrx]
[15110.615291] [<ffffffffa016cb02>] ? _ZN15QS_PRIVATE_CORE27multiVpuPM4ElapsedTimeStampERK23PM4_TS_INTERRUPT_PARAMSj14_LARGE_INTEGER+0x32/0x50 [fglrx]
[15110.615345] [<ffffffffa0166c13>] ? _Z19uQSTimeStampRetiredmjj14_LARGE_INTEGER+0xa3/0xb0 [fglrx]
[15110.615397] [<ffffffffa01626f9>] ? _Z8uCWDDEQCmjjPvjS_+0x379/0x10c0 [fglrx]
[15110.615439] [<ffffffffa011d5b4>] ? firegl_cmmqs_CWDDE_32+0x334/0x440 [fglrx]
[15110.615479] [<ffffffffa011c060>] ? firegl_cmmqs_CWDDE32+0x70/0x100 [fglrx]
[15110.615484] [<ffffffff803f5f9c>] ? apparmor_capable+0x1c/0x70
[15110.615524] [<ffffffffa011bff0>] ? firegl_cmmqs_CWDDE32+0x0/0x100 [fglrx]
[15110.615560] [<ffffffffa00fb19a>] ? firegl_ioctl+0x1ea/0x250 [fglrx]
[15110.615564] [<ffffffff8069bab9>] ? _spin_lock+0x9/0x10
[15110.615598] [<ffffffffa00f05c1>] ? ip_firegl_ioctl+0x11/0x20 [fglrx]
[15110.615602] [<ffffffff802f66cd>] ? vfs_ioctl+0x7d/0xa0
[15110.615605] [<ffffffff802f6a35>] ? do_vfs_ioctl+0x75/0x230
[15110.615607] [<ffffffff802f6c89>] ? sys_ioctl+0x99/0xa0
[15110.615611] [<ffffffff8021253a>] ? system_call_fastpath+0x16/0x1b
[15110.615615] pubdev:0xffffffffa0314dc0, num of device:1 , name:fglrx, major 8, minor 68.
[15110.615617] device 0 : 0xffff88007c448000 .
[15110.615619] Asic ID:0x6899, revision:0x2, MMIOReg:0xffffc200101c0000.
[15110.615622] FB phys addr: 0xd0000000, MC :0xf00000000, Total FB size :0x40000000.
[15110.615624] gart table MC:0xf0fb0d000, Physical:0xdfb0d000, size:0x1f2000.
[15110.615627] mc_node :FB, total 1 zones
[15110.615629] MC start:0xf00000000, Physical:0xd0000000, size:0xfd00000.
[15110.615631] Mapped heap -- Offset:0x0, size:0xfb0d000, reference count:9, mapping count:0,
[15110.615634] Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,
[15110.615637] Mapped heap -- Offset:0xfb0d000, size:0x1f3000, reference count:1, mapping count:0,
[15110.615639] mc_node :INV_FB, total 1 zones
[15110.615641] MC start:0xf0fd00000, Physical:0xdfd00000, size:0x30300000.
[15110.615643] Mapped heap -- Offset:0x302fb000, size:0x5000, reference count:1, mapping count:0,
[15110.615645] mc_node :GART_USWC, total 2 zones
[15110.615647] MC start:0x27530000, Physical:0x0, size:0x26b50000.
[15110.615650] Mapped heap -- Offset:0x10000, size:0x2000000, reference count:14, mapping count:0,
[15110.615652] mc_node :GART_CACHEABLE, total 3 zones
[15110.615654] MC start:0x10400000, Physical:0x0, size:0x17130000.
[15110.615656] Mapped heap -- Offset:0x200000, size:0x200000, reference count:1, mapping count:0,
[15110.615659] Mapped heap -- Offset:0x0, size:0x200000, reference count:3, mapping count:0,
[15110.615661] Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,
[15110.615663] Dump the trace queue.
[15110.615665] End of dump
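
For anyone who wants to pull the call chain out of a dump like the one above, here is a minimal Python sketch. It is not part of the original post; the function names `split_entries` and `call_trace` are hypothetical, and it assumes only the format visible in this log: a bracketed timestamp starting each entry, and frames of the form `[<address>] symbol+0xoffset/0xsize [module]`. A leading `?` on a frame is the kernel's marker for an address found on the stack that the unwinder could not confirm, so the sketch records it as unreliable rather than dropping it.

```python
import re

# A kernel timestamp such as "[15110.614943]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")

# A call-trace frame such as
# "[<ffffffffa00f23c9>] ? symbol+0x9/0x10 [module]".
# The "?" and the trailing "[module]" are both optional.
FRAME = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\?\s+)?(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def split_entries(text):
    """Split kernel-log text into (timestamp, message) pairs.

    Works whether the log has one entry per line or has been
    flattened into a single run of text.
    """
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    marks = list(TIMESTAMP.finditer(text))
    entries = []
    for i, mark in enumerate(marks):
        end = marks[i + 1].start() if i + 1 < len(marks) else len(text)
        entries.append((float(mark.group(1)), text[mark.end():end].strip()))
    return entries


def call_trace(entries):
    """Return the stack frames found among the log entries, in order."""
    frames = []
    for _, message in entries:
        match = FRAME.match(message)
        if match:
            frames.append({
                "address": match.group(1),
                "reliable": match.group(2) is None,  # no "?" marker
                "symbol": match.group(3),
                # Frames without a module tag belong to the core kernel.
                "module": match.group(4) or "kernel",
            })
    return frames


if __name__ == "__main__":
    sample = (
        "[15110.614883] Call Trace: "
        "[15110.614943] [<ffffffffa00f23c9>] KCL_DEBUG_OsDump+0x9/0x10 [fglrx] "
        "[15110.615153] [<ffffffff8069bab9>] ? _spin_lock+0x9/0x10"
    )
    for frame in call_trace(split_entries(sample)):
        print(frame["module"], frame["symbol"], frame["reliable"])
```

The mangled C++ names in the trace (those starting with `_Z`) can be passed through a demangler such as `c++filt` afterwards; that step is left out here to keep the sketch free of external tools.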

genaganna · ‎01-26-2010

Mjharvey,

Could you please post your application here or send it to streamdeveloper@amd.com?
