mjharvey
Adept I

Stack trace from linux kernel driver

Driver bug?

Hi,

I have a freshly installed Ubuntu 9.04 64-bit system (kernel 2.6.28-17-generic #58), with the 1.0 SDK and the hotfix driver.

When I run my application, it executes several kernels correctly but then seems to get stuck when running a particular kernel. The host process remains busy inside a thread spawned from the ocl runtime.

Eventually the kernel execution terminates (without returning an error to the host app) and the attached trace is printed to the kernel log. The program then runs three more kernels without error before attempting a memory copy from device to host.

At this point, the program locks up and becomes unkillable (once again, a thread spawned from the runtime is busy at 100% cpu). The X server is similarly locked and unkillable.

It's possible that there's a bug in that kernel (though it works OK with Nvidia OpenCL), but there's obviously a bug in the driver that's preventing recovery from the hang. Hopefully the stack trace will be useful for tracking down the fault.

 

MJH

 

 

[13383.095745] fglrx_pci 0000:01:00.0: irq 2300 for MSI/MSI-X
[13383.096308] [fglrx] Firegl kernel thread PID: 7701
[13383.436729] [fglrx] Gart USWC size:619 M.
[13383.436732] [fglrx] Gart cacheable size:244 M.
[13383.436736] [fglrx] Reserved FB block: Shared offset:0, size:1000000
[13383.436738] [fglrx] Reserved FB block: Unshared offset:fb0d000, size:1f3000
[13383.436739] [fglrx] Reserved FB block: Unshared offset:3fffb000, size:5000
[13556.810544] fglrx_pci 0000:01:00.0: irq 2300 for MSI/MSI-X
[13556.811104] [fglrx] Firegl kernel thread PID: 7948
[13557.152313] [fglrx] Gart USWC size:619 M.
[13557.152315] [fglrx] Gart cacheable size:244 M.
[13557.152320] [fglrx] Reserved FB block: Shared offset:0, size:1000000
[13557.152322] [fglrx] Reserved FB block: Unshared offset:fb0d000, size:1f3000
[13557.152324] [fglrx] Reserved FB block: Unshared offset:3fffb000, size:5000
[15110.614875] [fglrx] ASIC hang happened
[15110.614881] Pid: 11963, comm: acemd.ocl Tainted: P 2.6.28-17-generic #58-Ubuntu
[15110.614883] Call Trace:
[15110.614943] [<ffffffffa00f23c9>] KCL_DEBUG_OsDump+0x9/0x10 [fglrx]
[15110.614981] [<ffffffffa00ff3cc>] firegl_hardwareHangRecovery+0x1c/0x50 [fglrx]
[15110.615037] [<ffffffffa0174b19>] ? _ZN4Asic9WaitUntil15ResetASICIfHungEv+0x9/0x10 [fglrx]
[15110.615092] [<ffffffffa0174acc>] ? _ZN4Asic9WaitUntil15WaitForCompleteEv+0x6c/0xb0 [fglrx]
[15110.615147] [<ffffffffa0173b6d>] ? _ZN4Asic19PM4ElapsedTimeStampERK23PM4_TS_INTERRUPT_PARAMSj14_LARGE_INTEGER+0x18d/0x1c0 [fglrx]
[15110.615153] [<ffffffff8069bab9>] ? _spin_lock+0x9/0x10
[15110.615194] [<ffffffffa011b142>] ? firegl_trace+0x72/0x1e0 [fglrx]
[15110.615234] [<ffffffffa011b142>] ? firegl_trace+0x72/0x1e0 [fglrx]
[15110.615291] [<ffffffffa016cb02>] ? _ZN15QS_PRIVATE_CORE27multiVpuPM4ElapsedTimeStampERK23PM4_TS_INTERRUPT_PARAMSj14_LARGE_INTEGER+0x32/0x50 [fglrx]
[15110.615345] [<ffffffffa0166c13>] ? _Z19uQSTimeStampRetiredmjj14_LARGE_INTEGER+0xa3/0xb0 [fglrx]
[15110.615397] [<ffffffffa01626f9>] ? _Z8uCWDDEQCmjjPvjS_+0x379/0x10c0 [fglrx]
[15110.615439] [<ffffffffa011d5b4>] ? firegl_cmmqs_CWDDE_32+0x334/0x440 [fglrx]
[15110.615479] [<ffffffffa011c060>] ? firegl_cmmqs_CWDDE32+0x70/0x100 [fglrx]
[15110.615484] [<ffffffff803f5f9c>] ? apparmor_capable+0x1c/0x70
[15110.615524] [<ffffffffa011bff0>] ? firegl_cmmqs_CWDDE32+0x0/0x100 [fglrx]
[15110.615560] [<ffffffffa00fb19a>] ? firegl_ioctl+0x1ea/0x250 [fglrx]
[15110.615564] [<ffffffff8069bab9>] ? _spin_lock+0x9/0x10
[15110.615598] [<ffffffffa00f05c1>] ? ip_firegl_ioctl+0x11/0x20 [fglrx]
[15110.615602] [<ffffffff802f66cd>] ? vfs_ioctl+0x7d/0xa0
[15110.615605] [<ffffffff802f6a35>] ? do_vfs_ioctl+0x75/0x230
[15110.615607] [<ffffffff802f6c89>] ? sys_ioctl+0x99/0xa0
[15110.615611] [<ffffffff8021253a>] ? system_call_fastpath+0x16/0x1b
[15110.615615] pubdev:0xffffffffa0314dc0, num of device:1 , name:fglrx, major 8, minor 68.
[15110.615617] device 0 : 0xffff88007c448000 .
[15110.615619] Asic ID:0x6899, revision:0x2, MMIOReg:0xffffc200101c0000.
[15110.615622] FB phys addr: 0xd0000000, MC :0xf00000000, Total FB size :0x40000000.
[15110.615624] gart table MC:0xf0fb0d000, Physical:0xdfb0d000, size:0x1f2000.
[15110.615627] mc_node :FB, total 1 zones
[15110.615629] MC start:0xf00000000, Physical:0xd0000000, size:0xfd00000.
[15110.615631] Mapped heap -- Offset:0x0, size:0xfb0d000, reference count:9, mapping count:0,
[15110.615634] Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,
[15110.615637] Mapped heap -- Offset:0xfb0d000, size:0x1f3000, reference count:1, mapping count:0,
[15110.615639] mc_node :INV_FB, total 1 zones
[15110.615641] MC start:0xf0fd00000, Physical:0xdfd00000, size:0x30300000.
[15110.615643] Mapped heap -- Offset:0x302fb000, size:0x5000, reference count:1, mapping count:0,
[15110.615645] mc_node :GART_USWC, total 2 zones
[15110.615647] MC start:0x27530000, Physical:0x0, size:0x26b50000.
[15110.615650] Mapped heap -- Offset:0x10000, size:0x2000000, reference count:14, mapping count:0,
[15110.615652] mc_node :GART_CACHEABLE, total 3 zones
[15110.615654] MC start:0x10400000, Physical:0x0, size:0x17130000.
[15110.615656] Mapped heap -- Offset:0x200000, size:0x200000, reference count:1, mapping count:0,
[15110.615659] Mapped heap -- Offset:0x0, size:0x200000, reference count:3, mapping count:0,
[15110.615661] Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,
[15110.615663] Dump the trace queue.
[15110.615665] End of dump
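
For anyone triaging a similar log, here is a minimal sketch in Python of how the hang marker and the offending process could be picked out of saved kernel log text. It is not part of the driver or the SDK. The function names `split_entries` and `find_hangs` are hypothetical, and the sketch assumes the message layout seen in the trace above: the marker text "ASIC hang happened" followed shortly by a "Pid: ..., comm: ..." entry.

```python
import re

# Matches a kernel log timestamp such as "[15110.614875]".
# Stack-frame addresses like "[<ffffffffa00f23c9>]" do not match.
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d{6})\]")


def split_entries(log_text):
    """Split kernel log text into (timestamp, message) pairs.

    Works whether entries are one per line or run together on one line.
    """
    matches = list(TIMESTAMP.finditer(log_text))
    entries = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        # Collapse internal whitespace so wrapped entries become one message.
        message = " ".join(log_text[m.end():end].split())
        entries.append((float(m.group(1)), message))
    return entries


def find_hangs(entries):
    """Return (timestamp, pid, comm) for each hang marker found.

    Assumption: the "Pid: N, comm: NAME" entry appears within the three
    entries after the marker, as it does in the trace above.
    """
    hangs = []
    for i, (ts, msg) in enumerate(entries):
        if "ASIC hang happened" not in msg:
            continue
        pid, comm = None, None
        for _, following in entries[i + 1:i + 4]:
            m = re.match(r"Pid: (\d+), comm: (\S+)", following)
            if m:
                pid, comm = int(m.group(1)), m.group(2)
                break
        hangs.append((ts, pid, comm))
    return hangs


if __name__ == "__main__":
    sample = (
        "[15110.614875] [fglrx] ASIC hang happened "
        "[15110.614881] Pid: 11963, comm: acemd.ocl Tainted: P "
        "2.6.28-17-generic #58-Ubuntu [15110.614883] Call Trace:"
    )
    for ts, pid, comm in find_hangs(split_entries(sample)):
        print(f"hang at {ts}: pid={pid} comm={comm}")
```

Splitting on timestamps rather than on newlines means the same code handles a log that was pasted as a single run of text.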

genaganna
Journeyman III

Mjharvey,

Could you please post your application here or send it to streamdeveloper@amd.com?
