Archives Discussions

gat3way · ‎11-01-2010

Hello all,

I am experiencing some very weird hard lockups when running my OpenCL kernel on the GPU. On the CPU it runs correctly and gives the correct results. On the GPU it sometimes runs correctly and sometimes hangs the system completely, to the point that it requires a reboot. The strange thing is that this apparently depends on the global_work_size: sometimes the program does not crash with a larger global work size, while it hangs with a smaller one.

The kernel overall does one read from and five writes to global memory. The rest consists of arithmetic/bitwise operations on uint4 variables local to the kernel (I mean __private ones, I believe).

Since there is no SKA for Linux and my debugging abilities are limited, I tried commenting out code to find out what exactly causes the problem. However, the behavior is rather erratic and depends on the global work size. Basically, the hangs occur once a certain number of bitwise/arithmetic operations are performed on local variables.

My grid is one-dimensional and I pass NULL as the local_work_size parameter, so that OpenCL should choose the most appropriate value depending on register pressure and the like. My theory is that, for some reason, the OpenCL implementation does not calculate the register usage properly, so the local work size is not calculated properly either, and that leads to hard lockups (?!?).

Anyway, I did indeed solve the problem by providing a hardcoded local_work_size value (chosen so that the global_work_size is divisible by it). Of course, performance dropped by about 20-30% because of that, yet that's acceptable to me.
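To illustrate the workaround, here is a minimal Python sketch of how such a value could be chosen. The helper names `pick_local_size` and `round_up_global` and the default cap of 64 are assumptions made for this example, not taken from the actual host code; the real upper bound should come from the device and kernel limits that OpenCL reports.

```python
def pick_local_size(global_size, max_local=64):
    """Largest divisor of global_size that does not exceed max_local.

    max_local=64 is an illustrative default, not a queried device limit.
    """
    if global_size < 1 or max_local < 1:
        raise ValueError("sizes must be positive")
    for candidate in range(min(global_size, max_local), 0, -1):
        if global_size % candidate == 0:
            return candidate


def round_up_global(global_size, local_size):
    """Smallest multiple of local_size that is >= global_size."""
    return ((global_size + local_size - 1) // local_size) * local_size
```

Padding the global size instead of shrinking the local size keeps the work-groups full, but the kernel then needs a bounds check so that the extra work-items do nothing.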

I am using a Radeon HD4670 and I am wondering whether this is an OpenCL-related issue or a hardware one. I will buy a 6870 card in the next 1-2 weeks and will do some testing to see whether this can be reproduced on that hardware as well.

I can post the kernel code, it's just about 100-200 lines; however, the host code needed to properly set up all the parameters is much longer than that. I can try to write a simplified test case though.

cjang · ‎11-01-2010

Try waiting several minutes (up to ten) to see if your system becomes responsive again. I have encountered symptoms similar to those you describe. The system hangs on specific specializations of a parameterized kernel model/template. As I use auto-tuning, the application searches over thousands of different kernels and runs into this.

In my particular case, this is a driver issue. It also happens with memory buffer based kernels. When using images, it never happens. Your speculation about register usage has some merit as array subscript arithmetic uses registers.

If you can get back into your system, check the kernel log. You may see a message indicating that the driver hung. I'm not sure what sort of watchdog causes this to time out (I run Ubuntu 10.04 x86_64). But it does, and then I can ssh into the system and use it normally, although the X server and GPU/driver are now in a bad state (they appear hung). This is where having a headless system may be an advantage. You may have to switch to a different virtual console and log in again in order to do anything.

To give more background, a year ago with the now very old SDK v2.0 / Catalyst 9.12, this failure mode never happened. It started for me with SDK v2.1 / Catalyst 10.4. However, performance jumped immediately by 20% to 30%. There was a trade of some stability for higher performance.

Another thing is that the 20-30% difference you see can likely be recovered with careful tuning. There are probably other nearby kernel specializations in the design space that reach the same peak without stepping over the limit and causing a failure. This is my experience from auto-tuning.

gat3way · ‎11-01-2010

Hello and thanks for your reply.

I've tried waiting for several minutes, but the system did not recover. I was unable to ssh to the machine (I did not check whether it responds to ICMP ping though; I will try that next time). Browsing /var/log/syslog, I see no kernel panics. I haven't checked the Xorg log for weird errors though.

I am using Catalyst 10.9 and SDK2.2.

It looks like the 4670 does not support images.

cjang · ‎11-01-2010

Yes, AFAIK, images require the Evergreen 5xxx architecture. They do not work on older GPUs. Another thing: you mention acquiring a 6870 card soon. You may wish to get a 5870 as it supports double precision.

The hangs I experienced were completely deterministic and repeatable. At first, I was tuning by hand and kept notes of the kernel parameters that caused the system to hang. This was not scalable so then I had to implement a memo which kept track of all kernel parameters, good and bad, automatically. In this way, it became practical to tune kernels.
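One way such a memo could work is sketched below in Python. This is an assumption about the general idea, not the actual implementation: the `KernelMemo` class, its method names, and the JSON file format are all made up for illustration. The point is to mark a parameter set as pending on disk before launching the kernel, so that a hang followed by a reboot leaves evidence behind.

```python
import json
import os


class KernelMemo:
    """Remembers which kernel parameter sets ran fine and which did not."""

    def __init__(self, path):
        self.path = path
        self.entries = {}
        if os.path.exists(path):
            with open(path) as f:
                self.entries = json.load(f)
        # A "pending" entry means an earlier run never reported back,
        # most likely because the machine hung: count it as bad.
        for key in list(self.entries):
            if self.entries[key] == "pending":
                self.entries[key] = "bad"

    @staticmethod
    def _key(params):
        # Stable string key for a dict of parameters.
        return json.dumps(params, sort_keys=True)

    def _save(self):
        # Write to a temporary file and sync it before renaming, to reduce
        # the chance that a hang right after begin() loses the mark.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.entries, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def status(self, params):
        """Return "good", "bad", "pending" or "unknown"."""
        return self.entries.get(self._key(params), "unknown")

    def begin(self, params):
        """Call just before launching the kernel with these parameters."""
        self.entries[self._key(params)] = "pending"
        self._save()

    def finish(self, params, ok):
        """Call after the kernel returned; ok says whether it worked."""
        self.entries[self._key(params)] = "good" if ok else "bad"
        self._save()
```

A tuning loop would skip any parameter set whose status is "bad", call `begin` before the launch and `finish` after it, and so never walk into the same hang twice.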

It's not an ideal situation. But my guess is that you can find working kernels with the extra 20-30% performance if you do enough tuning. The kernels are "out there". You just have to find them and then know where they are.

gat3way · ‎11-01-2010

Double precision is good, but I don't need it as all I do is 32-bit arithmetic/bitwise ops. In fact, I don't use any floating point stuff at all.

I think part of the problem is that we don't have something like SKA on Linux. It all boils down to trial and error, and this sucks. To all AMD people: we really need profiling tools on Linux....

BTW another thing is that while testing my program with valgrind, I see lots of issues in the cal/opencl libraries, mostly related to out-of-bounds memory reads/writes. Upgrading my libstdc++ miraculously eliminated some of those, but still I see those.

cjang · ‎11-01-2010

> while testing my program with valgrind, I see lots of issues in the cal/opencl libraries

I am not using Valgrind but have used Purify in the past. My experience is that a lot of semantic information is lost in binary libraries. The result is many false positives during analysis. The cases I remember were inside Iona's CORBA libraries. This makes some intuitive sense as an ORB must do a lot of aliasing which will tend to confound a compiler.

nou · ‎11-01-2010

valgrind: Micah stated somewhere that valgrind gets confused because some memory is allocated in user space and freed in kernel space, and vice versa (IMHO it is a driver issue in fglrx). This makes valgrind useless with any program which uses fglrx in any way. For example, OpenGL programs have the same issue.

gat3way · ‎11-01-2010

Userspace code free'ing kmalloc()'d memory?

P.S. Nope, no ICMP replies either. Nevertheless, the local work size hack does the job well. I just need to retry this on the 6870. If it turns out to be a 4xxx issue, I will just treat it as a corner case. I understand it is an ancient platform nowadays.

nou · ‎11-02-2010

I don't know exactly why valgrind reports these memory leaks. But in short, valgrind doesn't understand what fglrx is doing, so it reports memory leaks and accesses outside array boundaries.

HarryH · ‎11-02-2010

It is possible to use valgrind if you use suppression files to suppress the messages caused by libraries etc. See the valgrind documentation. This way you will only see stuff caused by your own code.

saleel · ‎11-04-2010

How many GPRs are you using in your kernel?

gat3way · ‎11-04-2010

I'm afraid I can't answer that. Is there some way to check this on linux?

nou · ‎11-04-2010

export GPU_DUMP_DEVICE_KERNEL=3

Run your program, then open the file name_of_kernel_Cypress.isa, scroll down, and find this:

SQ_PGM_RESOURCES:NUM_GPRS     = 18 // number of registers
SQ_PGM_RESOURCES:STACK_SIZE           = 2 // this should be the depth of the branch stack
SQ_PGM_RESOURCES:PRIME_CACHE_ENABLE   = 1
;SQ_PGM_RESOURCES_2      = 0x000000C0
SQ_LDS_ALLOC:SIZE        = 0x00000600 // size of static __local memory in hex, divided by 4; here this is 6144 bytes
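
Reading these values by hand gets tedious with many kernels, so here is a small Python sketch that pulls them out of a dump. It assumes the dump consists of `NAME = value` lines exactly like the ones quoted in this thread; the function names `parse_isa_fields` and `lds_bytes` are made up for the example, and the real .isa files may contain other layouts that this does not handle.

```python
import re

# Matches lines such as "SQ_PGM_RESOURCES:NUM_GPRS     = 18 // comment"
# or ";SQ_PGM_RESOURCES_2      = 0x000000C0".
_FIELD = re.compile(
    r"^\s*;?\s*([A-Za-z_][\w:]*)\s*=\s*(0[xX][0-9A-Fa-f]+|\d+)"
)


def parse_isa_fields(text):
    """Return {field name: integer value} for every 'NAME = value' line."""
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            name, value = match.groups()
            if value.lower().startswith("0x"):
                fields[name] = int(value, 16)
            else:
                fields[name] = int(value)
    return fields


def lds_bytes(fields):
    """Static __local memory in bytes; the dump counts units of 4 bytes."""
    return fields.get("SQ_LDS_ALLOC:SIZE", 0) * 4
```

For the dump above, `SQ_LDS_ALLOC:SIZE` is 0x600 = 1536 units, and 1536 * 4 gives the 6144 bytes mentioned in the comment.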

saleel · ‎11-04-2010

Please also provide MaxScratchRegsNeeded. My guess is that if MaxScratchRegsNeeded != 0, then it's a bug that will likely be fixed in the next Catalyst release.

gat3way · ‎11-04-2010

SQ_PGM_RESOURCES:NUM_GPRS = 28

SQ_PGM_RESOURCES:STACK_SIZE = 3

MaxScratchRegsNeeded = 4

saleel · ‎11-04-2010

This bug has been fixed, and the fix will ship in a later Catalyst release. It was caused by scratch spilling being handled incorrectly on the HD4670.

gat3way · ‎11-05-2010

I'm waiting for 10.11

eklund_n · ‎11-05-2010

I'm experiencing just the same behavior on an HD5870. Are you sure it's just the HD4670?

Catalyst 10.9/SDK 2.2

Will export GPU_DUMP_DEVICE_KERNEL and report gpr and scratch reg count as soon as the system starts responding again.

edit: system didn't recover, had to reboot.

ok, my counts:
SQ_PGM_RESOURCES:NUM_GPRS = 30
SQ_PGM_RESOURCES:STACK_SIZE = 4
MaxScratchRegsNeeded = 0

The thing is that this kernel only spawns 4 work-items, all in the same work-group: globalSize = localSize = 4. The kernel does a for-loop over all the input data.
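
For scale, here is a rough Python sketch of how few SIMD lanes such a launch occupies. It assumes 64 work-items per wavefront, the size usually quoted for the HD5870; the constant and the function names are illustrative only, and the sketch says nothing about why a load figure might read high.

```python
import math

WAVEFRONT_SIZE = 64  # assumption: work-items per wavefront on this GPU


def wavefronts_per_group(local_size, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts needed to hold one work-group."""
    return math.ceil(local_size / wavefront_size)


def lane_utilization(local_size, wavefront_size=WAVEFRONT_SIZE):
    """Fraction of wavefront lanes that carry a real work-item."""
    lanes = wavefronts_per_group(local_size, wavefront_size) * wavefront_size
    return local_size / lanes
```

With localSize = 4 this gives one wavefront in which 4 of 64 lanes, about 6%, do useful work.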

Even though it's only 1 wavefront, aticonfig --od-getclocks reports 99-100% load on the HD5870. Why is that?

edit 2:

Found this in Xorg.0.log, around 100 lines of the same thing:

(WW) fglrx(0): ADL handler failure: Could not find adapter at Bus ID 0:0:0
it seems that the video card is lost from the OS.

and this in kern.log:

Nov 5 09:58:05 opencl-devburk kernel: [1378621.152300] BUG: soft lockup - CPU#5 stuck for 61s! [clmain:29276]

now the video card is found again by Xorg and I can use the desktop (I have stopped the OpenCL program).
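
For anyone who wants to check their own logs for the same symptom, here is a minimal Python sketch that extracts soft lockup entries from kernel log text. The regular expression is written against the single kern.log line quoted above, so treating that as the general format is an assumption; `find_soft_lockups` is a made-up helper name.

```python
import re

# Written against: "BUG: soft lockup - CPU#5 stuck for 61s! [clmain:29276]"
_SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^:\]]+):(\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup line found in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = _SOFT_LOCKUP.search(line)
        if match:
            cpu, seconds, process, pid = match.groups()
            hits.append({
                "cpu": int(cpu),
                "seconds": int(seconds),
                "process": process,
                "pid": int(pid),
            })
    return hits
```

Feeding it the contents of kern.log and filtering on the process name shows whether the OpenCL program was the task that got stuck.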


Weird lockups when using GPU device on linux