ebfe
Journeyman III

Hard-lockup while/after calling GPU through OpenCL

Hi,

I'm the maintainer of Pyrit and am currently trying to make my OpenCL code work smoothly with AMD's implementation. I'm running Ubuntu 9.04 with the Catalyst 9.12 hotfix driver on an HD4850.

While I managed to jump through the hoops required to get the code working, I currently experience hard lockups when using OpenCL on the GPU: the system completely stops responding, with no choice but to hit the power switch. This may happen while the code is executing or even after the process has successfully exited. It seems that once Pyrit has used OpenCL on the GPU device, the system is prone to lock up within the next 30 seconds or so. This does not happen when using the CPU device only.

Since I can't debug hard lockups myself, you're welcome to take a look at the code. You'll find the most important OpenCL-related code in the function calc_pmklist().
The only source of the problem I can currently think of is that the OpenCL code is not called from the main thread of the process; this may cause locking issues in the OpenCL library. Maybe you can guide me on how to gather more debug information for this particular problem?

On the plus side: the OpenCL code running on Stream 2.0 is about twice as fast as (almost) the same code running on Stream 1.4 (via Brook+). My HD4850 performs somewhere between a GTX280 and a GTX295 running CUDA.

0 Likes
12 Replies
nou
Exemplar

Well, it is possible that you are hanging the GPU in a loop. On Windows there is a watchdog that resets the GPU if a kernel takes longer than about 5 seconds to execute. On Linux there is no such thing. It may still be possible to log in via SSH; if that works, you know that your kernel has hung the GPU in a loop. Otherwise it is a driver bug.

But if the CPU version does not hang, then it is most likely a bug.

0 Likes
ebfe
Journeyman III

SSH connections drop just like any other connection: the machine completely hangs 😉

0 Likes
ebfe
Journeyman III

I've done some more testing (hail journaling filesystems...) and can narrow the problem down somewhat further. The machine has been running Pyrit for several hours today without problems. This held true until I killed Pyrit (e.g. with Ctrl+C). A complete system deadlock happened some time after that. The deadlock appears sooner if Pyrit is restarted (e.g. after ~10 seconds) and is merely delayed on an idle system (e.g. by a few minutes).
The system *always* deadlocks if Pyrit is killed. The system stays stable as long as Pyrit is not interrupted (or can exit on its own).

I think the problem is caused by Pyrit getting terminated while calling the OpenCL API. The relevant OpenCL code is executed in a daemonized thread, which is simply killed by the OS after the Python interpreter exits.

This means that ocldevice_dealloc() is never executed and the command-queue, the kernel, the program and the context are never released. That should not be a problem as those resources should be process-specific and our process is just about to get terminated.

However, this also means that the thread executing OpenCL code for Pyrit may be somewhere within the OpenCL API itself when the process gets killed; this includes clEnqueueNDRangeKernel() and all other OpenCL functions called by calc_pmklist(). I strongly suspect that one of the OpenCL functions in calc_pmklist() leaves the driver's global memory corrupted if its calling process is terminated.

0 Likes

I can confirm a complete system hang if I interrupt a program while a kernel is running and then run the program again. It hangs immediately; even SysRq does not work. I ran my own program with a kernel that takes almost 4 seconds to finish, and I ran the kernel 20 times. (During that time the screen freezes and only the mouse pointer moves, but SSH responds normally, so only the X server is affected.)

So to prevent this you should catch SIGINT and SIGTERM and call clFinish() in the handler. I do not release resources such as the command queue, kernel and buffers, but that does not cause the system to hang.
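A minimal Python sketch of the signal-catching idea above. It is not code from Pyrit or from my test program: `GracefulStop`, `run_batches`, `enqueue_kernel` and `finish_queue` are hypothetical names, and `finish_queue` merely stands in for whatever wraps clFinish() in the host program. Instead of doing work inside the handler, this variant only records the signal and drains the queue at the next safe point between kernel launches.

```python
import signal


class GracefulStop:
    """Records SIGINT/SIGTERM instead of letting them end the process mid-kernel."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread; replaces the default handlers.
        signal.signal(signal.SIGINT, self._handler)
        signal.signal(signal.SIGTERM, self._handler)

    def _handler(self, signum, frame):
        # Only set a flag here; the real shutdown happens in run_batches().
        self.requested = True


def run_batches(batches, enqueue_kernel, finish_queue, stop):
    """Launch one kernel per batch until done or until a stop was requested.

    enqueue_kernel and finish_queue are hypothetical callables supplied by the
    host program; finish_queue plays the role of clFinish().
    Returns the number of batches that were launched.
    """
    launched = 0
    for batch in batches:
        if stop.requested:
            break
        enqueue_kernel(batch)
        launched += 1
    # Wait for everything already queued before the process is allowed to exit.
    finish_queue()
    return launched
```

In a real program one would call `stop.install()` once at startup and then pass `stop` to the loop that feeds the GPU.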

I even tried SmallptGPU, with the same result.

0 Likes

nou,
Can you post a simple test case that shows this problem so we can look at it and get it fixed?
0 Likes

Run any OpenCL application. I tried it with SmallptGPU http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html

Run the program from a terminal and hit Ctrl+C to terminate it. Then run the program again and it hangs. I think it must be interrupted while a kernel is running.

0 Likes
ebfe
Journeyman III

Pyrit's scheduling targets an execution time (wall clock time) of 3 seconds. This is a *very* long execution time for an OpenCL-kernel.

If the assumptions made above are correct and the error is caused by corrupted global state, that would explain why the error is more likely to occur with Pyrit than with other code that has execution times of a few milliseconds per call.
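How the scheduler arrives at that target is not described in this thread. As a rough illustration only, a proportional adjustment like the following would steer a batch size towards a wall-clock target; `next_batch_size` and all of its parameters are hypothetical names, not Pyrit's actual code. Lowering `target_seconds` would shrink the window in which a kill can land inside a running kernel.

```python
def next_batch_size(current_size, elapsed_seconds, target_seconds=3.0,
                    min_size=1, max_size=1 << 20):
    """Scale the batch size so the next kernel call takes roughly target_seconds.

    Assumes run time grows about linearly with batch size, which is an
    assumption of this sketch and not something measured here.
    """
    if elapsed_seconds <= 0:
        # Too fast to measure: simply double, capped at max_size.
        return min(current_size * 2, max_size)
    scaled = int(current_size * target_seconds / elapsed_seconds)
    return max(min_size, min(scaled, max_size))
```

For example, a batch of 1000 items that took 6 seconds would be halved for the next call.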

0 Likes
ebfe
Journeyman III

I've changed the code so that the process does not terminate (e.g. due to SIGINT or SIGTERM) while the worker thread is within an OpenCL library call.

The system crashes went away.
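The general shape of such a guard, as a simplified Python sketch rather than the actual change: the worker is a non-daemon thread, shutdown only sets a flag, and the main thread joins the worker so the process cannot exit while a device call is in flight. `GuardedWorker` and `device_call` are hypothetical names; `device_call` stands in for the OpenCL work done in calc_pmklist().

```python
import threading


class GuardedWorker:
    """Runs device calls in a non-daemon thread that is never cut off mid-call."""

    def __init__(self, device_call, work_items):
        self._device_call = device_call      # hypothetical stand-in for the OpenCL call
        self._work_items = list(work_items)
        self._stop = threading.Event()
        self.completed = []
        self.cleaned_up = False
        # daemon=False: the interpreter waits for this thread instead of killing it.
        self._thread = threading.Thread(target=self._run, daemon=False)

    def start(self):
        self._thread.start()

    def _run(self):
        try:
            for item in self._work_items:
                if self._stop.is_set():
                    break                     # only stop between calls, never inside one
                self.completed.append(self._device_call(item))
        finally:
            # Place where queue, kernel, program and context would be released.
            self.cleaned_up = True

    def shutdown(self):
        """Ask the worker to stop and block until the in-flight call has returned."""
        self._stop.set()
        self._thread.join()

    def is_alive(self):
        return self._thread.is_alive()
```

A SIGINT or SIGTERM handler in the main thread would then call `shutdown()` before letting the process exit.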

0 Likes

Originally posted by: nou

Run any OpenCL application. I tried it with SmallptGPU http://davibu.interfree.it/opencl/smallptgpu/smallptGPU.html

Run the program from a terminal and hit Ctrl+C to terminate it. Then run the program again and it hangs. I think it must be interrupted while a kernel is running.

I can confirm this as well. I'm using Stream SDK 2.0 and the Catalyst 10.1 drivers on Ubuntu 9.10 (with the 2.6.28-11 kernel from 9.04) and an HD 5870 card.

Run once = OK. Next run of the same program = hard lockup.

0 Likes

Does the problem show up with 10.2 catalyst too?

0 Likes

Yes, although it is not as easy to trigger the lock-up now.

0 Likes

I am able to reproduce the issue. Developers have been informed about the issue and they are looking at it. Thanks for reporting.

0 Likes