pjb7687
Adept I

Driver crashes with OpenCL?

Hello,

I am Jeongbin Park, the main developer of Cas-OFFinder (snugel/cas-offinder · GitHub; article: http://www.ncbi.nlm.nih.gov/pubmed/24463181).

Recently we have been experiencing crashes with the latest AMD driver (on Ubuntu Linux) when processing huge input data.

The symptom of the 'crash' is that Cas-OFFinder runs fine for several hours and then suddenly hangs.

After that, running any OpenCL program makes the terminal hang and stop responding.

Also, when I try to kill Cas-OFFinder or the newly started OpenCL processes, they cannot be killed, much like zombie processes, even with SIGKILL; the only way to terminate them is a system reboot.
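
A process that ignores SIGKILL is often in uninterruptible sleep (state D, typically blocked inside a kernel driver) rather than a true zombie (state Z). The sketch below is one possible way to check this from another terminal. It assumes the procps version of ps found on Ubuntu; classify_state and report are hypothetical helper names used only for illustration.

```shell
#!/bin/sh
# Map the first letter of the STAT column of `ps` to a short description.
classify_state() {
    case "$1" in
        D*) echo "uninterruptible sleep" ;;
        Z*) echo "zombie" ;;
        *)  echo "other" ;;
    esac
}

# List every process whose command name contains $1, with its state.
# Assumes procps `ps`, which provides the `stat` and `comm` columns.
report() {
    ps -e -o pid= -o stat= -o comm= | while read -r pid stat comm; do
        case "$comm" in
            *"$1"*) printf '%s %s %s (%s)\n' "$pid" "$stat" "$comm" "$(classify_state "$stat")" ;;
        esac
    done
}

# Example: report cas-offinder
```

If the stuck processes show up as "uninterruptible sleep", that would suggest they are waiting inside the kernel module rather than in user-space code.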

For your information, I designed the host-side program to split the input data into many chunks, so that the OpenCL kernels take only a relatively short time (several seconds) per chunk.

Full source code of latest Cas-OFFinder: snugel/cas-offinder at experimental · GitHub

Source code of OpenCL kernels can be found here: cas-offinder/cas-offinder.cl at experimental · snugel/cas-offinder · GitHub

Could you look at the source code and let me know why such weird behavior happens?

For your information, it occurs with AMD APP SDK 2.9.1 and also with 3.0 beta. Also, it does not crash with CPU.

Thanks,

Jeongbin

0 Likes
31 Replies
dipak
Staff

Thanks for reporting the problem.

Recently we have been experiencing crashes with the latest AMD driver (on Ubuntu Linux) when processing huge input data.

May I assume that the issue does not occur with earlier drivers, only with the latest one? [Or please specify the last Catalyst version where it worked fine.]

The symptom of the 'crash' is that Cas-OFFinder runs fine for several hours and then suddenly hangs.

Did you observe any particular pattern or dependency on external factors such as setup, input data, system load, etc.? Please provide your setup details such as OS, GPU, CPU, etc. We'll try to reproduce it.

BTW, is it possible to trigger the issue in less time? It would be helpful for our testing.

Regards,

0 Likes

First, I tested it with the default driver version shipped with Ubuntu 14.04.

I also tried it with the latest driver, but nothing changed.

We are also trying to use a commercial cluster computer (the Chundoong supercomputer) utilizing AMD graphics cards (HD 7970) and OpenCL, but the same problem occurs there as well.

The cluster uses RHEL 6.3 as its host OS.

I tested it with the HD 7870, HD 7970, and R9 290X, and all have the same issue.

Currently our server has two R9 290X cards with the latest AMD driver.

Unfortunately, the problem usually occurs during long-running analyses, but I am not sure.

I can provide the input data, but how can I upload it?

Hardware summary of our server:

  OS: Ubuntu 14.04 LTS

  CPU: Intel i7 4770k

  GPU: 2x AMD R9 290X

Thank you,

Jeongbin

0 Likes

Hi Jeongbin,

Thanks for the quick reply. By asking about the Catalyst version, I just wanted to confirm whether the issue is specific to this version or not.

Regarding the input data, you may attach the file here (if its size is within the limit) or upload it to a public site and share the link with us [if the data file is password protected, you can send me the password via a private message].

Regards,

0 Likes

First, you can download the input file from this link:

http://www.rgenome.net/static/targets.zip

And you also need the reference genome of Human from here:

http://www.rgenome.net/static/human_hg38.zip

After unzipping the two compressed files above, the directory structure should look like this:

./human_hg38/chr1.fa

./human_hg38/chr2.fa

....

./targets_1.txt

./targets_2.txt

...

Finally, you can run Cas-OFFinder as follows:

cas-offinder targets_1.txt G output.txt

(Of course, Cas-OFFinder must be installed on your system; you should be able to build it easily with CMake. Please test the 'experimental' branch of Cas-OFFinder [snugel/cas-offinder at experimental · GitHub]. I can also provide a compiled binary if you want.)

Please note that the first line of targets_??.txt is the path of the directory containing the genome sequences.
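
Since the genome directory is read from the first line of each targets file, a quick check before a long run could look like the sketch below. check_genome_dir is a hypothetical helper, and it assumes a relative path on that first line is resolved against the current working directory.

```shell
#!/bin/sh
# Print the genome directory named on the first line of a targets file,
# or fail if that directory does not exist.
# check_genome_dir is a hypothetical helper, not part of Cas-OFFinder.
check_genome_dir() {
    # Strip a trailing carriage return in case the file has Windows line endings.
    dir=$(head -n 1 "$1" | tr -d '\r')
    if [ -d "$dir" ]; then
        echo "$dir"
    else
        echo "genome directory not found: $dir" >&2
        return 1
    fi
}

# Example: check_genome_dir targets_1.txt
```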

You can try all 21 files in sequence, and then you may be able to reproduce the problem that I reported.
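
Running the 21 files in sequence can be scripted. The sketch below logs a timestamp before and after each run, so that after a hang and reboot the last log line shows which input was being processed. It assumes the files are named targets_1.txt to targets_21.txt in the current directory; CAS_OFFINDER and RUN_LOG are hypothetical variables introduced only for this example.

```shell
#!/bin/sh
# Run targets_1.txt .. targets_21.txt one after another on the GPU ("G"),
# logging start and end times so a hang can be located afterwards.
# CAS_OFFINDER and RUN_LOG are hypothetical variables for this sketch.
run_all() {
    bin="${CAS_OFFINDER:-cas-offinder}"
    log="${RUN_LOG:-run.log}"
    n=1
    while [ "$n" -le 21 ]; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') start targets_$n.txt" >> "$log"
        "$bin" "targets_$n.txt" G "output_$n.txt"
        status=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') end targets_$n.txt status=$status" >> "$log"
        # Stop at the first failure so the failing input is easy to identify.
        [ "$status" -eq 0 ] || return "$status"
        n=$((n + 1))
    done
}

# Example: run_all
```

A "start" line without a matching "end" line marks the run during which the machine stopped responding.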

Thank you,

Jeongbin

0 Likes

Thanks for providing the data files. We'll check and get back to you shortly.

0 Likes

Hi Jeongbin,

AMD has just released Catalyst 15.7 [display driver version 15.20.x]. It has many improvements and fixes compared to earlier ones. Could you please check the issue once using this driver?

As you already have a working setup, it will be quicker for you to verify than for me. If the issue still exists, we'll try it here.

Regards,

0 Likes

Hi Jeongbin,

Did you manage to check it with the latest 15.7 driver? If yes, what was your observation?

Regards,

0 Likes

I tested the driver for a week and haven't found any crash with it.

In addition, I found that the evaluation is about 20~30% faster than before. Very good!

Thank you for your answer.

Best,

Jeongbin

0 Likes

Thanks for your confirmation. It's really nice to hear that the latest driver is working fine.

Regards,

0 Likes

Hi,

Today I found the same problem again with the latest driver.

The process cannot be killed with the usual Linux kill command; after the kill signal is sent, it becomes a 'defunct' process.

One interesting thing is that when I tried to run the program again without arguments (normally it then shows the list of OpenCL devices installed on the system), the program didn't start properly (it showed nothing) and used 100% CPU.

It is a very rare event, and I think it became even less frequent after I updated to the latest driver.

Thank you.

Regards,

Jeongbin

0 Likes

Last time you said that it was working fine with Catalyst 15.7, and now it is occurring again. Did you modify or update anything in particular in between?

Regards,

0 Likes

No, I haven't made any modifications since I updated the driver to Catalyst 15.7.

I think the problem is just occurring less frequently than before. Because it occurs randomly, it is hard to say that it was working fine with the latest driver. However, I feel that it is more stable with the latest Catalyst 15.7, because at least it passed our one-week test.

For your information, we also have NVIDIA graphics cards for testing purposes, and they don't have this problem.

Maybe you can try our software to find what causes the problem. Its source code is available on GitHub (snugel/cas-offinder · GitHub).

You can also try an older version of Cas-OFFinder (snugel/cas-offinder at 8a39f3ea0c2daff23df578a151542f0b53a5ed80 · GitHub), because it generates a higher load on the GPU than the latest version of Cas-OFFinder. You can still download the example data files from the links below.

The input file:

http://www.rgenome.net/static/targets.zip

The reference genome of Human:

http://www.rgenome.net/static/human_hg38.zip

Regards,

Jeongbin

0 Likes

Hi Jeongbin,

My apologies for this delayed reply.

I'll try to reproduce it at my end. Meanwhile, if you have any updates, please share them with us.

Regards,

0 Likes

I followed your steps and ran the program through command line as: "./cas-offinder targets/targets_1.txt G output.txt"

After a long time, the program stopped with the following error message:

cas-offinder.png

P.S.: The size of "output.txt" was more than 750MB when the program exited.

Any suggestion?

Regards,

0 Likes

Could you please try it again in a 64-bit environment?

I'll try it on a 32-bit platform as soon as possible.

Best,

Jeongbin

0 Likes

I was using a 64-bit (Ubuntu 14.04) setup only. Do I need to specify any flag during the cmake or make build to generate a 64-bit executable? Last time I didn't specify any.

Regards,

0 Likes

Then you should have a 64-bit binary if no option is specified.

Could you try the commands below:

$ head -n 3 targets_1.txt > target_test.txt

$ cas-offinder target_test.txt G test_out.txt

Please let me know if you still get the same error.

Best,

Jeongbin

0 Likes

And I found one piece of (maybe) important information: currently we set GPU_MAX_ALLOC_PERCENT to 100.

$ echo $GPU_MAX_ALLOC_PERCENT

100

I don't know whether the same environment variable is set on the cluster we tried (Chundoong, http://chundoong.snu.ac.kr/); I will ask them shortly.

Best,

Jeongbin

0 Likes

It seems that the above steps run fine. Please find the (partial) output below:

Reading human_hg38/chr11.fa...

Sending data to devices...

Setting pattern to devices...

Chunk load started.

1 devices selected to analyze...

Finding pattern in chunk #1...

Comparing pattern #1 in chunk #1...

Reading human_hg38/chr13.fa...

Sending data to devices...

Setting pattern to devices...

Chunk load started.

1 devices selected to analyze...

Finding pattern in chunk #1...

Comparing pattern #1 in chunk #1...

Reading human_hg38/chr8.fa...

Sending data to devices...

Setting pattern to devices...

Chunk load started.

1 devices selected to analyze...

Finding pattern in chunk #1...

Comparing pattern #1 in chunk #1...

19.3648 seconds elapsed.

0 Likes

Then please try the experimental branch of Cas-OFFinder.

The changes in the new version include fixes for a few memory leaks, and I think one of them might affect the result.

By the way, we set two environment variables during boot (below are the contents of /etc/profile.d/OPENCL.sh):

export GPU_MAX_ALLOC_PERCENT=100

export GPU_USE_SYNC_OBJECTS=1
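
To compare two machines, it may help to print what a process would actually inherit for these two variables. The sketch below relies on printenv, which only reports exported variables; show_vars is a hypothetical helper name.

```shell
#!/bin/sh
# Print each named environment variable as NAME=value, or NAME=(unset)
# if it is not exported in the current environment.
# show_vars is a hypothetical helper used only in this sketch.
show_vars() {
    for name in "$@"; do
        value=$(printenv "$name") || value="(unset)"
        printf '%s=%s\n' "$name" "$value"
    done
}

show_vars GPU_MAX_ALLOC_PERCENT GPU_USE_SYNC_OBJECTS
```

Running the same lines on both the server and a cluster node would give directly comparable output.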

Best,

Jeongbin

0 Likes

One of the admins of the cluster says that they haven't set any environment variables on their compute nodes.

So maybe the environment variables are not directly related to the problem.

Best,

Jeongbin

0 Likes

I tried the experimental branch of Cas-OFFinder without setting any environment variables. After some time, I got the error shown below. However, I didn't observe any system hang or segfault.

cas-offinder_experimental.png

Regards,

0 Likes

The error is new to me; I haven't seen it before.

Could you let me know the environment of the test system, e.g. graphics card, CPU, etc.?

It looks like you are having trouble reproducing the hang because of the clEnqueueBuffer error, so I will try to fix that first.

After I fix the issue, I will test again, and if I see the hang once more, I will post it here.

Thank you!

Best,

Jeongbin

0 Likes

My setup details:

CPU: AMD FX(tm)-4100 Quad-Core Processor

GPU: Hawaii XT (R9 290X)

OS: Ubuntu 14.04 64bit

Latest Catalyst 15.7

APP SDK 3.0

0 Likes

Dear dipak,

I've carried out a month of testing with the experimental version of Cas-OFFinder; however, I couldn't reproduce the error you encountered.

Instead, yesterday I found that our production server had stopped again with the issue that I originally reported, and I had to force a reboot of the server. Again, the error occurs very rarely.

I've also been running a very long test (more than 2 months so far) with NVIDIA cards, and it has been fine up to now.

I suspect that the error you encountered may also be related to the driver; could you verify it again?

Please let me know if you find any flaws in the latest experimental version of Cas-OFFinder.

I would really appreciate your help.

Best regards,

Jeongbin

0 Likes

Please let me know which driver you used for your testing. I'll try that one.

0 Likes

Dear dipak,

Please refer to the information below.

$ dmesg | grep fglrx | grep module

[   16.923573] fglrx: module license 'Proprietary. (C) 2002 - ATI Technologies, Starnberg, GERMANY' taints kernel.

[   16.928154] fglrx: module verification failed: signature and/or  required key missing - tainting kernel

[   16.933152] <6>[fglrx] module loaded - fglrx 15.20.3 [Jun 22 2015] with 2 minors
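
For reporting, the version number can be pulled out of the "module loaded" line. This is a small sketch that assumes the log line format shown above; fglrx_version is a hypothetical helper name.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the fglrx version found on the
# "module loaded" line, if any. Assumes the line format shown in this post.
# fglrx_version is a hypothetical helper used only in this sketch.
fglrx_version() {
    sed -n 's/.*module loaded - fglrx \([0-9][0-9.]*\).*/\1/p'
}

# Example: dmesg | fglrx_version
```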

Thank you very much for your help.

0 Likes

Okay. I'll try and get back to you shortly.

0 Likes

Hi,

Sorry for this delayed reply.

The latest experimental branch seems to run fine with the latest Catalyst 15.9 (15.201.1151). The command below has been running fine for more than one day.

./cas-offinder targets/targets_1.txt G output.txt

Please suggest how to reproduce the error you mentioned.

Regards,

0 Likes

Update:

After almost two days, the above program finished gracefully without any error.

0 Likes

Dear dipak,

I found that the new Crimson driver solves the issue.

Since we updated the driver, we have not had this issue anymore.

Thank you,

Jeongbin

0 Likes