31 Replies Latest reply on Mar 16, 2016 9:38 AM by pjb7687

    Driver crashes with OpenCL?

    pjb7687

      Hello,

       

      I am Jeongbin Park, the main developer of Cas-OFFinder (snugel/cas-offinder · GitHub, article: http://www.ncbi.nlm.nih.gov/pubmed/24463181).

      Recently we have been experiencing crashes with the latest AMD driver (on Ubuntu Linux) when processing huge input data.

       

      The symptom of the 'crash' is that Cas-OFFinder runs fine for several hours and then suddenly hangs.

      After that, running any OpenCL program makes the terminal hang and stop responding.

       

      Also, when I try to kill Cas-OFFinder or the newly created OpenCL processes,

      they cannot be killed, much like zombie processes - not even with SIGKILL - and the only way to terminate them is to reboot the system.
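A process that survives SIGKILL is usually either a zombie or stuck in uninterruptible sleep inside the kernel, so it can help to record its state code before rebooting. The sketch below is not part of Cas-OFFinder; `describe_state` and `show_state` are hypothetical helper names, and it assumes a `ps` that supports `-o stat=` (as procps on Linux does).

```shell
#!/bin/sh
# Hypothetical helper: map a `ps` STAT code (e.g. "D+", "Zs") to a description.
describe_state() {
    case "$1" in
        D*) echo "uninterruptible sleep (blocked in the kernel; signals, including SIGKILL, are not acted on until it wakes)" ;;
        Z*) echo "zombie (already exited; waiting for its parent to reap it)" ;;
        R*) echo "running or runnable" ;;
        S*) echo "interruptible sleep" ;;
        T*) echo "stopped" ;;
        *)  echo "other state: $1" ;;
    esac
}

# Hypothetical helper: look up the state of one PID and describe it.
show_state() {
    pid=$1
    stat=$(ps -o stat= -p "$pid" 2>/dev/null | tr -d ' ')
    if [ -z "$stat" ]; then
        echo "PID $pid: no such process"
        return 1
    fi
    echo "PID $pid: $stat -> $(describe_state "$stat")"
}
```

For example, `show_state 12345` (with the PID of the stuck process) prints the state code and its meaning. A `D` state would point at a call that never returns from the driver, but that is only one possible explanation for the symptom described here.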

       

      For your information, I designed the host-side program to split the input data into many chunks,

      so that the OpenCL kernels run for a relatively short time (several seconds) per chunk.

       

      Full source code of latest Cas-OFFinder: snugel/cas-offinder at experimental · GitHub

      Source code of OpenCL kernels can be found here: cas-offinder/cas-offinder.cl at experimental · snugel/cas-offinder · GitHub

       

      Could you look at the source code and let me know why such weird behavior happens?

      For your information, it occurs with AMD APP SDK 2.9.1 and also with 3.0 beta. Also, it does not crash when running on the CPU.

       

       

      Thanks,

      Jeongbin

        • Re: Driver crashes with OpenCL?
          dipak

          Thanks for reporting the problem.

          Recently we have been experiencing crashes with the latest AMD driver (on Ubuntu Linux) when processing huge input data.

          May I assume that the issue does not occur with earlier drivers, only with the latest one? [Otherwise, please specify the last Catalyst version where it worked fine.]

           

          The symptom of the 'crash' is that Cas-OFFinder runs fine for several hours and then suddenly hangs.

          Did you observe any particular pattern or dependency on external factors such as setup, input data, system load, etc.? Please provide your setup details such as OS, GPU, CPU, etc. We'll try to reproduce it.

          BTW, is it possible to trigger the issue in less time? It would be helpful for our testing.

           

          Regards,

            • Re: Driver crashes with OpenCL?
              pjb7687

              First I tested it with the default driver version shipped with Ubuntu 14.04.

              I also tried it with the latest driver, but nothing changed.

               

              We are also trying to use a commercial cluster computer (the Chundoong supercomputer) that uses AMD graphics cards (HD 7970) and OpenCL, and the same problem occurs there as well.

              The cluster uses RHEL 6.3 as its host OS.

               

              I tested it with the HD 7870, HD 7970, and R9 290X, and all of them have the same issue.

              Currently our server has two R9 290X cards with the latest AMD driver.

               

              Unfortunately, the problem usually occurs only during long-running analyses, but I am not sure...

              I can provide input data but how can I upload them?

               

              Hardware summary of our server:

                OS: Ubuntu 14.04 LTS

                CPU: Intel i7 4770k

                GPU: 2x AMD R9 290X

               

               

              Thank you,

              Jeongbin

                • Re: Driver crashes with OpenCL?
                  dipak

                  Hi Jeongbin,

                  Thanks for the quick reply. By asking about the Catalyst version, I just wanted to confirm whether the issue is specific to this version or not.

                  Regarding the input data, you may attach the file here (if its size is within the limit) or upload it to a public site and share the link with us [if the data file is password protected, you can send me the password via a private message].

                   

                  Regards,

                    • Re: Driver crashes with OpenCL?
                      pjb7687

                      First, you can download the input file from this link:

                      http://www.rgenome.net/static/targets.zip

                       

                      You also need the human reference genome from here:

                      http://www.rgenome.net/static/human_hg38.zip

                       

                      First you need to unzip the above two compressed files,

                      and then the directory structure should be like below:

                       

                      ./human_hg38/chr1.fa

                      ./human_hg38/chr2.fa

                      ....

                      ./targets_1.txt

                      ./targets_2.txt

                      ...

                       

                      Finally you can run Cas-OFFinder like below:

                      cas-offinder targets_1.txt G output.txt

                       

                      (Of course you need to have Cas-OFFinder installed on your system; you can probably build it easily with CMake. Please test the 'experimental' branch of Cas-OFFinder [snugel/cas-offinder at experimental · GitHub]. I can also provide a compiled binary if you want.)

                       

                      Please note that the first line of each targets_??.txt file is the path of the directory containing the genome sequences.

                      You can try all 21 files in sequence, and you may then run into the problem that I reported.
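Running the 21 files one after another can be scripted. The following is a minimal sketch, assuming the file names and the `G` device argument from the command above; the `CAS_OFFINDER` variable, the `run_all_targets` function, and the per-run `output_N.txt` names are illustrative choices, not part of Cas-OFFinder.

```shell
#!/bin/sh
# Path to the binary; override by setting CAS_OFFINDER before sourcing this.
CAS_OFFINDER=${CAS_OFFINDER:-cas-offinder}

# Run targets_FIRST.txt .. targets_LAST.txt in sequence (defaults: 1..21).
# Stops at the first run that exits with a non-zero status.
run_all_targets() {
    first=${1:-1}
    last=${2:-21}
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "=== targets_$i.txt ==="
        "$CAS_OFFINDER" "targets_$i.txt" G "output_$i.txt" || {
            echo "run $i failed with exit status $?" >&2
            return 1
        }
        i=$((i + 1))
    done
}
```

Calling `run_all_targets` from the directory that contains `human_hg38/` and the `targets_*.txt` files runs all 21 inputs. Note that a hang like the one reported would not show up as a failed exit status; the loop would simply stop making progress.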

                       

                       

                      Thank you,

                      Jeongbin

                        • Re: Driver crashes with OpenCL?
                          dipak

                          Thanks for providing the data files. We'll check and get back to you shortly.

                            • Re: Driver crashes with OpenCL?
                              dipak

                              Hi Jeongbin,

                              AMD has just released Catalyst 15.7 [display driver version 15.20.x]. It has many improvements and fixes compared to earlier ones. Could you please check the issue once using this driver?

                              As you already have a working setup, it'll be quicker for you to verify than for me. If the issue still exists, we'll try it here.

                               

                              Regards,

                              • Re: Driver crashes with OpenCL?
                                dipak

                                Hi Jeongbin,

                                Did you manage to check it with the latest 15.7 driver? If yes, what was your observation?

                                 

                                Regards,

                                  • Re: Driver crashes with OpenCL?
                                    pjb7687

                                    I tested the driver for a week and haven't seen any crash with it.

                                    In addition, I found that the evaluation is about 20~30% faster than before. Very good!

                                     

                                    Thank you for your answer.

                                     

                                     

                                    Best,

                                    Jeongbin

                                      • Re: Driver crashes with OpenCL?
                                        dipak

                                        Thanks for your confirmation. It's really nice to hear that the latest driver is working fine.

                                         

                                        Regards,

                                          • Re: Driver crashes with OpenCL?
                                            pjb7687

                                            Hi,

                                             

                                            Today I found the same problem again with the latest driver.

                                            The process cannot be killed with the usual Linux kill command; after the kill signal is sent, it becomes a 'defunct' process.

                                             

                                            One interesting thing is that when I tried to run the program again without arguments (normally it then shows the list of OpenCL devices installed on the system),

                                            the program didn't start properly (it showed nothing) and used 100% CPU.

                                             

                                            It is a very rare event, and I think it became even less frequent after I updated to the latest driver.

                                            Thank you.

                                             

                                             

                                            Regards,

                                            Jeongbin

                                              • Re: Driver crashes with OpenCL?
                                                dipak

                                                You said last time that it was working fine with Catalyst 15.7, and now it's recurring. Did you modify or update anything in particular in between?

                                                 

                                                Regards,

                                                  • Re: Driver crashes with OpenCL?
                                                    pjb7687

                                                    No, I haven't made any modifications since I updated the driver to Catalyst 15.7.

                                                    I think the problem is just occurring less frequently than before. Because it is a randomly occurring event, it is hard to say that it was working fine with the latest driver. However, I feel that it is more stable with the latest Catalyst 15.7, because at least it passed our one-week test.

                                                     

                                                    For your information, we also have NVIDIA graphics cards for testing purposes, and they don't have this problem.

                                                     

                                                    Maybe you can try our software to find what causes the problem. Its source code is available on GitHub (snugel/cas-offinder · GitHub).

                                                    You can also try an older version of Cas-OFFinder (snugel/cas-offinder at 8a39f3ea0c2daff23df578a151542f0b53a5ed80 · GitHub), because it generates a higher load on the GPU than the latest version of Cas-OFFinder. You can still download the example data files from the links below.

                                                     

                                                    The input file:

                                                    http://www.rgenome.net/static/targets.zip

                                                     

                                                    The reference genome of Human:

                                                    http://www.rgenome.net/static/human_hg38.zip

                                                     

                                                     

                                                    Regards,

                                                    Jeongbin

                                                      • Re: Driver crashes with OpenCL?
                                                        dipak

                                                        Hi Jeongbin,

                                                        My apologies for this delayed reply.

                                                        I'll try to reproduce it at my end. Meanwhile, if you have any updates, please share them with us.

                                                         

                                                        Regards,

                                                        • Re: Driver crashes with OpenCL?
                                                          dipak

                                                          I followed your steps and ran the program through command line as: "./cas-offinder targets/targets_1.txt G output.txt"

                                                          After a long time, the program stopped with the following error message (attached image):

                                                          cas-offinder.png

                                                           

                                                          P.S.: "output.txt" was larger than 750 MB when the program exited.

                                                           

                                                          Any suggestions?

                                                           

                                                           

                                                          Regards,

                                                            • Re: Driver crashes with OpenCL?
                                                              pjb7687

                                                              Could you please try it again in a 64-bit environment?

                                                              I'll try it on a 32-bit platform as soon as possible.

                                                               

                                                               

                                                              Best,

                                                              Jeongbin

                                                                • Re: Driver crashes with OpenCL?
                                                                  dipak

                                                                  I was already using a 64-bit (Ubuntu 14.04) setup. To generate a 64-bit executable, do I need to specify any flag for the cmake or make build? Last time I didn't specify any.

                                                                   

                                                                  Regards,

                                                                    • Re: Driver crashes with OpenCL?
                                                                      pjb7687

                                                                      Then you should have a 64-bit binary if no option was specified.

                                                                       

                                                                      Could you try the following:

                                                                      $ head -n 3 targets_1.txt > target_test.txt

                                                                      $ cas-offinder target_test.txt G test_out.txt

                                                                       

                                                                      Please let me know if you still get the same error.

                                                                       

                                                                      Best,

                                                                      Jeongbin

                                                                        • Re: Driver crashes with OpenCL?
                                                                          pjb7687

                                                                          I also found one (possibly) important piece of information: we currently set GPU_MAX_ALLOC_PERCENT to 100.

                                                                           

                                                                          $ echo $GPU_MAX_ALLOC_PERCENT

                                                                          100

                                                                           

                                                                          I don't know whether the same environment variable is set on the cluster we tried (Chundoong, http://chundoong.snu.ac.kr/); I will ask them soon.

                                                                           

                                                                           

                                                                          Best,

                                                                          Jeongbin

                                                                          • Re: Driver crashes with OpenCL?
                                                                            dipak

                                                                            It seems that the above steps run fine. Please find the (partial) output below:

                                                                            Reading human_hg38/chr11.fa...

                                                                            Sending data to devices...

                                                                            Setting pattern to devices...

                                                                            Chunk load started.

                                                                            1 devices selected to analyze...

                                                                            Finding pattern in chunk #1...

                                                                            Comparing pattern #1 in chunk #1...

                                                                            Reading human_hg38/chr13.fa...

                                                                            Sending data to devices...

                                                                            Setting pattern to devices...

                                                                            Chunk load started.

                                                                            1 devices selected to analyze...

                                                                            Finding pattern in chunk #1...

                                                                            Comparing pattern #1 in chunk #1...

                                                                            Reading human_hg38/chr8.fa...

                                                                            Sending data to devices...

                                                                            Setting pattern to devices...

                                                                            Chunk load started.

                                                                            1 devices selected to analyze...

                                                                            Finding pattern in chunk #1...

                                                                            Comparing pattern #1 in chunk #1...

                                                                            19.3648 seconds elapsed.

                                                                              • Re: Driver crashes with OpenCL?
                                                                                pjb7687

                                                                                Then please try the experimental branch of Cas-OFFinder.

                                                                                The changes in the new version include fixes for a few memory leaks; I think one of them might affect the result.

                                                                                 

                                                                                By the way, we set two environment variables during boot (below are the contents of /etc/profile.d/OPENCL.sh):

                                                                                 

                                                                                export GPU_MAX_ALLOC_PERCENT=100

                                                                                export GPU_USE_SYNC_OBJECTS=1
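To compare machines (for example this server and a cluster node), it may help to print whether these two variables are set in the shell that launches Cas-OFFinder. A minimal sketch follows; `report_var` and `report_opencl_env` are hypothetical helper names, and only the two variable names come from the file above. Scripts in /etc/profile.d are normally read by login shells, so a job started some other way may not see them.

```shell
#!/bin/sh
# Hypothetical helper: print NAME=value, or "NAME is unset", for one variable.
report_var() {
    name=$1
    # eval is used only to look up a variable by name (POSIX sh has no ${!name}).
    if eval "[ \"\${$name+set}\" = set ]"; then
        eval "printf '%s=%s\n' \"\$name\" \"\$$name\""
    else
        printf '%s is unset\n' "$name"
    fi
}

# Report the two variables mentioned in this thread.
report_opencl_env() {
    for v in GPU_MAX_ALLOC_PERCENT GPU_USE_SYNC_OBJECTS; do
        report_var "$v"
    done
}
```

Running `report_opencl_env` in the same shell, right before starting the program, shows what the OpenCL runtime will actually inherit, provided the variables are exported.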

                                                                                 

                                                                                 

                                                                                 

                                                                                Best,

                                                                                Jeongbin

                                                                                  • Re: Driver crashes with OpenCL?
                                                                                    pjb7687

                                                                                    One of the admins of the cluster says that they haven't set any environment variables on their computing nodes.

                                                                                    Maybe the environment variables are not directly related to the problem.

                                                                                     

                                                                                     

                                                                                    Best,

                                                                                    Jeongbin

                                                                                      • Re: Driver crashes with OpenCL?
                                                                                        dipak

                                                                                        I tried the experimental branch of Cas-OFFinder without setting any environment variables. After some time, I got the error shown below. However, I didn't observe any kind of system hang or segfault.

                                                                                        cas-offinder_experimental.png

                                                                                         

                                                                                        Regards,

                                                                                          • Re: Driver crashes with OpenCL?
                                                                                            pjb7687

                                                                                            The error is new to me; I haven't seen it before.

                                                                                            Could you let me know the environment of the test system, e.g. graphics cards, CPU, etc.?

                                                                                             

                                                                                            It looks like you have trouble reproducing the hang because of the clEnqueueBuffer error, so I will try to fix that first.

                                                                                            After I fix the issue I will try it again - and if I find the hanging error once more, then I will post it here.

                                                                                             

                                                                                            Thank you!

                                                                                             

                                                                                             

                                                                                            Best,

                                                                                            Jeongbin

                                                                                              • Re: Driver crashes with OpenCL?
                                                                                                dipak

                                                                                                My setup details:

                                                                                                CPU: AMD FX(tm)-4100 Quad-Core Processor

                                                                                                GPU: Hawaii XT (R9 290X)

                                                                                                OS: Ubuntu 14.04 64bit

                                                                                                Latest Catalyst 15.7

                                                                                                APP SDK 3.0

                                                                                                  • Re: Driver crashes with OpenCL?
                                                                                                    pjb7687

                                                                                                    Dear dipak,

                                                                                                     

                                                                                                    I've carried out a month of testing with the experimental version of Cas-OFFinder; however, I couldn't reproduce the error you encountered.

                                                                                                    Instead, yesterday I found that our production server had stopped again with the issue that I originally reported, and I had to force a reboot of the server. Again, the error occurs very rarely.

                                                                                                     

                                                                                                    I've also carried out a very long test (it has been running for more than 2 months) with NVIDIA cards, and it has been fine so far.

                                                                                                    I suspect that the error you encountered may also be related to the driver; could you verify it again?

                                                                                                     

                                                                                                    Please let me know if you find any flaws in the latest experimental version of Cas-OFFinder.

                                                                                                    I would really appreciate your help.

                                                                                                     

                                                                                                     

                                                                                                    Best regards,

                                                                                                    Jeongbin

                                                                                                      • Re: Driver crashes with OpenCL?
                                                                                                        dipak

                                                                                                        Please let me know what driver you used for your testing. I'll try that one.

                                                                                                          • Re: Driver crashes with OpenCL?
                                                                                                            pjb7687

                                                                                                            Dear dipak,

                                                                                                             

                                                                                                            Please refer to the information below.

                                                                                                             

                                                                                                            $ dmesg | grep fglrx | grep module

                                                                                                            [   16.923573] fglrx: module license 'Proprietary. (C) 2002 - ATI Technologies, Starnberg, GERMANY' taints kernel.

                                                                                                            [   16.928154] fglrx: module verification failed: signature and/or  required key missing - tainting kernel

                                                                                                            [   16.933152] <6>[fglrx] module loaded - fglrx 15.20.3 [Jun 22 2015] with 2 minors
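The driver version can also be pulled out of the kernel log automatically, which is convenient when comparing several machines. This is only a sketch: `fglrx_version` is a hypothetical helper, and it assumes the "module loaded - fglrx X.Y.Z" wording shown in the log line above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the fglrx
# version from the "module loaded" line, if there is one.
fglrx_version() {
    sed -n 's/.*module loaded - fglrx \([0-9][0-9.]*\).*/\1/p'
}
```

Used as `dmesg | fglrx_version`, it prints just the version number. The 15.20.x display driver series is the one that dipak identified earlier in this thread as Catalyst 15.7.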

                                                                                                             

                                                                                                            Thank you very much for your help.

                                                                                                • Re: Driver crashes with OpenCL?
                                                                                                  dipak

                                                                                                  Okay. I'll try and get back to you shortly.