13 Replies Latest reply on Aug 13, 2013 12:46 AM by himanshu.gautam

    Segmentation fault at clWaitForEvents

    ealbayrak

       

      Hi, I'm encountering a segmentation fault right at the clWaitForEvents line. I copied the whole code from another program that I wrote and modified it for my new problem. The other program works fine, but this one throws a segmentation fault at the clWaitForEvents line. I've checked every single line but still couldn't find the cause.

      Any ideas about it? Has anybody encountered something similar?

      Thank you for your help

      Code:

      // One work-item per transaction, with a work-group size of 1.
      globalThreads[0] = noOfTransactions;
      localThreads[0]  = 1;

      status = clEnqueueNDRangeKernel(
                   commandQueue,
                   kernel_supportCounting,
                   1,              // work_dim
                   NULL,           // global_work_offset
                   globalThreads,
                   localThreads,
                   0,              // num_events_in_wait_list
                   NULL,           // event_wait_list
                   &events[0]);    // event that completes when the kernel finishes
      if(status != CL_SUCCESS)
      {
          std::cout << "Error: Enqueueing kernel onto command queue. (clEnqueueNDRangeKernel)\n";
          return 1;
      }

      status = clWaitForEvents(1, &events[0]);
      if(status != CL_SUCCESS)
      {
          std::cout << "Error: Waiting for kernel run to finish. (clWaitForEvents)\n"
                    << (CL_OUT_OF_RESOURCES == status);
          return 1;
      }
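
      Editor's sketch: the error branch above only prints whether the status equals CL_OUT_OF_RESOURCES. A small lookup helper can name the status instead. The function name waitStatusName is hypothetical, and the numeric values are copied from the Khronos CL/cl.h header so the sketch compiles without the SDK; real code would use the CL_* macros. It only helps when clWaitForEvents actually returns; a crash inside the call never produces a status.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps the statuses that clWaitForEvents is documented
// to return onto their names. The integers mirror the CL_* macros in
// CL/cl.h; with the SDK available, use the macros instead of raw numbers.
const char* waitStatusName(int status) {
    switch (status) {
        case 0:   return "CL_SUCCESS";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -14: return "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST";
        case -30: return "CL_INVALID_VALUE";
        case -34: return "CL_INVALID_CONTEXT";
        case -58: return "CL_INVALID_EVENT";
        default:  return "unrecognized status";
    }
}
```

      In the error branch one could then print waitStatusName(status) rather than a boolean comparison.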



        • Segmentation fault at clWaitForEvents
          himanshu.gautam

          Please provide more info: CPU, GPU, SDK, driver, and OS.

          Also post the kernel code you are executing.

          • Segmentation fault at clWaitForEvents
            jnygaard

            Hi, I don't have an answer, just wanted to let you know that others also have a problem similar to this. (I just found your post googling for this subject.)

            In my case I have two events that I want to wait for, one from a queue running on a GPU and one on the CPU, so I do

            err = clWaitForEvents( read_events.size(), &read_events[0] );

            where the size is then 2. This segfaults, so I never get a return value. If I just wait for one at a time,

             

              ret = clWaitForEvents( 1, &read_events[0] );

              ret |= clWaitForEvents( 1, &read_events[1] );
            then everything seems to work. Obviously, this won't help you, but maybe it gives someone an idea? I will continue looking into it and post if I figure it out.
            J.O.
            PS:
            My setup:
               Number of platforms: 2
                 Querying platform num. 0 (NVIDIA CUDA):
                 On this platform, 1 device(s) available.
                   Querying device num. 0 (GeForce GT 240):
                 Querying platform num. 1 (AMD Accelerated Parallel Processing):
                 On this platform, 1 device(s) available.
                   Querying device num. 0 (Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz):
            I use NVIDIA driver 260.19.44 and APP SDK 2.4.
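
            Editor's sketch: the two events here come from different platforms, so they necessarily belong to different contexts, and the OpenCL specification lists CL_INVALID_CONTEXT as an error for clWaitForEvents when the events in the list do not share a context. That may be related to the crash. One way to stay within the rule is to group events by context and wait once per group. The types below are stand-ins, not real OpenCL handles; with the real API the context of an event can be queried with clGetEventInfo and CL_EVENT_CONTEXT.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Stand-in for a cl_event together with the context it belongs to.
// FakeEvent is hypothetical and exists only so the sketch compiles
// without the OpenCL headers.
struct FakeEvent {
    int id;
    const void* context;
};

// Partition the events so that every group contains events from exactly
// one context. Each group could then be passed to its own wait call.
std::map<const void*, std::vector<FakeEvent>>
groupByContext(const std::vector<FakeEvent>& events) {
    std::map<const void*, std::vector<FakeEvent>> groups;
    for (const FakeEvent& ev : events) {
        groups[ev.context].push_back(ev);
    }
    return groups;
}
```

            With one GPU event and one CPU event from separate platforms, this yields two groups of one event each, which matches the workaround of waiting on each event separately.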

             

              • Re: Segmentation fault at clWaitForEvents
                noah_r

                I found this old thread while troubleshooting a segmentation fault that occurs within clWaitForEvents() when the number of events in the wait list is greater than one.  The scenario and workaround are similar to the post by jnygaard two years ago.  My program works fine if I break the list of four events into four individual calls to clWaitForEvents().

                 

                System setup: AMD APP 2.8.1 on Ubuntu 13.04 and 12.04.  CPU-only execution.  All API actions take place in a single OpenCL context.

                 

                At the time I call clWaitForEvents() I have an array of four cl_event handles associated with two command queues and also one user event.

                { clEnqueueNDRangeKernel, clEnqueueMapBuffer,  clCreateUserEvent, clEnqueueMapBuffer }

                 

                Here is some debug data for the case of calling clWaitForEvents one event at a time in a for-loop. (The one-at-a-time case worked whether the list of events was traversed in forward or reverse order.)

                 

                In waitForEvents() before wait:

                Event indx 0, handle 0x7f5610000b80, ref cnt 7: QUEUE: 0x4236430, CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_SUBMITTED

                Event indx 1, handle 0x7f5610001030, ref cnt 4: QUEUE: 0x4236890, CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_QUEUED

                Event indx 2, handle 0x7f5610001220, ref cnt 3: QUEUE: 0,         CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_SUBMITTED

                Event indx 3, handle 0x7f5610001380, ref cnt 3: QUEUE: 0x4236430, CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_QUEUED

                In waitForEvents() AFTER wait:

                Event indx 0, handle 0x7f5610000b80, ref cnt 4: QUEUE: 0x4236430, CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_COMPLETE

                Event indx 1, handle 0x7f5610001030, ref cnt 1: QUEUE: 0x4236890, CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_COMPLETE

                Event indx 2, handle 0x7f5610001220, ref cnt 1: QUEUE: 0,         CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_COMPLETE

                Event indx 3, handle 0x7f5610001380, ref cnt 2: QUEUE: 0x4236430, CONTEXT: 0x4224fd0, EXECUTION_STATUS: CL_COMPLETE

                 

                And if clWaitForEvents() is called with all four events at once...

                 

                In waitForEvents() before wait:

                Event indx 0, handle 0x7f4c9c000b80, ref cnt 7: QUEUE: 0x43da430, CONTEXT: 0x43c74b0, EXECUTION_STATUS: CL_RUNNING

                Event indx 1, handle 0x7f4c9c001030, ref cnt 4: QUEUE: 0x43da890, CONTEXT: 0x43c74b0, EXECUTION_STATUS: CL_QUEUED

                Event indx 2, handle 0x7f4c9c001220, ref cnt 3: QUEUE: 0,         CONTEXT: 0x43c74b0, EXECUTION_STATUS: CL_SUBMITTED

                Event indx 3, handle 0x7f4c9c001380, ref cnt 3: QUEUE: 0x43da430, CONTEXT: 0x43c74b0, EXECUTION_STATUS: CL_QUEUED

                0:Segmentation Violation error, status=: 11

                (rank:0 hostname:landau-ubuntu pid:6418):ARMCI DASSERT fail. src/common/signaltrap.c:SigSegvHandler():310 cond:0

                 

                The segmentation fault happens upon stepping into clWaitForEvents().
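
                Editor's sketch of the one-event-at-a-time workaround described above. The template and its name waitEachInTurn are hypothetical. The waitOne callable stands in for a single-event wait; with the real API it could wrap clWaitForEvents(1, &ev). Unlike OR-ing return codes together, this keeps the first failing status intact so it can still be compared against the CL_* error constants.

```cpp
#include <cassert>
#include <vector>

// Waits on every event individually and returns the first non-zero status
// (0 mirrors CL_SUCCESS). It keeps waiting after a failure so that no
// event is left unobserved. Event and WaitFn are placeholders for
// cl_event and a wrapper around a single-event wait call.
template <typename Event, typename WaitFn>
int waitEachInTurn(const std::vector<Event>& events, WaitFn waitOne) {
    int firstError = 0;
    for (const Event& ev : events) {
        const int status = waitOne(ev);
        if (status != 0 && firstError == 0) {
            firstError = status;
        }
    }
    return firstError;
}
```

                This sidesteps the crash without explaining it, so it is a stopgap rather than a fix.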

              • Re: Segmentation fault at clWaitForEvents
                Meteorhead

                Hi!

                 

                Let me revive the topic with the same issue, but a different instance of it. I have written a simple multi-device OpenCL app with the new cl.hpp that ships with SDK 2.8, and the code crashes on cl::WaitForEvents(). I have attached the main.cpp (and the kernel, for convenience) that reproduces the issue. The events are produced by kernel functors originating from different contexts, but from the same platform and the same thread. I'd expect such events to be compatible in a call to clWaitForEvents; however, the app crashes. The interesting part of the code is right here:

                 

                    ////////////////////
                    // Launch kernels //
                    ////////////////////
                    std::cout << "Launching kernels." << std::endl;
                    std::vector<cl::Event> status(devices.size());
                    try
                    {
                        // Launch one kernel per device and keep its completion event.
                        for(unsigned int i = 0 ; i < status.size() ; ++i)
                            status.at(i) = kernel_functors.at(i)(device_args.at(i), input1_buffs.at(i), input2_buffs.at(i), output_buffs.at(i), multiply);
                    }
                    catch(const cl::Error& err) {std::cerr << err.what() << std::endl; return EXIT_FAILURE;}
                    for(auto& stat : status) stat.wait();   // wait on each event individually
                    std::cout << "Finished!" << std::endl;
                

                 

                This version of the code (as attached) works fine. As soon as I change to cl::WaitForEvents(status), it crashes. Any ideas why this happens? I'm using Windows Server 2012 with VS 2012 CTP Nov compiler and Catalyst 12.10. Crash happens inside clWaitForEvents().

                  • Re: Segmentation fault at clWaitForEvents
                    developer

                    Hi Meteorhead,

                     

                    I just went through the code.

                    I see that the buffers are created without specifying the context associated with them (see below):

                    "

                        cl::Buffer input1_buffer(input1_vector.begin(), input1_vector.end(), isReadOnly);   // Handles to buffers

                        cl::Buffer input2_buffer(input2_vector.begin(), input2_vector.end(), isReadOnly);   // Handles to buffers

                        cl::Buffer output_buffer(output_vector.begin(), output_vector.end(),!isReadOnly);   // Handles to buffers

                    "

                    I have yet to dig into this further; I will do so next week.

                    But before that, I just wanted to share this with you:

                    "

                    A memory buffer is always associated with a context. For contexts with multiple devices, the "cl_mem" object still belongs only to the context and is shared by all devices in that context. It is up to the programmer to synchronize accesses to it so that 2 devices are not updating the same part of the buffer. Sub-buffers come in handy here. You could partition a buffer into multiple sub-buffers (as you have done) and allow each device to work on 1 sub-buffer, all within a single context.

                    So, there is no real need to create multiple contexts just because we have multiple devices.

                    "

                     

                    Anyway, I will look more into your problem next week,

                    Good luck,

                     

                    Best Regards,

                    Workitem 6

                      • Re: Segmentation fault at clWaitForEvents
                        nou

                        I stumbled on exactly this issue with programs and NVIDIA. I didn't specify a context when creating programs and got a segfault or an invalid program error (I don't remember exactly which) on the NVIDIA platform. On AMD it worked fine. Then I found out that there are getDefault() methods for getting a default context, which apparently didn't work on the NVIDIA platform.

                          • Re: Segmentation fault at clWaitForEvents
                            developer

                            Thank you Nou for sharing this piece of info.

                             

                            Meteorhead, does this resolve your issue? If not, I will give it a try next week and get to the root of the problem.

                             

                            Best Regards,

                            Workitem 6

                              • Re: Segmentation fault at clWaitForEvents
                                Meteorhead

                                Hi developer and Nou,

                                 

                                thank you for your suggestions. It has been some time since I had to write an OpenCL program from scratch (I think we all know the copy-paste of device init from earlier projects, the curse of boilerplate code), and I overlooked some stuff. First among them is this issue of specifying a context for the buffers: when none is given, the C++ wrapper takes the liberty of finding a default context for them.

                                 

                                I have altered my code so that all devices from a platform are put into the same context and run in parallel, and cl::Event::waitForEvents() is only used within a single platform (along with buffers, command queues and all those things), so the application can work in any mixed scenario utilizing all devices. (The exception is when both the AMD and Intel platforms are present, since both will enumerate the CPU, which is naturally unwanted.)

                                 

                                Now the program works fine; CPU and GPU code run neatly in parallel. Next I'll check whether multi-GPU runs in parallel when run from a single context and controlled from a single thread. The last time I tried, over a year ago, multiple threads and contexts were required to achieve true concurrency (which sort of defeats the objective, because most memory goodness of OpenCL is only valid across devices in a single context).

                                  • Re: Segmentation fault at clWaitForEvents
                                    developer

                                    Hi Meteorhead,

                                     

                                    Good to know the code works fine now.

                                    OpenCL never necessitated multiple host threads for multiple OpenCL devices. In fact, the context wraps everything up neatly.  Thread-safety for contexts (multiple threads working on a single OpenCL context) was introduced only in OpenCL 1.1.

                                    In fact, it was CUDA that started with separate threads for separate devices, and only later came to support multiple devices under a single host thread.

                                    Wish you good luck with your experiments.

                                    Best Regards,

                                    Workitem 6

                                      • Re: Segmentation fault at clWaitForEvents
                                        Meteorhead

                                        I did not know that thread safety was introduced as a requirement in OpenCL 1.1. All I know is that a while back (roughly 1-1.5 years ago), if one created a context with 2 GPUs in it and launched kernels to them through 2 command queues, the 2 GPUs serialized their operations: the second started working only after the first finished, because of a limitation imposed by the AMD runtime. One needed 2 contexts created in 2 separate threads to be able to control 2 GPUs concurrently.

                                          • Re: Segmentation fault at clWaitForEvents
                                            developer

                                            Hi,

                                            The runtime would serialize the command queues if they worked on the same CL_MEM object.

                                            So, the best bet is to create SUB_BUFFERS (introduced in OpenCL 1.1 with clCreateSubBuffer) and use them.

                                            When you use SUB_BUFFERs, the runtime won't serialize.

                                             

                                            All this is because the CL_MEM object is owned by the context and not by the device.

                                            So, if 2 devices run kernels that update the same CL_MEM object, that is actually a race condition.

                                            This condition is analyzed in the OpenCL spec.

                                             

                                            Check Appendix A.1 - Shared OpenCL Objects

                                            Check Appendix A.2 - Multiple Host Threads
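
                                            Editor's sketch of the sub-buffer partitioning suggested above, under stated assumptions. It only computes the byte ranges; Region mirrors the origin and size fields of cl_buffer_region, and splitIntoRegions is a hypothetical helper. With the real API each range would be passed to clCreateSubBuffer with CL_BUFFER_CREATE_TYPE_REGION, and alignBytes would come from CL_DEVICE_MEM_BASE_ADDR_ALIGN (which is reported in bits), since a misaligned origin is rejected with CL_MISALIGNED_SUB_BUFFER_OFFSET.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the two fields of cl_buffer_region (both in bytes).
struct Region {
    std::size_t origin;
    std::size_t size;
};

// Splits totalBytes into at most `parts` contiguous, non-overlapping
// regions whose origins are multiples of alignBytes. The last region may
// be shorter than the others. Returns an empty list for zero arguments.
std::vector<Region> splitIntoRegions(std::size_t totalBytes,
                                     std::size_t parts,
                                     std::size_t alignBytes) {
    std::vector<Region> regions;
    if (totalBytes == 0 || parts == 0 || alignBytes == 0) {
        return regions;
    }

    // Even share per part, rounded up to a multiple of the alignment.
    std::size_t chunk = (totalBytes + parts - 1) / parts;
    chunk = ((chunk + alignBytes - 1) / alignBytes) * alignBytes;

    for (std::size_t origin = 0; origin < totalBytes; origin += chunk) {
        const std::size_t remaining = totalBytes - origin;
        regions.push_back({origin, remaining < chunk ? remaining : chunk});
    }
    return regions;
}
```

                                            Because the chunk size is rounded up, a small buffer can produce fewer regions than devices, so the caller should check the size of the returned list.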

                                             

                                            Hope this helped,

                                            Best Regards,

                                            Workitem 6