4 Replies Latest reply on Feb 18, 2010 1:36 AM by mosix0

    OpenCL library steals signals


      I have a program that uses OpenCL but also does other things.  At times the program sleeps waiting for external signals (using "sigpause()"), and I discovered that it sometimes fails to wake up when an expected signal arrives.


      I searched and found the reason: the OpenCL library creates full threads, with the clone flags CLONE_SIGHAND and CLONE_THREAD.  As a result the library threads share signal handlers and signal delivery with the main program, so the kernel may hand a signal that is intended to wake the main program to an OpenCL library thread instead.  That thread runs the handler and returns to whatever it was doing before, while the main program is never awakened and remains stuck.
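
      The effect can be reproduced without OpenCL.  What follows is only a minimal sketch, assuming an ordinary pthread can stand in for one of the library's internal threads; the names on_usr1, helper and signal_was_stolen are made up for the illustration.  The calling thread blocks SIGUSR1 while a helper thread leaves it unblocked, so a signal sent to the process is handled by the helper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Thread that ran the handler; written only inside the handler. */
static pthread_t handler_thread;
static volatile sig_atomic_t got_signal = 0;

static void on_usr1(int sig)
{
    (void)sig;
    handler_thread = pthread_self();
    got_signal = 1;
}

/* Hypothetical stand-in for a library thread: SIGUSR1 stays unblocked. */
static void *helper(void *arg)
{
    struct timespec ts = {0, 1000000};  /* 1 ms */

    (void)arg;
    while (!got_signal)
        nanosleep(&ts, NULL);
    return NULL;
}

/* Returns 1 if the helper thread, not the caller, handled SIGUSR1. */
int signal_was_stolen(void)
{
    struct sigaction sa;
    sigset_t block, old;
    pthread_t tid;
    int rc, stolen;

    got_signal = 0;
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    /* The helper inherits the caller's current mask (SIGUSR1 unblocked). */
    rc = pthread_create(&tid, NULL, helper, NULL);
    assert(rc == 0);

    /* The caller blocks SIGUSR1 while it "works". */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &block, &old);

    kill(getpid(), SIGUSR1);    /* signal directed at the whole process */
    pthread_join(tid, NULL);    /* helper exits once the handler has run */

    stolen = got_signal && !pthread_equal(handler_thread, pthread_self());
    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return stolen;
}
```

      Calling signal_was_stolen() from a thread that does not already block SIGUSR1 should report that the helper took the signal, which is the situation the main program ends up in when a library thread is the one with the signal unblocked.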


      I believe that although the library shares memory and perhaps also some file-descriptors with the main program, it has no need to share signals as well.


      At the very least, if CLONE_SIGHAND cannot be avoided, the library threads should block all signals that they do not use.  The Linux kernel would then deliver those signals to the main program rather than to a library thread.


      (Note that if the library continues to share signals with the main program and merely blocks them, there is still a race between the time a library thread is created and the time it blocks the unused signals.  To prevent this race, the library should block all signals before calling clone(), then restore the original signal mask in the parent/main thread, but not in the new library thread(s).)
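
      The block-before-create sequence could look roughly like this.  It is a sketch only, written with pthreads rather than a raw clone() call, and it does not claim to describe how the OpenCL library is actually built; create_thread_signals_blocked, worker and main_thread_gets_signal are hypothetical names.  A new thread inherits its creator's signal mask, so blocking first leaves no window in which the new thread could take a signal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <time.h>
#include <unistd.h>

/* Create a thread that starts with every blockable signal blocked. */
int create_thread_signals_blocked(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    sigset_t all, old;
    int err;

    sigfillset(&all);
    err = pthread_sigmask(SIG_SETMASK, &all, &old);
    if (err != 0)
        return err;
    err = pthread_create(tid, NULL, fn, arg);   /* child inherits the full mask */
    pthread_sigmask(SIG_SETMASK, &old, NULL);   /* restore the creator's mask only */
    return err;
}

static pthread_t handler_thread;
static volatile sig_atomic_t handled = 0;
static atomic_int stop_worker;

static void on_usr1(int sig)
{
    (void)sig;
    handler_thread = pthread_self();
    handled = 1;
}

/* Hypothetical library thread: idles until asked to stop. */
static void *worker(void *arg)
{
    struct timespec ts = {0, 1000000};  /* 1 ms */

    (void)arg;
    while (!atomic_load(&stop_worker))
        nanosleep(&ts, NULL);
    return NULL;
}

/* Returns 1 if the calling thread handled SIGUSR1, 0 if not, -1 on error. */
int main_thread_gets_signal(void)
{
    struct sigaction sa;
    pthread_t tid;
    int rc;

    handled = 0;
    atomic_store(&stop_worker, 0);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    if (create_thread_signals_blocked(&tid, worker, NULL) != 0)
        return -1;

    /* Signal directed at the process; the worker has it blocked. */
    kill(getpid(), SIGUSR1);

    atomic_store(&stop_worker, 1);
    pthread_join(tid, NULL);
    return (handled && pthread_equal(handler_thread, pthread_self())) ? 1 : 0;
}
```

      With the worker created this way, the only thread left with SIGUSR1 unblocked is the caller, so the handler runs there.  SIGKILL and SIGSTOP cannot be blocked, and a real library would probably want to leave synchronous fault signals such as SIGSEGV alone.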


      Hope this is not too complicated, but it is really a bummer to have the OpenCL library which is supposed to be a "black-box" affect unrelated aspects of the calling program.

        • OpenCL library steals signals


                    Could you please provide a test case to reproduce your problem?

            • OpenCL library steals signals

              Sorry for the delay - I had to construct a fully working OpenCL program for that (anything really, just add two vectors, this program doesn't even bother collecting the result).


              Description: A child process sends the main program a signal, SIGUSR1, after one second.  Meanwhile the main process sets up and queues an OpenCL kernel, then blocks all signals while it performs some "work".  When the "work" is finished, it uses "sigsuspend" to receive the signal.  However, by then the signal has already been caught by one of the OpenCL library threads.


              A good output should be:


              Main thread is 12345 - Woke up thread 12345


              Instead I get something like:


              Main thread is 12345 - Woke up thread 12348


              Here goes the program:


              #include <CL/cl.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <signal.h>
              #include <unistd.h>
              #include <sys/syscall.h>

              const char *src = "__kernel void add(__global int *A, __global int *B, __global int *C, int n){int i = get_global_id(0);C[i] = A[i] + B[i];}";

              int main_pid;

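              /* 186 is the x86-64 syscall number for gettid; other architectures differ */
              #ifndef __NR_gettid
              #define __NR_gettid 186
              #endif

              /* signal handler: report which thread actually received the signal */
              void wakeup(int sig)
              {
                  printf("Main thread is %d - Woke up thread %d\n", main_pid, (int)syscall(__NR_gettid));
              }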

              #define SIZE 60000

              int main(int na, char *argv[])
              {
              cl_platform_id platform;
              cl_context_properties cps[3];
              cl_context context;
              cl_command_queue queue;
              cl_device_id device;
              cl_program program;
              cl_mem ma, mb, mc;
              cl_kernel kernel;
              cl_event event;
              int n = SIZE;
              int A[SIZE], B[SIZE], C[SIZE];
              size_t sizes[3] = {SIZE, 0, 0};
              int son;
              cl_int res;
              sigset_t all, none;
              long work;

              main_pid = getpid();
              signal(SIGUSR1, wakeup);
              if(!(son = fork())) {
                  /* child: signal the parent after one second, then exit */
                  sleep(1);
                  kill(getppid(), SIGUSR1);
                  exit(0);
              }
              if(clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
                  fprintf(stderr, "Could not get a platform\n");
                  exit(1);
              }
              cps[0] = CL_CONTEXT_PLATFORM;
              cps[1] = (cl_context_properties)platform;
              cps[2] = 0;
              if(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL)
                 != CL_SUCCESS) {
                  fprintf(stderr, "No device\n");
                  exit(1);
              }
              if(!(context = clCreateContext(cps, 1, &device, NULL, NULL, NULL))) {
                  fprintf(stderr, "No context\n");
                  exit(1);
              }
              /* the trailing arguments were lost from the post; 0 (default properties) and NULL are assumed */
              if(!(queue = clCreateCommandQueue(context, device, 0, NULL))) {
                  fprintf(stderr, "No queue\n");
                  exit(1);
              }
              if(!(program = clCreateProgramWithSource(context, 1,
                                  (const char **)&src, NULL, NULL))) {
                  fprintf(stderr, "No program\n");
                  exit(1);
              }
              if(clBuildProgram(program, 0, NULL, "", NULL, NULL) != CL_SUCCESS) {
                  fprintf(stderr, "Program not built\n");
                  exit(1);
              }
              if(!(kernel = clCreateKernel(program, "add", NULL))) {
                  fprintf(stderr, "No kernel\n");
                  exit(1);
              }
              if(!(ma = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR,
                                  sizeof(A), A, NULL)) ||
                 !(mb = clCreateBuffer(context, CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR,
                                  sizeof(B), B, NULL)) ||
                 !(mc = clCreateBuffer(context, CL_MEM_WRITE_ONLY|CL_MEM_USE_HOST_PTR,
                                  sizeof(C), C, NULL))) {
                  fprintf(stderr, "Failed creating buffers\n");
                  exit(1);
              }
              if(clSetKernelArg(kernel, 0, sizeof(ma), &ma) != CL_SUCCESS ||
                 clSetKernelArg(kernel, 1, sizeof(mb), &mb) != CL_SUCCESS ||
                 clSetKernelArg(kernel, 2, sizeof(mc), &mc) != CL_SUCCESS ||
                 clSetKernelArg(kernel, 3, sizeof(n), &n) != CL_SUCCESS) {
                  fprintf(stderr, "Failed setting kernel args\n");
                  exit(1);
              }
              if((res = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, sizes,
                                  NULL, 0, NULL, &event)) != CL_SUCCESS) {
                  fprintf(stderr, "Kernel not running, res=%d\n", res);
                  exit(1);
              }

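              /* the signal sets must be initialized before they are used */
              sigfillset(&all);
              sigemptyset(&none);
              sigprocmask(SIG_SETMASK, &all, NULL);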
              for(work = 0 ; work < 1000000000L ; work++)