2 Replies Latest reply on Jun 18, 2014 2:21 AM by pinform

    "CPU#1 stuck for 23s" error when using multiple GPUs

    cltux

      Hello,

      I'm currently setting up a Linux system with three Radeons (1x Radeon 6950, 1x Radeon 5870, 1x Radeon 5850). I'm using a freshly installed Ubuntu 13.04 (as recommended on AMD's beta driver page) and the AMD Catalyst 13.11 beta driver (from the same page).

      I'm testing with a self-written program that simply calculates lots of exponential functions, but the same problem occurs with other programs (e.g. cgminer). I'm running the OpenCL code from the command line, without Xorg.

      Each process uses exactly one OpenCL device. Every GPU works perfectly on its own, as long as only one process (and thus only one GPU) is running at a time.

      When I start two (or more) processes at the same time and one of them uses the Radeon 5870, the system locks up and I get "BUG soft lockup - CPU#1 stuck for 23s" messages on the console.
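
      A minimal sketch of how such messages could be collected from a saved kernel log (for example the output of dmesg captured over SSH), to see which CPU and which task each one names. The function name find_soft_lockups, the sample log lines, and the task name exptest are made up for illustration; the pattern assumes the usual "soft lockup - CPU#N stuck for Ns" wording, optionally followed by "[task:pid]".

```python
import re

# Matches the wording of the console message; the "!" and the trailing
# "[task:pid]" part are treated as optional because logs differ.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!?"
    r"(?: \[(?P<task>[^\]]*)\])?"
)

def find_soft_lockups(log_text):
    """Return a list of (cpu, seconds, task) tuples found in log_text.

    task is None when the line does not carry a [task:pid] suffix.
    """
    hits = []
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m:
            hits.append((int(m.group("cpu")), int(m.group("secs")), m.group("task")))
    return hits

# Hypothetical log excerpt, not taken from the affected machine.
sample = """[ 1234.567890] BUG: soft lockup - CPU#1 stuck for 23s! [exptest:2201]
[ 1234.567999] Modules linked in: fglrx(POF)
[ 1262.567890] BUG: soft lockup - CPU#1 stuck for 23s! [exptest:2201]"""

for cpu, secs, task in find_soft_lockups(sample):
    print(f"CPU {cpu} stuck for {secs}s in {task}")
```

      If the same task or CPU shows up in every hit, that narrows down which process was spinning in the kernel when the watchdog fired.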

      After that, the machine no longer accepts any keyboard input (such as Ctrl+C, or Alt+Fn to switch to console n). SSH still works, but I cannot kill the OpenCL processes (not even with kill -9). The system is unusable at that point and I have to reboot.
      I have already tried underclocking the CPUs and the GPUs, which didn't help. Google returns many results for the "CPU stuck" error message, but most of them come from completely different contexts, not GPU computing.
      I'm using a Cooler Master Silent Pro 800W 80+ Gold PSU, and the system draws only about 400W with two GPUs running, so it does not seem to be a power supply problem.
      When I run the 5850 and the 6950 together, everything works perfectly (so it is not a general problem with using multiple GPUs at the same time; it only occurs when the 5870 is one of the GPUs in use). On the other hand, when I run the 5870 alone, it also works perfectly (so the GPU does not seem to be defective). It makes absolutely no sense to me.
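
      The elimination reasoning above can be written down as a small table and checked mechanically. This is only a sketch: the helper name common_culprits is made up, and the results table is transcribed from the description in this post (True means the run was fine, False means the soft lockup appeared).

```python
def common_culprits(results):
    """Return the devices that appear in every failing combination.

    results maps a tuple of device names to True (ran fine) or
    False (system locked up).
    """
    failing = [set(combo) for combo, ok in results.items() if not ok]
    if not failing:
        return set()
    return set.intersection(*failing)

def cleared_alone(results):
    """Return the devices that ran fine in a single-device run."""
    return {combo[0] for combo, ok in results.items() if ok and len(combo) == 1}

# Observations as described in the post.
observed = {
    ("6950",): True,
    ("5870",): True,
    ("5850",): True,
    ("5850", "6950"): True,
    ("5870", "6950"): False,
    ("5850", "5870"): False,
}

suspects = common_culprits(observed)
print("present in every failing run:", sorted(suspects))
print("of those, fine when run alone:", sorted(suspects & cleared_alone(observed)))
```

      The 5870 is the only device common to all failing runs, yet it is also cleared when run alone, which is what points towards an interaction (driver, slot, or mixed GPU generations) rather than a single broken card.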

      I'm really stuck now. I don't know how to debug this, and I have no idea what further investigation might solve the problem.

      Does anyone know what exactly the "CPU#1 stuck" message means in this context?

      Is there a known problem with running multiple OpenCL processes in parallel, or with running OpenCL on several different GPU models at once?

      Has anyone experienced similar problems?

      Can anyone give me a hint about what I could check next to track down the problem?

      Thanks for any help!