10 Replies Latest reply on Mar 8, 2012 6:05 PM by yurtesen

    2 GPU problems...

    yurtesen

      I have run into several problems with ATI/AMD OpenCL on Linux along the way, but there are a few very serious ones that everybody wants fixed (according to what I saw in forum posts, which go back years in some cases).

       

       

      1- Dependence on the X server. People seem to have been complaining about this for a very long time. Are there any plans to get this fixed?

       

      2- If a compute kernel somehow freezes, the card freezes with it, and the only way to recover is a reboot. Nvidia's OpenCL lets you simply press Ctrl-C at the console to stop a running program. They also provide a program which can reset a card... Is there any work on getting this fixed?

       

       

      Thanks,

      Evren

        • Re: 2 GPU problems...
          MicahVillmow

          1) The work is in progress, no ETA however.

          2) Good idea, I've let the relevant people know about it.

            • Re: 2 GPU problems...
              yurtesen

              Micah, Nvidia was sort of smart about frozen cards. If the card is running X, there is a kernel timeout: the kernel is stopped after a few seconds if it doesn't finish. It is possible to switch to a console terminal and run the program from there to circumvent this timeout without completely shutting X down. If X is not running on the card, then there is no timeout, but I could always press Ctrl-C and exit programs without crashing anything.

               

              When I run a program on a Radeon, X freezes until the program running on the GPU ends, and of course if the program crashes, I have to find another terminal to ssh into the box and reboot it. (In that case the running process turns into a zombie as well.)
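The timeout behaviour described above can be roughly approximated from user space by launching the GPU program through a small watchdog. This is only a sketch, not something either vendor's driver provides: `run_with_timeout` and its parameters are hypothetical names, and it only helps while the driver still lets the process respond to signals. It does not reset a card that is already hung, and a process stuck inside the driver may not exit at all.

```python
import signal
import subprocess


def run_with_timeout(cmd, timeout_s, grace_s=5.0):
    """Run cmd; if it exceeds timeout_s, send SIGINT (like Ctrl-C), then SIGKILL.

    Returns (status, returncode), where status is "finished",
    "interrupted" or "killed". Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(cmd)
    try:
        return "finished", proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass

    # Ask politely first: the same signal a Ctrl-C at the console delivers.
    proc.send_signal(signal.SIGINT)
    try:
        return "interrupted", proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass

    # Last resort. A process blocked inside the kernel driver may still not
    # die; in that case this wait can block and the card stays hung.
    proc.kill()
    return "killed", proc.wait()
```

For example, `run_with_timeout(["./my_opencl_app"], 10.0)` would give a (hypothetical) binary ten seconds before interrupting it. The two-stage shutdown mirrors the Ctrl-C behaviour first and only then forces termination.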

                • Re: 2 GPU problems...
                  lrk_associates

                  About a week ago I posted asking about why a second card was not recognized by either fglrxinfo or clinfo.  Am I to understand from the original post here that OpenCL works through an X session assigned to a card?  If so, is there a HOWTO somewhere that talks about making such an assignment, and what X11 configuration files I am supposed to be editing to achieve this?  I don't see anything obvious in the SDK documentation, particularly Getting Started, but I certainly could have missed it.  Thanks.

                   

                  Laurence Keefe

                    • Re: 2 GPU problems...
                      yurtesen

                      http://devgurus.amd.com/message/1279111

                      Try the suggestion in this post and let us know. It didn't work for me, but it might work for you if both of your GPUs are from AMD.

                        • Re: 2 GPU problems...
                          lrk_associates

                          The suggested solution consists of making sure that there is a Device section associated with each card. I did this manually, since aticonfig does not seem to do it automatically. I have these entries in my xorg.conf file:

                           

                          Section "Device"
                              Identifier  "aticonfig-Device[0]-0"
                              Driver      "fglrx"
                              Option      "Monitor-CRT1" "0-CRT1"
                              BusID       "PCI:1:0:0"
                          EndSection

                           

                          Section "Device"
                              Identifier  "aticonfig-Device[1]-0"
                              Driver      "fglrx"
                              BusID       "PCI:6:0:0"
                          EndSection
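Writing one Device section per card by hand is easy to get subtly wrong, so a small generator can help. The sketch below is illustrative only: both function names are hypothetical, the Identifier pattern simply mirrors the entries above, and the `Option "Monitor-CRT1"` line is left out. It also assumes the slot strings come from `lspci`, which prints bus numbers in hexadecimal, whereas the xorg.conf `BusID` uses decimal values.

```python
def lspci_to_xorg_busid(slot):
    """Convert an lspci-style slot such as "06:00.0" or "0000:06:00.0"
    (hexadecimal) into the decimal xorg.conf form "PCI:6:0:0"."""
    parts = slot.split(":")
    bus, rest = parts[-2], parts[-1]      # ignore an optional domain prefix
    device, function = rest.split(".")
    return "PCI:{}:{}:{}".format(int(bus, 16), int(device, 16), int(function, 16))


def device_section(index, bus_id, driver="fglrx"):
    """Render one Device section for the card at bus_id."""
    return (
        'Section "Device"\n'
        '    Identifier  "aticonfig-Device[{}]-0"\n'
        '    Driver      "{}"\n'
        '    BusID       "{}"\n'
        'EndSection\n'
    ).format(index, driver, bus_id)


# The two slots below correspond to the cards at 1:0:0 and 6:0:0.
for i, slot in enumerate(["01:00.0", "06:00.0"]):
    print(device_section(i, lspci_to_xorg_busid(slot)))
```

For the two cards here the hexadecimal and decimal values happen to coincide; the conversion only matters for bus numbers of `0a` and above.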

                           

                          When I reboot, my Xorg.0.log file contains the attached messages associated with these PCI addresses and the fglrx driver.  I apologize for the verbosity, but I have no idea what not to include.  Note the early entries where the two PCI addresses of the two cards are identified at 1:0:0 and 6:0:0, along with the corresponding chipsets (0x6758 and 0x6738).

                           

                          The first possibly significant problem occurs when I see:

                           

                          [16.483] (WW) Falling back to old probe method for fglrx
                          [16.497] (II) Loading PCS database from /etc/ati/amdpcsdb
                          [16.498] (WW) fglrx: No matching Device section for instance (BusID PCI:0@6:0:0) found
                          [16.498] (--) Chipset Supported AMD Graphics Processor (0x6758) found
                          [16.499] (WW) fglrx: No matching Device section for instance (BusID PCI:0@0:0:0) found
                          [16.499] (WW) fglrx: No matching Device section for instance (BusID PCI:0@0:2:0) found
                          [16.499] (WW) fglrx: No matching Device section for instance (BusID PCI:0@0:4:0) found

                           


                          which suggests a Device section is missing for the 6:0:0 PCI ID, even though the xorg.conf file does contain one.  Since there are also a large number of similar messages for BusIDs that don't even seem to exist, it is unclear how serious this (supposed) lack of a Device section is.  Note also the fourth line above where a 0x6758 chipset is found.  This corresponds to the PCI 1:0:0 location and the 6670 card that is the regular graphics display.  Just after all the "No matching Device..." messages, however, there are two related lines:

                           

                          [16.499] (**) ChipID override: 0x6738
                          [16.499] (**) Chipset Supported AMD Graphics Processor (0x6738) found

                           

                          which seem to suggest that the second card is being recognized, for the 0x6738 chipset corresponds to the second graphics card that I wish to use as the GPU.  What the "override" means is unclear.
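One way to check the "No matching Device section" warnings against the configuration is to extract the BusIDs from both files and compare them. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes the `0@` prefix in the fglrx log form `PCI:0@6:0:0` can be dropped so that the remaining `bus:device:function` triple is comparable to the xorg.conf form `PCI:6:0:0`. The sample strings are copied from the excerpts in this post.

```python
import re

# BusID lines inside xorg.conf Device sections, e.g.  BusID "PCI:6:0:0"
CONF_BUSID = re.compile(r'^\s*BusID\s+"PCI:(\d+:\d+:\d+)"', re.MULTILINE)

# fglrx warnings in Xorg.0.log, e.g.  (BusID PCI:0@6:0:0) found
# Assumption: the part before "@" is ignored for the comparison.
LOG_WARNING = re.compile(
    r"No matching Device section for instance \(BusID PCI:(?:\d+@)?(\d+:\d+:\d+)\)"
)


def configured_busids(xorg_conf_text):
    """BusIDs that have a Device section in xorg.conf."""
    return set(CONF_BUSID.findall(xorg_conf_text))


def warned_busids(xorg_log_text):
    """BusIDs the driver complained about, in log order."""
    return LOG_WARNING.findall(xorg_log_text)


def contradictory_warnings(xorg_conf_text, xorg_log_text):
    """Warned BusIDs that do in fact have a Device section."""
    configured = configured_busids(xorg_conf_text)
    return [b for b in warned_busids(xorg_log_text) if b in configured]


SAMPLE_CONF = '''
Section "Device"
    Identifier  "aticonfig-Device[0]-0"
    Driver      "fglrx"
    BusID       "PCI:1:0:0"
EndSection
Section "Device"
    Identifier  "aticonfig-Device[1]-0"
    Driver      "fglrx"
    BusID       "PCI:6:0:0"
EndSection
'''

SAMPLE_LOG = '''
[16.498] (WW) fglrx: No matching Device section for instance (BusID PCI:0@6:0:0) found
[16.499] (WW) fglrx: No matching Device section for instance (BusID PCI:0@0:0:0) found
[16.499] (WW) fglrx: No matching Device section for instance (BusID PCI:0@0:2:0) found
[16.499] (WW) fglrx: No matching Device section for instance (BusID PCI:0@0:4:0) found
'''

print(contradictory_warnings(SAMPLE_CONF, SAMPLE_LOG))  # prints ['6:0:0']
```

In practice the two strings would be read from the real files (typically /etc/X11/xorg.conf and /var/log/Xorg.0.log). A non-empty result means the driver warned about a card that does have a Device section, which is the situation described here.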

                           

                          Following this sequence there is the first of several instances where both PCI addresses are apparently (successfully) accessed:

                           

                          [16.503] ukiDynamicMajor: found major device number 250
                          [16.503] ukiDynamicMajor: found major device number 250
                          [16.503] ukiOpenByBusid: Searching for BusID PCI:1:0:0
                          [16.503] ukiOpenDevice: node name is /dev/ati/card0
                          [16.503] ukiOpenDevice: open result is 9, (OK)
                          [16.503] ukiOpenByBusid: ukiOpenMinor returns 9
                          [16.503] ukiOpenByBusid: ukiGetBusid reports PCI:6:0:0
                          [16.503] ukiOpenDevice: node name is /dev/ati/card1
                          [16.503] ukiOpenDevice: open result is 9, (OK)
                          [16.503] ukiOpenByBusid: ukiOpenMinor returns 9
                          [16.503] ukiOpenByBusid: ukiGetBusid reports PCI:1:0:0

                           

                          and each such instance is then followed by additional different messages that I am unqualified to evaluate.

                           

                          In the following section, both cards are accessed (the 6670 card has 1GB of DDR3, the 6870 1GB of GDDR5), but there seems to be some claim about the BIOS on the 6870 card being invalid.  In the final section the actual correspondence between the PCI 1:0:0 card and the fglrx driver module in the kernel seems to be made.

                           

                          [16.999] (II) fglrx(0): [uki] DRM interface version 1.0
                          [16.999] (II) fglrx(0): [uki] created "fglrx" driver at busid "PCI:1:0:0"
                          [16.999] (II) fglrx(0): [uki] added 8192 byte SAREA at 0x2000
                          [16.999] (II) fglrx(0): [uki] mapped SAREA 0x2000 to 0x7f4483112000
                          [16.999] (II) fglrx(0): [uki] framebuffer handle = 0x3000
                          [16.999] (II) fglrx(0): [uki] added 1 reserved context for kernel
                          [16.999] (II) fglrx(0): swlDriScreenInit done
                          [16.999] (II) fglrx(0): Kernel Module Version Information:
                          [16.999] (II) fglrx(0): Name: fglrx
                          [16.999] (II) fglrx(0): Version: 8.93.4
                          [16.999] (II) fglrx(0): Date: Dec  5 2011
                          [16.999] (II) fglrx(0): Desc: ATI FireGL DRM kernel module
                          [17.000] (II) fglrx(0): Kernel Module version matches driver.

                           

                           

                          There is no similar section for the PCI 6:0:0 card in the messages.  I note that using the hwinfo --gfx command shows the fglrx driver as being active for both cards (at end of attached file). I hope this information will help someone diagnose this problem.

                           

                          Laurence Keefe

                    • Re: 2 GPU problems...
                      yurtesen

                      I think the situation is very critical (and as proof, within just one day somebody with the same problem has already posted to this thread!). People are not able to run multi-GPU systems easily with AMD GPUs. Earlier, in a different thread, you said that AMD is trying to focus on GPU OpenCL performance. Performance is nothing if you can't run the programs.

                       

                      BTW, I am really fed up with rebooting my machine. I am scared to run OpenCL programs on my AMD GPU. Nowadays I test them on the CPU, and then on another brand's GPU, before running them on AMD! Don't you think that is ridiculous?

                       

                      I would at least like to see a way to recover the GPU without rebooting the machine... or an automatic timeout that recovers the card by itself if the GPU is used by X, since not everybody can go find another machine to log in to their workstation and reset the card.

                       

                      Thanks!

                        • Re: 2 GPU problems...
                          d.a.a.

                          Not being able to kill a process that is hanging a GPU is particularly bad in multi-user environments in which users have access to GPU computing. What if a user freezes a GPU? It's not acceptable to just reboot the machine.

                            • Re: 2 GPU problems...
                              yurtesen

                              Today I was running a program on the CPU and my Radeon crashed. I had to use a netbook until the CPU program finished...

                               

                              I work on programs which are used on supercomputer clusters. Guess whether people will choose AMD or Nvidia when they throw in some 1k graphics cards in the next purchase. (My colleagues were asking why I have 3 monitors on my desk but am using a netbook! I had to say, AMD crashed again...)

                               

                              And of course, once Nvidia cards are bought, CUDA will be used and an irreversible process will begin against AMD. Not many large-scale projects are actually using GPGPUs yet, so this is a very critical time for AMD.  Micah Villmow, are you listening? I think you should get whoever you reported these problems to to escalate them!

                                • Re: 2 GPU problems...
                                  MicahVillmow

                                  yurtesen,

                                  Can you provide your system setup and the app that is causing the problem so that we can help figure out what is going wrong?

                                    • Re: 2 GPU problems...
                                      yurtesen

                                      MicahVillmow wrote:

                                       

                                      yurtesen,

                                      Can you provide your system setup and the app that is causing the problem so that we can help figure out what is going wrong?

                                      The bug was in the OpenCL app I was running; it caused a deadlock... But if an OpenCL app gets stuck on Nvidia hardware, for example, I can just press Ctrl-C and it cancels the execution gracefully (unless the Radeon fails to do this on my system because of some other problem?).

                                       

                                      If you are running these cards on the nodes of a supercomputer, how do you think this problem could be handled? Especially since it is not at all feasible to reboot thousands of nodes at random times. People will simply be forced to purchase, develop and optimize their code for devices which do not require a reboot and simply work!

                                       

                                      I am not sure how this simple thing is not seen by the managers of the people who write the driver code. All I can imagine is that maybe nobody told them that people have been complaining about these basic problems for years.

                                       

                                      Anyway, thanks for trying to help...