7 Replies Latest reply on Oct 25, 2011 5:23 PM by N3KO

    Strange HD5870 performance problem

    N3KO

      Hi all!

      I have several HD5870 GPUs. Until recently, I got around 330 Gflop/s from my double-precision application on each GPU. For some time now, however, several of the GPUs have been showing a loss of performance.

      Right now, 5 out of 20 GPUs are significantly and consistently under-performing, although they still return the correct numerical results. To illustrate the problem: one GPU is performing at 85 Gflop/s (roughly 25 percent of the original performance), and some other GPUs are performing as low as 150 and 250 Gflop/s. All of my tests are done on identical systems (hardware and software), with the same Linux OS and drivers, and with no other application running on the system.

      For troubleshooting, I tried swapping the GPUs between different computers and the performance-loss problem followed the GPU. So it seems clear to me that the performance issue is related to the GPU.

      Has anyone experienced a similar performance degradation? I really don't know what I could do, and searching Google for the problem has not given me any results so far. Any advice or suggestions will be greatly appreciated :-)

       

      Cheers,

      N3KO

        • Strange HD5870 performance problem
          Meteorhead

          Just as a guess, I would monitor the gpu clocks and temperature using

          watch -n 0.5 'aticonfig --adapter=ALL --odgc'

          watch -n 0.5 'aticonfig --adapter=ALL --odgt'

          (Before that, you need to enable the Overdrive commands. Refer to 'aticonfig --help' for that; I don't know it off the top of my head, but I guess it's --od-enable. It will not overclock your system just yet, it only enables the monitoring tools.)

          If the degradation follows the GPUs, I would first suspect that either the coolers are dusty or the fans are not operating properly. I would first test whether, on longer simulations, the GPU has to pull its clock rates down because of overheating. You should see the clock rates in --odgc shift from the maximum to something lower.

          Take a look at that, and if all GPUs run at the same clocks the entire time, then it comes down to debugging the software.
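
          To compare good and bad cards over a long run, it can help to log the readings instead of watching them live. A minimal sketch, assuming a POSIX shell; `log_samples` is a hypothetical helper (not part of aticonfig), and the only aticonfig flags used are the two from the commands above:

```shell
# Hypothetical helper: run a monitoring command COUNT times, INTERVAL seconds
# apart, and prefix every output line with the time the sample was taken.
# Usage: log_samples COUNT INTERVAL COMMAND [ARGS...]
log_samples() {
    count=$1
    interval=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        stamp=$(date '+%Y-%m-%d %H:%M:%S')
        # Run the command and tag each line of its output with the timestamp.
        "$@" 2>&1 | while IFS= read -r line; do
            printf '%s  %s\n' "$stamp" "$line"
        done
        i=$((i + 1))
        # Fractional intervals such as 0.5 need a sleep that accepts them
        # (GNU sleep does; strict POSIX sleep takes whole seconds only).
        sleep "$interval"
    done
}

# Example, run once per machine while the application is under load:
#   log_samples 3600 1 aticonfig --adapter=ALL --odgc > clocks.log
#   log_samples 3600 1 aticonfig --adapter=ALL --odgt > temps.log
```

          The resulting log files from a good and a bad card can then be compared side by side to see whether the clocks ever drop during the run.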

            • Strange HD5870 performance problem
              N3KO

              Hi Meteorhead,

              Thanks for the advice. I also thought that for some reason the GPU was being underclocked. However, all of the values are within normal range while running my program. I monitored the temperature, GPU clock and voltage, and reported performance level. Good and bad GPUs report similar temperatures, and their clocks, voltage, and performance level are the same while running.

              I looked into the GPUs, and I can see no dust. I wonder if it could be a problem with the heatsink, but the GPU temperature remains low under load (~70 degrees).

               

              Thanks!

                • Strange HD5870 performance problem
                  Meteorhead

                  Hi N3KO!

                  Unfortunately, I don't have any other ideas from the HW side. If all results are valid and the clock rates do not differ between machines, then the only thing I can imagine is that part of the GPU is disabled and sits idle, so that all kernels occupy only some of the compute units and never get submitted to the others, and are thus serialized in a way.

                  I do not know if the 5870 is capable of doing such a thing when it detects a HW malfunction, similar to marking bad sectors on an HDD. I doubt it would. I believe that if part of the GPU became faulty, it would just produce corrupt data.

                  From a software side, when compiling programs, the application collects required device capability info from the runtime, the runtime queries the driver, which queries the device (as I would guess). I doubt that the driver could detect faulty segments of the GPU, so there is really no reason that the kernels you compile would differ from each other.

                  So according to this, I am out of ideas as to what might cause your problem.

                    • Strange HD5870 performance problem
                      N3KO

                      I just finished testing different programs on the GPU and found that my performance problem seems to be an issue with the GPU PCIe!

                      After running the PCIe speed test on the GPUs, I found that the GPUs with performance problems also have correspondingly bad GPU-to-CPU transfer speeds.

                      For example, the GPU that performs at 25% has a transfer speed of 400 MB/sec. Other GPUs are transferring at about 1.6 GB/sec, while normal speeds are around 6.7 GB/sec. (I got this information using the PCIe Speed Test v0.2.)

                      It seems that some of the PCIe lanes decided to stop working for some reason... hmmmm.
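
                      The three transfer rates line up fairly well with narrower PCIe links. A rough check, assuming PCIe 2.0 signalling (5 GT/s with 8b/10b encoding, i.e. 500 MB/s per lane per direction; the thread does not say which PCIe generation the slots are) and a hypothetical helper name `pcie2_min_width`:

```shell
# Hypothetical helper: print the smallest standard PCIe 2.0 link width whose
# theoretical one-way bandwidth covers a measured transfer rate (in MB/s).
# Assumption: PCIe 2.0, 500 MB/s per lane per direction.
pcie2_min_width() {
    rate_mb=$1
    per_lane_mb=500
    for width in 1 2 4 8 16; do
        if [ "$rate_mb" -le $((width * per_lane_mb)) ]; then
            echo "x$width"
            return 0
        fi
    done
    echo "more than x16"
    return 1
}

# The three rates reported above, converted to MB/s.
for rate in 400 1600 6700; do
    printf '%5d MB/s -> at least %s\n' "$rate" "$(pcie2_min_width "$rate")"
done
```

                      Measured throughput always stays below the theoretical figure, so this only gives a minimum width: under these assumptions 400 MB/s fits a x1 link, 1.6 GB/s needs at least x4, and 6.7 GB/s needs the full x16. That would be consistent with links that trained at a reduced width, but it is an inference from the numbers, not something measured in this thread.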

                    • Strange HD5870 performance problem
                      genaganna

                       

                      Did you update the driver or SDK? Are you using the same driver and SDK in both cases?

                        • Strange HD5870 performance problem
                          N3KO

                          Hi genaganna,

                          I did not update the driver or SDK. So, I don't suspect it could be a driver issue.

                          Still, I'll try updating to the latest driver and see if that helps. Thanks!

                            • Strange HD5870 performance problem
                              N3KO

                              Hi all!

                              I found a solution to my problem :-)

                              I took one of the affected GPUs and cleaned the PCIe connector (which looked perfectly clean to the naked eye). Once I had reconnected the GPU to the motherboard, I shook the GPU a little bit, restarted the computer, and both the PCIe Speed Test and my application went back to normal performance.

                              This worked right away on the first GPU that I tried, but a second GPU required 3 or 4 attempts before it worked. I'm not completely sure whether it is the cleaning or the shaking that does the trick, but it is certainly a strange problem.

                              Something that troubles me is that the computers remain untouched and are not moved at all (for weeks) until the performance issue suddenly arises. I wonder if it is dust that starts to collect in the PCIe slot or something else...
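
                              Since the symptom turned out to be a degraded PCIe link, the negotiated link width can also be checked directly, without running a benchmark. A sketch, assuming `lspci` from pciutils is installed and prints the usual `LnkCap:` and `LnkSta:` lines with a `Width xN` field; `link_widths` and `check_link` are hypothetical helper names, and the bus address in the usage comment is a placeholder:

```shell
# Hypothetical helper: read `lspci -vv` output for one device on stdin and
# print the advertised (LnkCap) and negotiated (LnkSta) link widths.
link_widths() {
    sed -n -e 's/.*LnkCap:.*Width \(x[0-9][0-9]*\).*/cap=\1/p' \
           -e 's/.*LnkSta:.*Width \(x[0-9][0-9]*\).*/sta=\1/p'
}

# Hypothetical helper: compare the two widths and print a one-line verdict.
check_link() {
    widths=$(link_widths)
    cap=$(printf '%s\n' "$widths" | sed -n 's/^cap=//p' | head -n 1)
    sta=$(printf '%s\n' "$widths" | sed -n 's/^sta=//p' | head -n 1)
    if [ -z "$cap" ] || [ -z "$sta" ]; then
        echo "unknown"
    elif [ "$cap" = "$sta" ]; then
        echo "ok $sta"
    else
        echo "degraded $sta (capable of $cap)"
    fi
}

# Usage (root is usually needed to read the capability registers;
# 01:00.0 is a placeholder for the GPU's bus address):
#   lspci -vv -s 01:00.0 | check_link
```

                              Running this after each reboot, or from a periodic job, would show whether a card has dropped to a narrower link before the application slows down.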