4 Replies Latest reply on Jun 25, 2012 5:29 AM by realhet

    Quad GPU question (4 X 7970)

    revisionfx

      Question: In case the problem is in our code, is there a sample program with source that lets us reliably test the processing throughput of several GPUs running at once?

       

      Context:

      We are having issues with a 4 x 7970 setup (same with 3 cards).

      We dump some data on the 4 cards, execute a loop and export timings (no PCIe transfers are included in the perf timings). We have one thread per card and everything is totally independent (what happens on a card stays on a card).

      Result: one of the 4 cards runs at full speed, the others at about half speed (all OpenCL stuff).

      Full speed is the speed of 1 card with the others unplugged.

      The card that is faster is the one the monitor is connected to (we don't push any pixels out though, it's all command line).

      We verified that by moving the monitor cable to a different card.

       

      If we unplug all the cards except one (any one), then that card runs at full speed.

      This is using the latest beta driver and profiles (which somehow seem to have given us a tiny speedup).

       

      This is a config with a single 6-core Sandy Bridge Extreme CPU. I understand I can't get full PCIe transfer speed under such conditions, but I am wondering if something somewhere (i.e. the driver) makes some assumptions, since processing speed is scaled down by a clean 2X. (Many other things were tested; this is the short form.)

       

      Pierre

        • Re: Quad GPU question (4 X 7970)
          drallan

          Hi. Not sure but it sounds like the headless clock problem in this thread: http://devgurus.amd.com/thread/159062

           

          All display drivers since about 8.96 have this problem: headless 7970 GPUs (cards with no display attached) are stuck at a clock speed of 500 MHz.

          No setting or tweaking tool can fix this; even if a tool manages to change the clock, it immediately reverts to 500 MHz.

          Thus, these drivers are not very useful for multi-GPU work.
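As a rough sanity check on whether a 500 MHz clock cap could explain the "about half speed" in the original post, here is a minimal sketch. It assumes the cards normally run at the reference HD 7970 core clock of 925 MHz and that the kernel is compute-bound, so throughput scales linearly with core clock; neither assumption is confirmed in this thread.

```python
# Assumption: compute-bound kernel, throughput proportional to core clock.
STOCK_CLOCK_MHZ = 925      # reference HD 7970 core clock (assumed, not overclocked)
HEADLESS_CLOCK_MHZ = 500   # clock reported in this thread for headless cards


def relative_throughput(clock_mhz: float, reference_mhz: float) -> float:
    """Fraction of the reference throughput expected at a lower clock."""
    return clock_mhz / reference_mhz


ratio = relative_throughput(HEADLESS_CLOCK_MHZ, STOCK_CLOCK_MHZ)
print(f"expected headless speed: {ratio:.0%} of full speed")
```

Under those assumptions the headless cards would land a little above half speed, which is in the same range as the reported symptom.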

           

          I'm still using version 8.92.

          None of the 8.95 versions have the headless clock problem, but I had other issues with them.

          Someone here suggested a temporary fix using dummy VGA dongles.

          The problem has been widely reported, but so far I've seen no feedback.

          It might be considered a driver problem rather than an OpenCL problem.

           

          Allan

          • Re: Quad GPU question (4 X 7970)
            realhet

            Btw, did you test for arithmetic correctness while overclocking those 7970s?

            I get some calculation errors when I use any driver other than the first ones (Win 11.12, Linux 12.1). They occur after about 1 minute of stress testing, when the GPU temperature goes above 80 Celsius. But I'm not sure whether it's a memory transfer error or another bug of mine.

            Soon I'll make a test thing for this, but this is just weird:

            At the moment I have these results while overclocking two 7970s to 1125 MHz:

            First driver: the CAL test runs without errors, but there is a 2-3% performance drop if you use 2 GPUs.

            First driver: OpenCL runs perfectly on 1 GPU, but there is a terrible 50% performance drop when running OpenCL on 2 GPUs.

            Latest driver: OpenCL runs awesome on 2 GPUs, with no multi-GPU penalty. CAL also runs with the usual 2-3% performance degradation on 2 GPUs. BUT after 1 minute (temperature above 80 Celsius) there is an increasing number of calculation errors as the temperature goes up (to 86 Celsius) o.O

            Did you experience such errors?
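A correctness test of the kind mentioned above could compare each buffer read back from the GPU against a trusted host-side reference. The following is only a sketch: `count_mismatches` and the tolerances are hypothetical names and values chosen for illustration, and the "GPU result" is simulated by corrupting one element of the reference.

```python
import math


def count_mismatches(reference, result, rel_tol=1e-5, abs_tol=1e-8):
    """Return the indices where result differs from reference beyond tolerance.

    NaN in either list is always reported, since math.isclose treats NaN
    as not close to anything.
    """
    if len(reference) != len(result):
        raise ValueError("reference and result must have the same length")
    return [
        i
        for i, (ref, got) in enumerate(zip(reference, result))
        if not math.isclose(ref, got, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Simulated data: a host reference and a "GPU" copy with one bad element.
reference = [float(i * i) for i in range(8)]
gpu_result = list(reference)
gpu_result[5] += 1.0  # stand-in for a calculation error

bad = count_mismatches(reference, gpu_result)
print(f"{len(bad)} mismatching element(s) at indices {bad}")
```

Logging the mismatch count together with the GPU temperature for every iteration would show whether the errors really track the 80 Celsius threshold.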

              • Re: Quad GPU question (4 X 7970)
                revisionfx

                We have seen slightly different behavior than you so far (assuming we are drawing the right conclusions from our own testing):

                 

                1) The OLDER driver is 20-30% faster on one machine than on the other (same exact card; faster on the small i5 mobo than on the big Sandy Bridge Extreme mobo we use for 4 GPU testing), AND the LATEST seems to fix that for a single GPU here (finally matching perf for the same card). I don't fully understand how all the parts interact; I even wonder if that is just Profiles.xml related?

                 

                2) Both drivers are 50% slower on each additional GPU (no change), but all cards seem to get the 20-30% speedup here with the latest driver (e.g. instead of 4, 2, 2, 2 FPS it's now more like 5, 2.5, 2.5, 2.5, where 5 is the card with the monitor cable connected).

                 

                3) We haven't checked whether our results match yet, as we were testing without any output (i.e. no memory returned to the host, no PCIe transfers...) to isolate issues, but we wondered about it... will do; that would suck. Is this compute error you see happening even when the fans are manually set to full speed? Everything does slow down here after it gets to a certain temperature, but symmetrically as far as I can tell. After a test unit that lasts maybe 2 minutes here, we have to wait 10-15 minutes (or reboot) to get max perf on another run; the first run is usually fast, before something starts to be temperature conscious. Running the fans has some impact on peak speed.

                 

                4) Question: do you have something connected to the second GPU that you say goes fast now? (We will also try to "terminate" the cards with dummy VGA dongles as suggested in the previous reply, and we got some extra CrossFire cables in case the driver pays attention to that for some reason.)

                 

                5) LuxMark OpenCL appears to work fine on multi-GPU (more like 3X the speed of one card with 4 GPUs), which adds to our confusion even more. But, if I look here: http://www.luxrender.net/luxmark/ .
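To put the per-card frame rates from point 2 and the LuxMark figure from point 5 on the same scale, one option is to express each as a scaling efficiency: total throughput divided by what N copies of the fastest card would deliver. This is a small sketch using only the approximate numbers quoted above; `scaling_efficiency` is a hypothetical helper name.

```python
def scaling_efficiency(per_card_fps):
    """Aggregate throughput relative to N cards all matching the fastest one."""
    ideal = max(per_card_fps) * len(per_card_fps)
    return sum(per_card_fps) / ideal


older_driver = [4.0, 2.0, 2.0, 2.0]    # FPS per card, from point 2
latest_driver = [5.0, 2.5, 2.5, 2.5]   # FPS per card, from point 2
luxmark_speedup = 3.0                  # "more like 3X one card" with 4 GPUs, from point 5

print(f"older driver:  {scaling_efficiency(older_driver):.1%}")
print(f"latest driver: {scaling_efficiency(latest_driver):.1%}")
print(f"LuxMark:       {luxmark_speedup / 4:.1%}")
```

Both drivers come out at the same efficiency for our test, so the newer driver raises absolute speed without changing the multi-GPU scaling, while LuxMark scales noticeably better on the same hardware.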

                 

                Question/Suggestion for AMD: I imagine multi-GPU compute has been tested internally on some reference machine? On what, and how? We're getting frustrated here. Could there be an SDK sample demo that tests just that? Having one could help "crowdsource" resolving such issues.

                Namely: pin some memory, copy it to all cards, then loop for a while doing just some memory copies and some arithmetic (something like ~100 ms of compute/copy on the card per frame will do), and print the timing per card every 1000 frames (iterations). Basically the most casual file-based workflow type of application (i.e. no CrossFire sharing, no tiling of a scene as in high-res interactive/gaming/video wall applications, just autonomous compute per card).
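The structure of such a test could look like the sketch below: one host thread per card, each timing its own loop and reporting the average time per frame for every window of iterations. This is not an AMD sample and it does not touch OpenCL at all; `fake_frame` is a hypothetical placeholder (a short sleep) that a real test would replace with "enqueue copy and kernel on this card's own queue, then wait for completion". The frame count and window size are shrunk from the 1000 frames suggested above so the demo finishes quickly.

```python
import threading
import time


def run_device_loop(device_id, run_frame, frames, report_every, results):
    """Run the frame loop for one card and record average ms/frame per window."""
    timings = []
    window_start = time.perf_counter()
    for i in range(1, frames + 1):
        run_frame(device_id)
        if i % report_every == 0:
            now = time.perf_counter()
            timings.append((now - window_start) * 1000.0 / report_every)
            window_start = now
    results[device_id] = timings  # each thread writes only its own key


def benchmark(num_devices, run_frame, frames=1000, report_every=1000):
    """Start one independent thread per card and collect per-card timings."""
    results = {}
    threads = [
        threading.Thread(
            target=run_device_loop,
            args=(dev, run_frame, frames, report_every, results),
        )
        for dev in range(num_devices)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_frame(device_id):
    """Placeholder workload; stands in for one blocking copy + compute frame."""
    time.sleep(0.001)


for dev, windows in sorted(benchmark(4, fake_frame, frames=50, report_every=10).items()):
    formatted = ", ".join(f"{ms:.2f}" for ms in windows)
    print(f"card {dev}: avg ms/frame per window = {formatted}")
```

Because the placeholder only sleeps, all four simulated cards report similar timings. With real per-card work plugged in, a card running at half speed would show up as roughly double the ms/frame of the others in every window.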

                 

                Pierre