Question: In case it's our code, is there a sample program with source that allows us to reliably test the processing bandwidth of multiple GPUs at once?
Context:
We are having issues with a 4 x 7970 setup (same with 3).
We dump some data on the 4 cards, execute a loop and export timings (no PCIe-domain transfers included in the perf timings). We have a thread per card and everything is totally independent (what happens on a card stays on a card).
Result: one of the 4 cards runs at full speed, the others at about half speed (all OpenCL stuff).
Full speed is the speed of 1 card with the others unplugged.
The card that is faster is the one the monitor is connected to (we don't push any pixels out though, it's all command line)
We test that by replugging the monitor cable.
If we unplug all the cards except one (any one), then that card runs at full speed.
This is using the latest beta driver and profiles (which somehow seem to have given us a tiny speedup).
This is a config with a single 6-core SB Extreme. I understand I couldn't get full PCIe transfer speed under such conditions, but I am wondering if something somewhere (i.e. the driver) makes some assumptions, as it's a 2X slowdown in processing speed. (Many other things tested; this is the short form.)
Pierre
Hi. Not sure but it sounds like the headless clock problem in this thread: http://devgurus.amd.com/thread/159062
All display drivers since about 8.96 have this problem where headless 7970 GPUs are stuck at a clock speed of 500 MHz.
No setting or tweaking tool can fix this; if one does change the clock, it immediately reverts to 500 MHz.
Thus, these drivers are not very useful for multi-GPU work.
I'm still using version 8.92.
All versions of 8.95 do not have the headless clock problem, but I had other issues with them.
Someone here suggested a temporary fix using dummy VGA dongles.
The problem has been widely reported, but so far I've seen no feedback.
It might be considered a driver problem rather than an OpenCL problem.
Allan
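As a rough sanity check on the headless-clock explanation above, the arithmetic can be sketched as follows. This is only a sketch under stated assumptions: the 925 MHz nominal engine clock is assumed to be the stock value for these cards (adjust it if yours are clocked differently), the kernel is assumed to be compute-bound so that throughput scales with engine clock, and the function names are made up for illustration.

```python
def expected_speed_ratio(stuck_mhz, nominal_mhz):
    """Throughput of a clock-limited card relative to a full-speed one.

    Assumes a compute-bound kernel, so throughput scales with engine clock.
    """
    return stuck_mhz / nominal_mhz


def looks_clock_limited(measured_ratio, stuck_mhz=500.0, nominal_mhz=925.0,
                        tolerance=0.1):
    """True if a measured slow/fast ratio is close to the clock ratio.

    nominal_mhz=925.0 is an assumption (stock clock); tolerance is arbitrary.
    """
    expected = expected_speed_ratio(stuck_mhz, nominal_mhz)
    return abs(measured_ratio - expected) <= tolerance


if __name__ == "__main__":
    ratio = expected_speed_ratio(500.0, 925.0)
    print(f"expected slow/fast ratio: {ratio:.2f}")
    # "About half speed" corresponds to a measured ratio near 0.5.
    print("consistent with 500 MHz cap:", looks_clock_limited(0.5))
```

If the slow cards come out near this ratio, a clock cap is at least a plausible cause; a very different ratio would point elsewhere.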
Btw, did you test for arithmetic correctness while overclocking those 7970s?
I get some calculation errors when I use drivers other than the first ones (win 11.12, linux 12.1). They occur after about 1 minute of stress testing, when the GPU temperature goes above 80 Celsius. But I'm not sure whether it's a memory transfer error or another bug of mine.
Soon I'll make a test thing for this, but this is just weird:
At the moment I have these results while overclocking two 7970s to 1125 MHz:
first driver: CAL test runs without errors, but with a 2-3% performance drop if you use 2 GPUs.
first driver: OpenCL runs perfectly on 1 GPU, but there is a terrible 50% performance drop when running OCL with 2 GPUs.
latest driver: OpenCL runs awesome on 2 GPUs without a multi-GPU penalty. CAL also runs with the usual 2-3% performance degradation on 2 GPUs. BUT after 1 minute (temperature above 80 Celsius) there is an increasing number of calculation errors as the temperature goes up (to 86 Celsius) o.O
Did you experience such errors?
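A correctness check of the kind asked about here could look roughly like the sketch below: compare what comes back from the card against a reference computed on the host and count the elements that disagree. The function name and tolerances are made up for illustration, and the lists stand in for buffers read back from a device.

```python
import math


def count_mismatches(device_values, reference_values,
                     rel_tol=1e-5, abs_tol=1e-8):
    """Return (number of mismatching elements, index of the first one or None).

    device_values stands in for data read back from the GPU;
    reference_values is the same computation done on the host.
    NaN never compares close, so NaN results count as errors.
    """
    if len(device_values) != len(reference_values):
        raise ValueError("device and reference buffers differ in length")
    bad = [
        i
        for i, (d, r) in enumerate(zip(device_values, reference_values))
        if not math.isclose(d, r, rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    return len(bad), (bad[0] if bad else None)


if __name__ == "__main__":
    reference = [float(i) * 0.5 for i in range(8)]
    from_card = list(reference)
    from_card[3] += 1.0  # simulate a single corrupted element
    print(count_mismatches(from_card, reference))
```

Running this every N iterations and logging the error count next to the temperature would show whether the errors really track temperature.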
We have slightly different behavior than you so far (assuming we derive
proper intuitions based on our own testing):
1) The OLDER driver is 20-30% faster on one machine than on the other (same
exact card, faster on the small i5 mobo than on the big Sandy Bridge
Extreme mobo we use for 4-GPU testing),
AND the LATEST seems to fix that for a single GPU here (finally matching
perf for the same card).
I don't fully understand how all the parts interact, even wondering if
that is just Profiles.xml related?
2) Both drivers are 50% slower for each additional GPU (no change), but all
cards seem to get the 20-30% speedup here with the latest driver (e.g.
instead of 4, 2, 2, 2 FPS it's now like 5, 2.5, 2.5, 2.5, where 5 is the one
with the monitor cable connected).
3) We haven't checked whether our results match yet, as we were testing
without any output (i.e. no memory returned to the host, no PCIe
transfers...) to isolate issues, but we wondered about it... - will
do, that would suck. Is this compute error thing you see happening even
when the fans are running at full speed/manual? Everything does slow down
here after it gets to a certain temperature, but symmetrically as far as
I can tell (after a test unit that lasts maybe 2 minutes here, basically,
we have to wait 10-15 minutes to get max perf on another run, or reboot;
the first run is usually fast before something starts to be temperature
conscious). Running the fans has some impact on peak speed.
4) Question: do you have something connected to the second GPU that you
say goes fast now? (We will also try to "terminate" the cards with dummy
VGA dongles as suggested in the previous email; we also got some extra
CrossFire cables in case the driver pays attention to that for some reason.)
5) LuxMark OpenCL appears to work fine on multi-GPU (more like 3X one
card with 4 GPUs); anyway, that adds to our confusion even more. But, if I
look here: http://www.luxrender.net/luxmark/
.
Question/Suggestion for AMD: I imagine multi-GPU compute has
been tested internally on some reference machine? What/How? We're
getting frustrated here.
Could there be an SDK sample demo that tests just that? Having that
could help "crowdsource" resolving such issues.
That is: pin some memory, copy it to all cards, loop for a while (just some
memory copy and some arithmetic will do, like ~100 ms of compute/copy on
the card per frame in the loop), and spit out timing per card every 1000
frames (iterations) -- basically the most casual file-based workflow type
of application (i.e. no xfire sharing, no tiling of a scene like in
high-res interactive/gaming/video wall applications, just autonomous
compute per card).
Pierre
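The skeleton of the test described above might be sketched like this: one thread per card, each running its own loop and reporting frames per second for every block of frames. This is only the timing scaffold, not an SDK sample. run_frame is a hypothetical callable standing in for "enqueue the copy and kernel on this card, then wait for completion"; in the demo it is just a sleep, and all names are made up for illustration.

```python
import threading
import time


def benchmark_device(name, run_frame, frames, report_every, results):
    """Run `frames` iterations on one card and record FPS per block."""
    fps_per_block = []
    block_start = time.perf_counter()
    for frame in range(1, frames + 1):
        run_frame()  # hypothetical: per-card copy + compute, blocking until done
        if frame % report_every == 0:
            now = time.perf_counter()
            elapsed = max(now - block_start, 1e-9)  # avoid division by zero
            fps_per_block.append(report_every / elapsed)
            block_start = now
    results[name] = fps_per_block  # each thread writes only its own key


def run_all(devices, frames=1000, report_every=1000):
    """devices maps a card name to its run_frame callable; one thread per card."""
    results = {}
    threads = [
        threading.Thread(target=benchmark_device,
                         args=(name, run_frame, frames, report_every, results))
        for name, run_frame in devices.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Stand-in workloads: sleeps instead of real device work.
    devices = {
        "gpu0": lambda: time.sleep(0.001),
        "gpu1": lambda: time.sleep(0.002),
    }
    for name, fps in sorted(run_all(devices, frames=50, report_every=25).items()):
        print(name, [round(x, 1) for x in fps])
```

Because nothing is shared between the threads, any per-card difference in the reported FPS should come from the device or driver side rather than from the harness, provided run_frame really blocks on the device and does not burn host CPU.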
I always turn CrossFire on on Windows. If I don't, I guess the other card will be turned off if no display is connected to it.
On Linux I do nothing except COMPUTE=:0 and both 7970s are there.