I was wondering if AMD devs have any insight into this observation:
I've observed that my benchmarks run significantly faster on GPU1 than on GPU0: GPU0's computation time can be 2, and sometimes 4, times longer than GPU1's. I assumed this was because GPU0 is busy with Xorg on the OpenSUSE 13.2 64-bit installation I'm running (APP SDK 3.0 on Catalyst 14.12). I eventually killed Xorg to test with nothing else running on the GPU (to the extent I can control that), but the timings stayed the same.
My computations for a certain image processing kernel (secret sauce) are a stable 2 ms per call on GPU1; on GPU0 it is 5 ms, eventually sticking at up to 10 ms, with no code or build changes. Another kernel (a modified radix sort) duplicates this behaviour: 14 ms on GPU1, versus 28 ms on GPU0 sometimes and almost 70 ms at other times. The input data does not vary in these tests either, and execution time should be nearly deterministic, since the algorithms are data-independent and involve no randomness.
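For context, per-call timings like these come from looping the benchmark and watching the wall-clock time per run; a minimal sketch of that harness (`./bench` is a hypothetical stand-in for the actual benchmark binary, which I can't post):

```shell
# Time three back-to-back runs of the benchmark; "./bench" is a
# hypothetical placeholder and is skipped gracefully if absent.
for run in 1 2 3; do
    start=$(date +%s%N)            # GNU date, nanosecond resolution
    ./bench 2>/dev/null || true
    end=$(date +%s%N)
    echo "run $run: $(( (end - start) / 1000000 )) ms"
done
```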
This is a big problem if it's the cost of driving the display, because I'm building single-GPU embedded systems, but it's troublesome right now too, because I have a lot of real-time work to do on these GPUs and am counting on full utilization.
Any idea what's going on?
Hi and Merry Christmas,
Try rerunning your benchmarks while monitoring the cards from another console (in a script, e.g. polling amdconfig's clock and temperature gauges in a loop).
It could also give you a useful baseline for your cards if you try it before
running your tests.
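A sketch of the kind of monitoring script meant here, polling once a second so any throttling during a benchmark run shows up in the log (`--odgc`/`--odgt` are the standard Catalyst overdrive clock/temperature queries; the loop falls back to a note where the card or tool is unsupported, as it turns out to be below):

```shell
# Poll GPU clocks and temperature once a second, timestamped.
for i in 1 2 3; do
    date '+%H:%M:%S'
    amdconfig --odgc --odgt 2>/dev/null || echo "(no amdconfig reading)"
    sleep 1
done
```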
I suspected power management or throttling might be to blame, since the cards are identical, but I never tested this directly. The benchmarks behind these timings run for 7 seconds, and I run them back to back or with larger iteration counts; that doesn't really change the per-call timing after the first second or so (you can indeed see the numbers change while the cards ramp up). Unfortunately the M290X's are still "not supported" by amdconfig despite 14.12, even though the drivers have been working well for a while (I've posted this bug to one of the AMD boards before):
DISPLAY=:0 amdconfig --odgc
amdconfig: No supported adapters detected
DISPLAY=:0 amdconfig --odgt
amdconfig: No supported adapters detected
OpenCL works on the GPUs, obviously, as does amdcccle. I'm remote right now, hence the DISPLAY=:0. All of this works over remote desktop on a duplicate configuration, except that machine has a single AMD Radeon HD 7900 Series card; I just tested that.
If it's of value I can post some timing plots showing a ramp-down in time per call, but I can't figure out anything else. In CodeXL the performance counters were showing lower; I believe occupancy went down from 100 to 70. I'm not 100% sure that was the case, but I've been trying to figure this out on and off for 2 weeks.
Too bad amdconfig doesn't support your cards. Independent
monitoring/validation of your problem would be very useful. Can you get
some figures from CodeXL? I suspect a hardware issue. I use a Sapphire R9
270 card with Ubuntu 14.04 x64, and amdconfig is compatible with my setup. I am
using webmail (Gmail) to reply, so I'm running full Xorg with 4 virtual
desktops open, and this is what amdconfig reports:
Default Adapter - Supported device 6811
Core (MHz) Memory (MHz)
Current Clocks : 300 150
Current Peak : 945 1400
Configurable Peak Range :
GPU load : 0%
Sometimes it might get up to 3%, but this is spurious and too fast to identify
what is causing it. What I'm trying to say is that all the Ubuntu desktop activity
doesn't amount to anything for my (and hopefully your) card. So I'm suspecting
a hardware issue, card or bus. Can you try switching the cards and report back?
But I missed that you have a Crossfire setup. I don't know about it and don't use
it, but I imagine it must have some overhead and must run somewhere.
Could that be the 30% off your first card?
If that is the case, maybe for OpenCL it is better to use two independent
cards (processors) instead of a single crossfired one.
CodeXL will have to wait till next week, as I can't run it remotely; it crashes on GL-related problems under x2go and doesn't support newer GLX protocols anyway.
Re Crossfire: AFAIK it's not as complicated as it once was, just point-to-point communication over the PCIe bus, so I doubt any overhead comes from there. I was considering that the card might know it's driving a display and reserve some of itself, or that there's a hardware issue, but I was hoping the AMD team could help me poke around better. It's a real bummer amdconfig still isn't working. Judging from my desktops, where I've used amdconfig to watch GPU load, I'd say, like you, that light desktop activity doesn't add much load on average. I actually replicated the timings with Xorg shut down, going headless so to speak.
As for switching cards, I suppose you mean switching which one drives the display? I don't think I can do that, since it's a laptop.
I meant switching the PCI bus, not just the monitor cable. I didn't know
about the laptop, and I imagine that is tough to do.
You can switch display cards through the "Devices" section in amdcccle.
I would imagine that Crossfire is a bit more than point-to-point communication
over PCIe (schedulers, load balancers, cache mappers?).
I've collected profiling information for both GPUs. I can't post the kernel source code here, but I've replicated the timing behaviour across a few of my kernels, which all have different workflows.
GPU1 is the M290X whose timings are good and consistent with my desktop R9 290; GPU0 is the M290X that's running a few times slower.
I've noted that most fields of the profiles are the same, but the last few columns, from VALUBusy to LDSBankConflict (1.17 on GPU0 and 3.32 on GPU1), show significant differences; VALUBusy differs by factors of 2-3, which is around how many times slower the kernels run on that card.
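For anyone comparing counter dumps like this, pulling one column out of a profiler CSV makes the gap easy to eyeball. A quick sketch; the file layout and numbers here are invented for illustration, not the actual CodeXL export format:

```shell
# Fabricated example data in a CodeXL-like CSV layout (not the real export).
cat > counters.csv <<'EOF'
Device,VALUBusy,LDSBankConflict
GPU0,31.5,1.17
GPU1,92.0,3.32
EOF
# Print VALUBusy per device so the 2-3x gap is visible at a glance.
awk -F, 'NR > 1 { print $1 ": VALUBusy = " $2 "%" }' counters.csv
```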
Pre-post update: I just stumbled on this thread, which works around the problem by setting the environment variable GPU_NUM_COMPUTE_RINGS=1: http://devgurus.amd.com/thread/169896
I guess at this point: WTF at regression testing on AMD's side.
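The work-around from that thread is just an environment variable exported before launching the OpenCL program, so it reaches the runtime in the child process; a minimal check that it is actually inherited:

```shell
# Work-around from the linked thread: a single compute ring for this
# process tree. Your OpenCL binary would be launched from this shell.
export GPU_NUM_COMPUTE_RINGS=1
env | grep GPU_NUM_COMPUTE_RINGS   # confirm child processes will see it
```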
An update: this work-around should not be relied on. I've observed incorrect results from GPU0 for several kernels, and for iterative kernels it can also lead to deadlock. I've also noted that, some time later, the work-around's effect on timing seems to wear off and the GPU becomes slow to execute again.