Reseller RMA Data Shows Fascinating Pattern Between AMD, Nvidia GPUs

Good reliability data is both highly prized in computing and frustratingly difficult to come by. Occasionally, a third party firm like SquareTrade will publish its own figures but these reports are few and far between. It’s effectively impossible to track how a manufacturer evolves from year to year without a set of consistent criteria and multi-year tracking. European reseller Mindfactory recently chose to share its GPU RMA data for AMD versus Nvidia products and the results are quite interesting.

I’ve written about Mindfactory’s data before and I’m willing to use them as a source for this article, but I want to note an important caveat that I don’t have an explanation for. According to this data set, Mindfactory sold very few RTX 2070s and 2080s, and only ever shipped a handful of SKUs. I suspect what this implies is that the data only covers the previous 12 months. That’s relevant if we’re going to draw any conclusions about the relative age of the process nodes these GPUs were built on. The report covers 44,100 AMD GPUs and 76,280 Nvidia GPUs, and is likely a statistically significant sample of all retail channel cards sold by either company in Europe for the relevant time period.

All of the usual caveats apply. Mindfactory is one European retailer. It isn’t a US company and its data is a snapshot of the total market, nothing else. These results should not be taken as determinative, they should be read with a grain of salt, contest not valid in Alaska or Hawaii, no participation necessary, see store for details, etc, etc. Moving on.

Here are the high-level takeaways from the chart, in no particular order:

  • Less-complex, less-powerful GPUs fail less often than more complex, more powerful GPUs.
  • AMD’s midrange and budget cards do not fail more often than midrange and budget cards from Nvidia.
  • PowerColor AMD GPUs fail more often than other brands.
  • The RTX 2080 Ti is the GPU statistically most likely to fail. It is the only GPU with two-digit failure rates (11 percent) reported from multiple vendors.
  • AMD high-end GPUs fail more often, in absolute terms, than Nvidia GPUs, even if we remove the impact of PowerColor from the AMD data. The gap is significantly smaller if you do, however.

Some years ago, a report came out showing failure rates between different types of RAM. If anyone can recall it, shoot me a link — I’ve not had any luck finding the article. What it showed was that it was more common for high-end enthusiast DRAM to fail than low-end basic parts from the likes of Kingston or Crucial. Failure rates didn’t correlate perfectly with clock, but as clock speed climbed, so did the RMA rate. The article I’m recalling wasn’t the Google 2009 study, or the 2012 follow-up, and I don’t think it was the Microsoft 2012 study, either. It was based on consumer hardware, not enterprise or server tech. The point was, enthusiast hardware running close to the margin of what’s possible has a higher failure rate than bog-standard parts that are well within clock and voltage margins.

Data by Mindfactory.de

We see evidence of a very similar trend here. If we assume that this data covers July 2019 – July 2020, it means that Nvidia was still having real problems with the RTX 2080 Ti when the GPU was nearly a year old, long after the company began shipping the card. Conversely, if the data set is from Turing’s launch, it would mean all it does is capture the already-known high launch failure rate for the RTX 2080 Ti.

I wish we had more data on the RTX 2070 and 2080, because the limited data we do have suggests some high rates of return on Gainward cards for the RTX 2080 and KFA2 cards for the RTX 2070. The RTX 2070 Super and RTX 2080 Super return rates are excellent. Are they excellent because Nvidia had months to refine Turing, or were they excellent from the beginning? The answer to that question would meaningfully impact how we interpret AMD’s higher RMA rates given that the 5700 XT and 5700 launched on a brand-new 7nm process.

The fact that we see a trend towards lower failure rates on simpler, smaller GPUs from both companies is very likely relevant. The RTX 2080 Ti’s higher failure rate fits with this — the chip was a reticle-buster that pushed engineering to its limit. As for the different manufacturer failure rates, we’ve got nothing but questions. Why did MSI’s Gaming Z Trio RTX 2080 Ti have a 1 percent failure rate with 2 returns (~200 GPUs sold), while the MSI Lightning Z had an 11 percent failure rate with 14 returns (~130 GPUs sold)?
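To get a feel for how much weight two return counts that small can bear, here is a minimal sketch that puts a 95 percent Wilson score interval around each failure rate. The Wilson interval is an illustrative choice on my part, not something Mindfactory or the article used, and the function name `wilson_interval` is hypothetical. The sold counts are the article's approximate figures (~200 and ~130), so treat the output as rough.

```python
import math


def wilson_interval(failures, sold, z=1.96):
    """Return (low, high) Wilson score bounds for a failure proportion.

    z=1.96 corresponds to roughly 95 percent confidence.
    """
    if sold <= 0:
        raise ValueError("sold must be positive")
    if not 0 <= failures <= sold:
        raise ValueError("failures must be between 0 and sold")
    p = failures / sold
    z2 = z * z
    denom = 1 + z2 / sold
    center = (p + z2 / (2 * sold)) / denom
    half = z * math.sqrt(p * (1 - p) / sold + z2 / (4 * sold * sold)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Return counts from the article; units sold are approximate (~200, ~130).
    cards = {
        "MSI Gaming Z Trio RTX 2080 Ti": (2, 200),
        "MSI Lightning Z RTX 2080 Ti": (14, 130),
    }
    for name, (failures, sold) in cards.items():
        low, high = wilson_interval(failures, sold)
        rate = failures / sold
        print(f"{name}: {rate:.1%} observed, 95% interval {low:.1%} to {high:.1%}")
```

If the two intervals do not overlap, the gap between the cards is unlikely to be sampling noise alone. That still says nothing about why the gap exists.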

Drastic variation in GPU failure rates could implicate the manufacturer’s cooling practices or reflect the fact that a company introduced new models of GPU over the course of a year and these later cards failed less often. Higher failure rates on AMD cards could reflect the fact that AMD pushes its GPUs closer to the edge of stability or that AMD’s OEM partners are willing to ride the ragged edge a little closer on AMD cards than on Nvidia because Nvidia has more authority and opportunity to play hardball (and to demand that its GPUs are properly supported). One of the reasons why AMD motherboards were historically less reliable than Intel boards was that AMD could neither force VIA to fix its bugs (like the infamous KT133A southbridge problem) nor require motherboard vendors to devote as much time to debugging and improving AMD motherboard BIOSes as they were willing to invest in Intel boards. Could a similar dynamic be at work here? It could be. The point is, we don’t know. No Sapphire GPU has more than a 2 percent failure rate, and 2 percent matches any Nvidia card. So is this an AMD problem or a PowerColor problem? And if we say it’s a PowerColor problem, was the 2080 Ti a multi-manufacturer issue or something specific to Nvidia?

This is why manufacturers don’t like releasing quality data. Questions beget questions beget questions. Even if we knew the relevant time period, we wouldn’t know when the GPUs Mindfactory sold were actually made. Maybe the retailer got a big batch of initial GPUs of every sort that failed and all failure rates today are basically equal (1-2 percent) between all cards and manufacturers. Maybe the failure rates have spiked recently because COVID-19 killed quality control and companies are just pumping out whatever they can sell. Without more information, we can’t know — and it’s that “more information” that companies don’t want to hand over in the first place.

Reseller RMA Data Shows Fascinating Pattern Between AMD, Nvidia GPUs - ExtremeTech 

3 Replies

Thanks for the share. Nice to read another take on this data.

I know that in the initial RTX line, every card above the 2060 had an abnormally high failure rate. Luckily this was fixed in two of the following year's refresh cards, the 2070 Super and 2080 Super. A similar article was posted on another tech site the other day. It gave some additional numbers I found interesting, as it also broke the data down by GPU type, not just by brand like the chart above. The 2070 Super has the lowest failure rate of all the cards; I think it was around 1.57%. The article above seems to paint a picture that only the lower-end cards have lower rates, but the 2070 Super competes directly with the 5700 XT, and by contrast the 5700 XT has the second-worst failure rate while the 2070 Super has the lowest.

It isn't surprising that the failure rate on the 2080 Ti is high. After all, it wasn't a refresh card, and all the higher cards of that generation had issues, as reported by many tech sites. Then of course you have the valid reason stated above: the top card always pushes the envelope and will likely always fail more because of it. AMD doesn't have this issue, as they have not had a top card for a bit, and I don't think the statistics above counted the Vega series or Polaris.

One thing I know is every time I go to Micro Center I check out the open box cards looking for my next bargain and that shelf is always full of returned 5700 & 5700xt and little of anything else. 

I wonder: if somebody RMAs a card because of drivers or other software, is it considered a failure? Or does it have to be an actual hardware failure?

I really don't know how they gauge it. I know, for instance, that Micro Center doesn't put a card back on the open box shelf to resell without checking the hardware to some degree. However, I doubt that testing includes burn-in time or pushing the card to limits that might reveal a flaw. Likely they just make sure it fires up, and that is a very small part of calling it "working". The guy there told me that if a card gets returned a second time, they either send it back or write it off. I guess it depends on which OEMs have what return policy. A lot of people don't realize that retailers often get stuck eating the full price of something they sold, because the manufacturer doesn't warranty it to the retailer the way it does to the purchaser. To me, hardware or software, it is a failure either way, as the end user can't use the product regardless. It only changes the point of failure. The driver is as much a part of the experience as the physical hardware, if not far more so. Kinda like how our bodies wouldn't be anything without our brains being filled with experience.