A Stability Problem Is Brewing With Nvidia RTX 3080, 3090 GPUs [Updated]

Recently there has been some discussion about the EVGA GeForce RTX 3080 series.

During our mass-production QC testing, we discovered that a full 6-POSCAP solution cannot pass real-world application testing. It took almost a week of R&D effort to find the cause, reduce the POSCAPs to 4, and add 20 MLCC caps prior to shipping production boards; this is why the EVGA GeForce RTX 3080 FTW3 series was delayed at launch. No 6-POSCAP production EVGA GeForce RTX 3080 FTW3 boards were shipped.

But due to the time crunch, some reviewers were sent a pre-production version with 6 POSCAPs; we are working with those reviewers directly to replace their boards with production versions. The EVGA GeForce RTX 3080 XC3 series, with a 5 POSCAP + 10 MLCC solution, matches the XC3 spec without issues.

Reports have now come in of Asus TUF models and Nvidia RTX 3080 Founders Edition cards crashing as well.

Original Story Below: 

It’s not unusual to see some users posting problems when a new GPU or CPU launches, but there’s early data suggesting that some RTX 3080 and RTX 3090 GPUs have a stability problem when they boost near or above 2GHz. Most reports have focused on 2GHz, but at least one user said his GPU was below that clock.

Reports have begun popping up online of stability problems likely tied to GPU boost frequencies. Known-affected models include the Zotac RTX 3080 Trinity and the MSI RTX Ventus 3X OC. MSI’s Gaming Trio is mentioned, as is EVGA’s RTX 3080 XC. These reports make approximately the same claim: The GPU crashes in one or more titles, often at around 2GHz. Reducing clock speeds can resolve the problem.

Igor of Igor’s Lab believes he has an explanation for the problem. After referencing Nvidia’s documents for the RTX 3080’s PCB design, he writes:

The BoM and the drawing from June leave it open whether large-area POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are used (marked in red), or rather the somewhat more expensive MLCCs (Multilayer Ceramic Chip Capacitor). The latter are smaller and have to be grouped for a higher capacity.

[Image: Bottom-POSCAP-vs-MLCC. Image by Igor’s Lab]

The areas blocked off with boxes and the set of 10 green rectangles are all power rails. If the RTX 3080 and RTX 3090 are being fed dirty power, it would explain why the cards destabilize and crash at high frequencies.

Zotac, for example, used POSCAPs for all six rails:

Image by Igor’s Lab

Nvidia used four POSCAPs and two groups of MLCCs, as shown below. Igor notes he’s been unable to crash his Founders Edition GPU, implying that the all-POSCAP layout may be the cause of the crashing bugs. We don’t know for certain that the 3080 FE doesn’t have this problem, but so far nobody reporting an issue appears to have an RTX 3080 FE.

[Image: Founders-1. Image by Igor’s Lab]

We don’t know how many cards are affected by this issue. Even if using POSCAPs instead of MLCCs is the cause of the problem, it doesn’t automatically follow that every POSCAP-equipped card has a problem. It seems more likely that a certain percentage of POSCAP cards would have issues than that Nvidia would fail to notice it had misinformed all of its AIBs and handed them bad manufacturing documents. The former could cause some elevated repair rates and grumbling, while the latter would require the repair of every RTX 3080 and RTX 3090 manufactured to date.

If you run into this problem, the first thing to try is lowering the core clock on your GPU. Users are reporting that a reduction of 80-100MHz always works. While overclocking your GPU is dangerous if you don’t know what you’re doing, it’s hard to muck up a part by running it slower.
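As a rough illustration of that workaround, here is a minimal Python sketch that works out a capped clock from the clock seen at the time of a crash. The helper names, the 2000MHz crash clock, and the 210MHz floor are hypothetical values chosen for the example. The second helper only builds an `nvidia-smi --lock-gpu-clocks` command and does not run it; whether a particular GeForce card and driver honor that option is an assumption, and most users reporting the fix applied a negative clock offset in a vendor tuning tool instead.

```python
from typing import List


def capped_clock_mhz(observed_crash_mhz: int, reduction_mhz: int = 100) -> int:
    """Return a maximum clock that sits `reduction_mhz` below the crash clock."""
    if reduction_mhz < 0:
        raise ValueError("reduction_mhz must be non-negative")
    return observed_crash_mhz - reduction_mhz


def lock_clock_command(min_mhz: int, max_mhz: int) -> List[str]:
    """Build (but do not execute) an nvidia-smi clock-lock command line."""
    if min_mhz > max_mhz:
        raise ValueError("min_mhz must not exceed max_mhz")
    return ["nvidia-smi", f"--lock-gpu-clocks={min_mhz},{max_mhz}"]


# Hypothetical example: crashes seen around 2000MHz, back off by 100MHz.
target = capped_clock_mhz(2000, 100)
print(target)
# 210MHz is an assumed floor for illustration, not a documented value.
print(" ".join(lock_clock_command(210, target)))
```

Starting at the larger end of the reported 80-100MHz range and then easing back is one reasonable way to find the smallest reduction that stays stable.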

This situation is going to evolve over the next few days as Nvidia discusses it with AIBs and investigates how best to fix the problem. I advise keeping an open mind as far as what the potential cause might be, not because I doubt Igor, but because sometimes manufacturing analysis reveals additional problems that weren’t previously known and couldn’t be seen from surface-level examination.

No matter what the problem turns out to be, Nvidia or its partners have a mess they’ll need to clean up while also dealing with extreme card shortages. We’ll keep you posted on whatever the cause turns out to be.

A Stability Problem Is Brewing With Nvidia RTX 3080, 3090 GPUs [Updated] - ExtremeTech 

leyvin
Miniboss

The issue affects any "Near" MSRP Card, but it's certainly an interesting design flaw.

Well I say flaw... it's not actually a real issue per se.

Rather, the "issue" in question is that whenever the Frequency goes beyond Specification, the EM Noise dramatically increases, and the Single Package Transistor Arrays aren't insulated from it... as such, this causes a fault within the Card that crashes and resets the Drivers.

This wouldn't really be an issue if the Specification Clocks were actually what the GPU ran at. In general this tends to occur at 1900MHz+, while the Card itself is Specified to operate at 1720MHz, so it's STILL running almost 200MHz above specification... but NVIDIA GPU Boost will automatically overclock a GeForce GPU if the Power / Thermal Limit allows.
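To make that argument concrete, here is a toy Python model of opportunistic boosting. It is only a sketch and not NVIDIA's actual algorithm: the 15MHz step, the 3W-per-step cost, the 280W draw at the rated clock, and the function names are all assumptions made up for illustration, and the thermal limit is left out entirely. Only the 1720MHz and 1900MHz figures come from the post.

```python
def settle_clock_mhz(rated_mhz, power_at_rated_w, power_limit_w,
                     watts_per_step, step_mhz=15):
    """Step the clock up while the next step still fits under the power limit."""
    if watts_per_step <= 0:
        raise ValueError("watts_per_step must be positive")
    clock, power = rated_mhz, power_at_rated_w
    while power + watts_per_step <= power_limit_w:
        clock += step_mhz
        power += watts_per_step
    return clock


def is_stable(clock_mhz, unstable_from_mhz=1900):
    """Treat any clock at or above `unstable_from_mhz` as unstable (toy rule)."""
    return clock_mhz < unstable_from_mhz


# Hypothetical card: 280W at the rated 1720MHz, each 15MHz step costs 3W.
generous = settle_clock_mhz(1720, 280, 320, 3)   # generous power limit
tight = settle_clock_mhz(1720, 280, 300, 3)      # tighter power limit
print(generous, is_stable(generous))
print(tight, is_stable(tight))
```

In this toy setup the generous power limit lets the clock settle past 1900MHz, while the tighter limit stops it short, which is the trade-off the power delivery argument is about.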

As a note, this is typically why NVIDIA design their Power Delivery System the way they do.

That is to say, it can only deliver up to 60% of the actual potential Draw and Utilisation... the reason for this is to keep EM Noise relatively low but also to limit how far GPU Boost can push the Card, so it doesn't simply hit a Frequency Limit and Crash (esp. with a Lower Quality ASIC).

I personally dislike this approach, and I wasn't happy when AMD adopted it as well for the RX 5000-Series.

Yes, it's "Nice" that my RX 5700 XT will contentedly run at 2050MHz (most of the time), but given that the Boost Frequency is 1905MHz while the Game Frequency is 1760MHz...

Well it's frustrating.

What I mean is, AMD are ONLY guaranteeing that my Card will hit 1760MHz for Gaming, while saying it "Can" Boost up to 1905MHz... but then my Card actually runs beyond that, and so its performance ends up unrepresentative of RX 5700 XTs in general.

In fact I'm well aware that the performance I get on mine is ABOVE what the Avg. RX 5700 XT will achieve.

Yes, we're talking ~5% but that shouldn't be the performance variation between 2 cards of the same Class.
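For reference, the clock gaps quoted above work out as follows. This is just arithmetic on the three frequencies from the post (the function name is made up), and a percentage gain in clock speed is not the same thing as a percentage gain in game performance, which is usually smaller.

```python
def percent_above(actual_mhz, reference_mhz):
    """Return how far `actual_mhz` sits above `reference_mhz`, in percent."""
    return (actual_mhz - reference_mhz) / reference_mhz * 100.0


game_mhz, boost_mhz, observed_mhz = 1760, 1905, 2050

# Observed clock relative to the advertised Boost Frequency.
print(round(percent_above(observed_mhz, boost_mhz), 1))  # 7.6
# Observed clock relative to the Game Frequency.
print(round(percent_above(observed_mhz, game_mhz), 1))   # 16.5
```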

AMD always used to ensure that the same Cards operated at the same Frequencies, and anything "Extra" was achieved through manual overclocking.

This meant that all of them would perform identically "Out-of-the-Box" but would OC differently... and that's fine.

But then look at NVIDIA, where they've had this Boost Technology for about a Decade... there is no way to trust Review Scores, as review cards can easily run 5-8% over Specification, and the Specification is all that NVIDIA actually promises as a guarantee.

This gives a false sense of the performance to expect. 

Imagine if instead GPU Boost was something you enabled in GeForce Experience for Automatic Overclocking (which I'm not against, if it doesn't void the Warranty)... well, the consumer reaction would've been very different.

It would be an issue ENTIRELY limited to those Overclocking their Hardware, and Avg. Consumers who never Overclock (or touch the Drivers) wouldn't even be aware of it as a problem.

The reason this does exist is NVIDIA's own dishonest business practices; things like XFR, Precision Boost, Turbo Boost, etc. are dishonest and misrepresentative of the products you're being sold.

Because those are NOT what is being promised and guaranteed by the Company; they're merely an added bonus, which is fine to have... but it's WRONG to sell a Product knowing full well that Consumers EXPECT something from it that you're not guaranteeing / promising to deliver to them.

Again... I have no issues with these being OPTIONAL things for Radeon / Ryzen, but I think that the Out-of-the-Box Behaviour should be to hit the Advertised (Promised) Frequencies.

What I do see as beneficial to the End-User, and something that could be automated, is Undervolting.