Nvidia Ampere Discussion [2020-05-14]

Voxilla · Sep 26, 2020

This is turning into a pretty disastrous product launch.
Good thing is all the criminal resellers are now stuck with unsalable cards.

CarstenS · Sep 26, 2020

Voxilla said:
This is turning into a pretty disastrous product launch.
Good thing is all the criminal resellers are now stuck with unsalable cards.

Yep, quite happy now that I didn't get a 3090 in the first round. With the second round probably still a week or two away, I think I'd rather wait for what AMD has in store, or I'll stay with my Vega56 for another gen. Mostly playing Defense Grid and Talos Principle anyway... but then, Nov. 19th, Cyberpunk 2077 is coming.

Scott_Arm · Sep 26, 2020

I kind of wonder if some of these cards aren’t on stop-ship while they test some changes to the BOM. Might explain why quantities are limited.

CarstenS · Sep 26, 2020

Might not be the worst of ideas.

Kaotik · Sep 26, 2020

Ext3h said:
Yes, and there are also reported cases for MSI and EVGA models which used the same parts list as the Founders Edition, albeit not at the same frequency as for models using POSCAPs on the NVVDD rail too. It may be the same issue with a higher error margin, or they may be unrelated issues. Difficult to tell apart from PSU problems.

One unknown user (assuming it's not fake, Igor didn't link to a source) apparently got a Zotac GPU stable at boost clocks by replacing the POSCAPs on NVVDD with an MLCC group, which is a strong case in point, assuming that user knew enough about electrical engineering not to be fooled by PSU issues.

As it turns out, they're not even POSCAPs, they're SP-CAPS which are both more expensive and better.
Also apparently the cap configuration varies card by card even within specific models, as Jayz pointed out in his video on Twitter

https://twitter.com/x/status/1309617232201175040

Ext3h · Sep 26, 2020

CarstenS said:
Might not be the worst of ideas.

Not many other options, even if that means that the planned launch stock has to be rebuilt (meaning at least 2 shipping round-trips before anything fixed ends up on the shelf), and all the broken models have to be flashed with a throttled firmware and rebranded.

I wouldn't be surprised at all if we were seeing the faulty 3080/3090 models again as "3070" or "3070 Ti", including their oversized PCBs, at most stripped from their coolers. I don't see any other option to prevent a total loss of the inventory.

As for the models already out there, a recall is the only option. Throttling the product when already owned by the customer would end up in lawsuits.

On the bright side, at least some AIBs (Asus, maybe others as well?) got it right before shipping their first batch.

For the remainder of the AIBs and NVidia themselves, this is going to leave a huge dent in the financial projections for 2020.

Kaotik said:
As it turns out, they're not even POSCAPs, they're SP-CAPS which are both more expensive and better.

More expensive than POSCAPs, but still cheaper and less suited for high frequency applications compared to a whole array of 10 MLCCs.

Nice digest of the whole topic in the current state:

https://www.reddit.com/r/hardware/comments/izmi1k

EDIT1:
https://forums.evga.com/m/tm.aspx?m=3095238
Official statement from EVGA, reviewers got faulty models, all production units are supposed to be cleared from this specific issue. The claim that 1 MLCC group is sufficient still needs to be validated though, as failures are still reported in the wild.

EDIT2:
And a couple of electrical engineers are voicing misgivings regarding MLCC too, as it's prone to aging, voltage and temperature related issues. If the MLCC groups end up failing too (over time, as it's doubtful whether they have sufficient safety margin), then this may yet turn into a perfect disaster.

EDIT3:
As to why the doubt about safety margin: some vendors have only 220uF of capacitance per group, some 330uF, and some went for a more conservative 470uF per group. EVGA appears to be in the 220uF category. For comparison, the Founders Edition uses 470uF per group on the NVVDD rail and 220uF per group on the MSVDD rail. Asus models are all 470uF.

EDIT4:

https://twitter.com/x/status/1309659834468298753

And that's an Asus TUF failing for a reviewer. So apparently it's not all about the caps, even though they do play a role, as EVGA confirmed unambiguously. Coincidentally, some 20 series owners also report similar crashes with the 30 series launch drivers, so it may as well be bad drivers as a cherry on top.

gongo · Sep 26, 2020

Voxilla said:
This is turning into a pretty disastrous product launch.
Good thing is all the criminal resellers are now stuck with unsalable cards.

If the crashing happens at 2 GHz and above, that is already an overclock.
Why should Nvidia be mindful of it?
I want my i9 to hit 5.2 GHz; if it does not, I cannot blame Intel or Asus.

Kaotik · Sep 26, 2020

gongo said:
If the crashing happens at 2 GHz and above, that is already an overclock.
Why should Nvidia be mindful of it?
I want my i9 to hit 5.2 GHz; if it does not, I cannot blame Intel or Asus.

Because their Boost algorithm allows the GPU to boost that high at stock if the cooling and power limits allow it.

Ext3h · Sep 26, 2020

gongo said:
If the crashing happens at 2 GHz and above, that is already an overclock.
Why should Nvidia be mindful of it?
I want my i9 to hit 5.2 GHz; if it does not, I cannot blame Intel or Asus.

The initial report on the Nvidia forums was about a crash to desktop at a 2 GHz boost, hard wall.
Since then, plenty of users have reported a "me too", but they did not overclock their GPUs (or at least had only an OEM overclock).

Certain models, like Zotac's lineup, are running into the issue even at stock clocks (not even OEM overclocked), at an alarming rate.
Which is coincidentally also what a couple of reviewers experienced with their pre-production models from EVGA and Colorful.
And that is what triggered an interest in why the different models differ so much in OC potential, or even in stability under normal operating conditions.

But what the interest in the differing parts lists unveiled is most likely the cause of the shortage of 3080 GPUs. Several vendors fell into the same trap as Zotac, and produced GPUs which cannot even run stable at stock clocks of 1440/1710 MHz.

That's the reason why, for some vendors, availability of their models has been pushed back by a month or two. They are busy recalling their en-route stock, and now have to rebuild their lineup of 3080 and 3090 cards. And I am not just talking about Gigabyte and Zotac, who carried the faulty design into production, but also a number of other vendors who caught the issue just in time before the GPUs hit the shelf, and have now effectively lost their launch day inventory. Actually, Colorful even admitted that this was the reason their GPUs haven't entered the market yet, after reviews showed instabilities.

But that's not all, yet. Because it appears that not all GPUs are running stable even with the "middle ground" NVidia has chosen with their "reference" (not actually reference, but just one possible interpretation of their ambiguous specification) design. Which indicates that even the "fixed" design will likely end up with a significant share of GPUs which are defective on arrival.

All the OEMs were effectively flying blind with regard to actually stable clock speeds, lacking drivers for real-world validation of their designs, which didn't exactly help alleviate the issue ahead of time. Pushing the power envelope on a single die that far was a horrible idea; neither Nvidia's engineers nor the OEM engineers had sufficient experience with that.

It's going to be a month or two before stocks reach the level they were supposed to be at on launch day (goodbye Christmas business), and even then it's a lottery whether your 3080 / 3090 will be one running stable or not (or whether it will still run stable a few months from now, because the circuit is still stressed to the limit).

Kaotik said:
Because their Boost algorithm allows the GPU to boost that high at stock if the cooling and power limits allow it.

And that also adds to that... Even though at least that part could be fixed with a driver / firmware update, putting a hard cap on boost clocks independently from base clocks. And probably also cutting all the "OC" models down to base clocks, where the legal issues arise.

gongo · Sep 26, 2020

Kaotik said:
Because their Boost algorithm allows the GPU to boost that high at stock if the cooling and power limits allow it.

But I read at Igor's Lab that the crashing happens at 2 GHz and more.
I am not aware that the 3080 can hit 2 GHz without an offset overclock.
The next post mentioned that a stock-clocked Zotac 3080 is still crashing; could that be an isolated card issue?

Scott_Arm · Sep 26, 2020

They can hit near 2 GHz, just not sustained. They’ll boost for a few frames depending on the workload, just long enough to crash. I imagine the boosting behaviour just needs to be tweaked. Sounds like more of a firmware issue to me than something where cards would need to be rebuilt.

Scott_Arm · Sep 26, 2020

You guys may also want to watch the Buildzoid video. Literally none of the cards have Panasonic POSCAPs.

trinibwoy · Sep 26, 2020

It would seem the lack of scaling from the 3080 to the 3090 isn't that complicated. Looking at Techpowerup's 3090 data, the average 3090 boosts to about 100 MHz less than the 3080 FE. With the appropriate encouragement (i.e. a higher power limit) the 3090 hits similar clocks and scales pretty well, e.g. on the ASUS model.

With 3080 clocks at ~1930 MHz and the 3090 at ~1920 MHz, the 3090 advantage comes out to be:

4K FPS: +19%
Flops: +20%
Bandwidth: +23%
Fillrate: +16%

https://www.techpowerup.com/review/asus-geforce-rtx-3090-strix-oc/32.html
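
For anyone who wants to check the theoretical numbers, here is a rough sketch. The core counts, ROP counts and memory bandwidth are the published specs for the two cards and are assumptions added here (they are not stated above); the clocks are the ~1930 / ~1920 MHz averages quoted. The 4K FPS figure is a measurement from the review, so it is not reproduced.

```python
# Assumed published specs (not from the post): shader cores, ROPs, memory bandwidth.
SPECS = {
    "3080": {"cores": 8704, "rops": 96, "bandwidth_gbs": 760.0},
    "3090": {"cores": 10496, "rops": 112, "bandwidth_gbs": 936.0},
}

# Approximate average clocks quoted in the post, in MHz.
CLOCKS_MHZ = {"3080": 1930.0, "3090": 1920.0}


def flops(card):
    # 2 floating point operations per core per clock (one fused multiply-add).
    return 2 * SPECS[card]["cores"] * CLOCKS_MHZ[card]


def fillrate(card):
    # One pixel per ROP per clock.
    return SPECS[card]["rops"] * CLOCKS_MHZ[card]


def bandwidth(card):
    # Memory bandwidth does not depend on the core clock.
    return SPECS[card]["bandwidth_gbs"]


def theoretical_advantage():
    """Return the 3090's advantage over the 3080 in percent, per metric."""
    metrics = {"Flops": flops, "Bandwidth": bandwidth, "Fillrate": fillrate}
    return {
        name: (fn("3090") / fn("3080") - 1.0) * 100.0
        for name, fn in metrics.items()
    }


if __name__ == "__main__":
    for name, pct in theoretical_advantage().items():
        print(f"{name}: +{pct:.0f}%")
```

Since the two clocks are within about half a percent of each other, the Flops and Fillrate advantages are almost entirely the ratio of unit counts.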

Cyan · Sep 27, 2020

Scott_Arm said:
They can hit near 2 GHz, just not sustained. They’ll boost for a few frames depending on the workload, just long enough to crash. I imagine the boosting behaviour just needs to be tweaked. Sounds like more of a firmware issue to me than something where cards would need to be rebuilt.

Maybe it's not much, but it's such a bad sign that those cards should be replaced. How many months can they hold up before they get toasty or damaged? I mean, for someone who spent 800€-1500€ on a card, that's a pretty serious issue. They can tweak the BIOS, but those capacitors are still limited. This was caused by nVidia's secrecy, which I am fine with, but it's obvious that they are failing and must be replaced: even if you stay below certain thresholds, it can get worse with more use.

Cyan · Sep 27, 2020

Ext3h said:
Not many other options, even if that means that the planned launch stock has to be rebuilt (meaning at least 2 shipping round-trips before anything fixed ends up on the shelf), and all the broken models have to be flashed with a throttled firmware and rebranded.

I wouldn't be surprised at all if we were seeing the faulty 3080/3090 models again as "3070" or "3070 Ti", including their oversized PCBs, at most stripped from their coolers. I don't see any other option to prevent a total loss of the inventory.

As for the models already out there, a recall is the only option. Throttling the product when already owned by the customer would end up in lawsuits.

On the bright side, at least some AIBs (Asus, maybe others as well?) got it right before shipping their first batch.

For the remainder of the AIBs and NVidia themselves, this is going to leave a huge dent in the financial projections for 2020.

More expensive than POSCAPs, but still cheaper and less suited for high frequency applications compared to a whole array of 10 MLCC.

Nice digest of the whole topic in the current state:

https://www.reddit.com/r/hardware/comments/izmi1k

EDIT1:
https://forums.evga.com/m/tm.aspx?m=3095238
Official statement from EVGA, reviewers got faulty models, all production units are supposed to be cleared from this specific issue. The claim that 1 MLCC group is sufficient still needs to be validated though, as failures are still reported in the wild.

EDIT2:
And a couple of electrical engineers are voicing misgivings regarding MLCC too, as it's prone to aging, voltage and temperature related issues. If the MLCC groups end up failing too (over time, as it's doubtful whether they have sufficient safety margin), then this may yet turn into a perfect disaster.

EDIT3:
As to why the doubt about safety margin, some vendors have only 220uF capacity per group, some 330uF, some went for a more conservative 470uF per group. EVGA appears to be in the 220uF category. For comparison, Founders Edition uses 470uF per group on NVVDD rail, 220uF per group on MSVDD rail. Asus models are all 470uF.

EDIT4:

https://twitter.com/x/status/1309659834468298753
And that's an Asus TUF failing for a reviewer. So apparently it's not all about the caps, even though they do play a role as EVGA confirmed unambiguously. Coincidentally, some 20 series owners also report similar crashes with 30 series launch drivers though, so may as well be bad drivers as a cherry on top.

Darn!!! I thought the ASUS TUF cards were the safest and best of the new nVidia GPUs out there, and they cost the same as the Founders Edition, the cheapest on the market. Disappointed.

Scott_Arm · Sep 27, 2020

@Cyan It's definitely not good if cards are crashing, but I don't think we know that the cap selection is actually bad. Any card will crash if you push the frequency too high. If their boosting algorithm is a little too aggressive, that's all it'll take to make the card reset.

hughJ · Sep 27, 2020

I'm looking forward to AIB partners advertising next year's batch of cards like Kellogg's Raisin Bran.

Cat Merc · Sep 27, 2020

Igor's lab testing showed spikes of over 500W at <1ms. Could it be that the firmware or drivers are missing some hard limiter that would stop the card from going that high? If memory serves Turing and Pascal wouldn't peak that high relative to their averages.

LeStoffer · Sep 27, 2020

Scott_Arm said:
@Cyan It's definitely not good if cards are crashing, but I don't think we know that the cap selection is actually bad. Any card will crash if you push the frequency too high. If their boosting algorithm is a little too aggressive, that's all it'll take to make the card reset.

Cat Merc said:
Igor's lab testing showed spikes of over 500W at <1ms. Could it be that the firmware or drivers are missing some hard limiter that would stop the card from going that high? If memory serves Turing and Pascal wouldn't peak that high relative to their averages.

Well, Samsung's 8nm process wasn’t actually designed for big power-hungry chips, so I wouldn’t be surprised if the lower-bin parts just aren’t cutting the mustard. GDDR6X isn’t exactly mature technology either. Things can quickly come very close to the edge!

Are the cards slightly premature? Maybe.

CarstenS · Sep 27, 2020

Cat Merc said:
If memory serves Turing and Pascal wouldn't peak that high relative to their averages.

Short-term peaks were at 430-460 watts (depending on the card) with the 2080 Ti as well. Given its lower TDP rating, it did not peak substantially lower, relatively speaking.
