Speculation and Rumors: Nvidia Blackwell ...

GT200 was 512-bit as well, but the photos of those I can find from back in the day also show 8 chips on each side of the PCB:
Actually, scratch that, there were versions with 16 chips on the frontside too, like this one:

GDDR3 was a lot more relaxed with trace length and routing than GDDR6/7 though; I'm not sure the trace lengths there would fly nowadays.
 
I'm a bit skeptical it's 512-bit, but if it is, then with GDDR7 and the maximum number of DRAM chips, they could go up to 96GiB on professional cards which is more than the original H100... that'd be quite interesting! As long as they don't include NVLink, there's still strong differentiation for B100/GB200, and they'd probably limit the consumer/GeForce version to 32GiB as well (half as many chips, and 16Gbit rather than 24Gbit).
Going by some of the rumors, it wouldn't be too shocking if the 5090/GB202 is basically just a consumer/light-business-focused AI GPU more than anything.
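For reference, a rough sketch of the capacity math behind those numbers (the chip counts and densities are my assumptions: a 512-bit bus, clamshell on the pro card, 24Gbit vs 16Gbit devices):

```python
# Rough GDDR7 capacity math for a hypothetical 512-bit GB202 (assumed configs, not confirmed specs).
BUS_WIDTH_BITS = 512
BITS_PER_DEVICE = 32                         # each GDDR7 device has a 32-bit interface

devices = BUS_WIDTH_BITS // BITS_PER_DEVICE  # 16 devices single-sided
clamshell_devices = devices * 2              # 32 devices with both sides of the PCB populated

pro_gib = clamshell_devices * 24 // 8        # clamshell + 24Gbit devices -> 96 GiB
consumer_gib = devices * 16 // 8             # single-sided + 16Gbit devices -> 32 GiB
print(pro_gib, consumer_gib)                 # 96 32
```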
 
I wonder what TDP the big AI version and the consumer version will run at?

For AI the obvious choice is 600W, but for consumer? 600W isn't impossible to run under air cooling today, but I'm under the impression that's only feasible in highly engineered server cases; 450W was the limit not that long ago. I'll stick with 525W, as that would be the PCIe 5.0 slot (75W) plus 450W of cable draw, and would be on the edge of possible for, say, a quad-slot cooler with one of the fancy new expensive thermal pastes. I think?
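Just to spell out where that 525W comes from (the 75W slot and 450W cable figures are the standard spec limits; treating their sum as the board power budget is my assumption):

```python
# Hypothetical 525W board power budget: PCIe slot plus one 16-pin (12VHPWR) cable at its 450W setting.
PCIE_SLOT_W = 75      # PCIe CEM slot power limit
CABLE_W = 450         # one 16-pin cable configured for 450W

print(PCIE_SLOT_W + CABLE_W)   # 525
```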
 
I'm a bit skeptical it's 512-bit, but if it is, then with GDDR7 and the maximum number of DRAM chips, they could go up to 96GiB on professional cards which is more than the original H100... that'd be quite interesting! As long as they don't include NVLink, there's still strong differentiation for B100/GB200, and they'd probably limit the consumer/GeForce version to 32GiB as well (half as many chips, and 16Gbit rather than 24Gbit).

Rumour is it could be a B200-like MCM of 2 x 256-bit GB203s. So they wouldn't need to design and tape out a 512-bit limited-volume chip. If NV has managed to make it work with B100, it could work for consumer graphics as well, perhaps?
 
Rumour is it could be a B200-like MCM of 2 x 256-bit GB203s. So they wouldn't need to design and tape out a 512-bit limited-volume chip. If NV has managed to make it work with B100, it could work for consumer graphics as well, perhaps?
That's possible, but I'm not 100% sure whether NVIDIA ever actually solved making graphics work across their split-L2 GPUs. On H100, we know graphics performance is intentionally terrible (it's not clear whether they genuinely managed to save most of the silicon and graphics hardware is only present in a single GPC, or whether they're just doing yield recovery so they don't lose chips when the graphics parts that are still there are defective)... But A100 is a bit more mysterious to me.

The only public benchmark of A100 for Graphics I could find is... *sigh* GFXBench:
- A100: https://gfxbench.com/device.jsp?D=NVIDIA+A100-PCIE-40GB&testgroup=overall
- RTX 3090: https://gfxbench.com/device.jsp?D=NVIDIA+GeForce+RTX+3090&testgroup=overall

Ignore Tessellation/ALU2 because they're just broken on very fast chips afaict, but everything else that I'd half-trust-on-a-good-day is just terrible for a chip that size, e.g. the texturing test (ignore the absolute numbers; I know exactly how it works and it's a bit silly, but relative results should be fine on the same architecture) and Manhattan 3.1.1 1440p Offscreen.

So I did some quick calculations... A100 40GiB PCIe has 7 out of 8 GPCs active with 108 SMs in total. Let's assume they *don't* support graphics using both sides of the GPU, and they use the side with 4 GPCs active, so performance should be 4/7ths of peak, i.e. ~61 SMs out of 108. That compares to 82 SMs on RTX 3090 (each with 2xFP32 though). The claimed boost clocks are ~1.4GHz for A100 and ~1.7GHz for RTX 3090. Now let's look at the texturing tests...

- A100: 263GTexel/s
- RTX 3090: 452GTexel/s

263 * (82 SMs / 61 SMs) * (1.7GHz / 1.4GHz) = 429GTexel/s... which is only ~5% less than what the RTX 3090 actually gets.
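Same back-of-envelope estimate as a quick script, using the numbers above (the ~61-SM figure is my half-the-GPU assumption, not anything confirmed):

```python
# Scale A100's GFXBench texturing score to RTX 3090 terms, assuming only the 4-GPC
# "side" of A100 (~61 of its 108 SMs) is usable in graphics mode.
a100_gtexels = 263.0
a100_graphics_sms = 61          # ~4/7 of 108 SMs
a100_boost_ghz = 1.4
rtx3090_sms = 82
rtx3090_boost_ghz = 1.7
rtx3090_gtexels = 452.0

scaled = a100_gtexels * (rtx3090_sms / a100_graphics_sms) * (rtx3090_boost_ghz / a100_boost_ghz)
deficit_pct = 100.0 * (1.0 - scaled / rtx3090_gtexels)
print(round(scaled), round(deficit_pct, 1))   # ~429 GTexel/s, ~5% below the 3090's measured 452
```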

So yeah, it's hard to tell, and using random unverified GFXBench numbers is even worse than using GFXBench in the first place (ugh)... but my suspicion is that NVIDIA didn't solve that problem in the A100 timeframe and they are only using half the GPU in graphics mode. Of course, even if that's true, they could have solved it for Blackwell... it's not an easy problem to solve efficiently though. That's a LOT more (kinds of) inter-chip communication than just L2 cache coherency.
 
Wouldn't A100's on-chip L2 interconnect be a lot faster than any hypothetical NVLink/MCM setup? If they couldn't make it work for A100, it seems there's little chance it'll work for B20x.
 
Rumor is that it's like GA100, i.e. a split-chip design in all but physical implementation. Which in turn means that they don't need to solve anything there just yet.

But wouldn't 512 bit GDDR7 be overkill then? If it was still GDDR6X, then the only way to get increased bandwidth would be increased bus width. But given that they're practically on the same node so presumably not a huge increase in chip size/performance, GDDR7 itself should be enough of an increase with the same 384 bit bus.
 
But wouldn't 512 bit GDDR7 be overkill then? If it was still GDDR6X, then the only way to get increased bandwidth would be increased bus width. But given that they're practically on the same node so presumably not a huge increase in chip size/performance, GDDR7 itself should be enough of an increase with the same 384 bit bus.
It's entirely possible that they've decided to dial the L2 cache way back; SRAM scales worse than logic on newer nodes, and if it's another monolithic design, less L2 means more area for other things.
Even AMD reduced the infinity cache size from RDNA2 to RDNA3.
 
It's entirely possible that they've decided to dial the L2 cache way back; SRAM scales worse than logic on newer nodes, and if it's another monolithic design, less L2 means more area for other things.
Even AMD reduced the infinity cache size from RDNA2 to RDNA3.
SRAM scaling has indeed become much worse, and is actually zero from 5nm to 3nm (TSMC N3E). But given that they're most likely still on 4nm, there shouldn't be any change as such. AMD reduced the Infinity Cache (though supposedly made it faster) when they moved from 7nm to 5nm, which still had some SRAM scaling. Keep in mind analog also doesn't scale as well on newer nodes, so the additional memory controllers will take disproportionately more area. Another thing to keep in mind is that SRAM's benefit isn't just performance but power as well, since you burn more power going off-chip. So it's a fine balance to figure out the best combination of all of these factors.
 
But wouldn't 512 bit GDDR7 be overkill then? If it was still GDDR6X, then the only way to get increased bandwidth would be increased bus width. But given that they're practically on the same node so presumably not a huge increase in chip size/performance, GDDR7 itself should be enough of an increase with the same 384 bit bus.
AD102 is 600mm^2, reticle limit is what, about 830mm^2? It is also presumably a new architecture which may pack more performance per transistor - although that would be reversing the trend of late.
They could lower the cache sizes, they could use "slow" G7, or they could just go with 512 bits simply because the chip is really two chips with 256-bit buses.
Considering that the expectation is for GB203 to be on par with or faster than AD102 while using a 256-bit bus and G7 (which should give +/- the same bandwidth as AD102's 384-bit G6X), the "dumb" doubling of that would require a 512-bit bus with the same memory, or it would have a worse flop/byte ratio than GB203.
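Quick bandwidth numbers for those configurations (the ~21Gbps G6X and ~32Gbps G7 data rates are my assumptions):

```python
# Back-of-envelope memory bandwidth: bus width (bits) x per-pin data rate (Gbps) / 8 -> GB/s.
def mem_bandwidth_gb_s(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits * gbps_per_pin / 8.0

print(mem_bandwidth_gb_s(384, 21))   # AD102: 384-bit G6X @ 21Gbps -> 1008 GB/s
print(mem_bandwidth_gb_s(256, 32))   # GB203 rumour: 256-bit G7 @ 32Gbps -> 1024 GB/s (about AD102-level)
print(mem_bandwidth_gb_s(512, 32))   # "dumb" doubling: 512-bit G7 @ 32Gbps -> 2048 GB/s
```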
 
Will Nvidia be conservative at the start, at least for the first year, given what their competitors are apparently doing? For example, reducing total power, cutting GPUs down more, etc., then a year later / whenever they have more competition, rolling out Ti/Super models and essentially double dipping, something like the Kepler gen? Or is it going to be releases as normal with price increases, a $2k+ 5090 for example, then potentially reduced prices later if competition forces them to?
 
I would be surprised if they push power limits up again when AMD will not be competing. 3nm should be enough for a 30-50% increase without increasing power, no?
 
Why 525? Max power of the new power connector is 600W, so if we're talking about what's possible then that's the maximum.
In the absence of competition, though, there's no reason to push it that high.


They haven't. 30 and 40 series both top out at 450W.

The 30 series topped out at 350W at launch; that was the max until the 3090 Ti was released. The 40 series was going to do something similar: launch topped out at 450W, but there was a 4090 Ti that was cancelled.

And why not, why stop? They'll have competition next year, and they can charge more $$$ for it regardless. Nvidia doesn't care about your power bill. If it's not the 5090 then it's the "RTX Titan" at 500-600W next year. But as I explained, I'm not sure you can air-cool 600W in a user-controlled case today; it wasn't long ago that 450W was the max for an air-cooled device even in servers. Today servers can do 600W, barely, but that's in specially designed, super-high-end server cases.

But in a user-controlled case? 500-something watts seems plausible, and they've been working on it since the cancelled 4090 Ti (which apparently had problems with the power supply more than the cooling). Watercooling has too many problems to sell to consumers; even a lot of datacenter stuff wants to avoid it (hence the new high-end server chassis that can handle 600W of device cooling).
 