Speculation and Rumors: Nvidia Blackwell ...

GT200 was 512-bit as well, but the photos of those cards I can find from back in the day also show 8 chips on each side of the PCB.
Actually, scratch that: there were versions with 16 chips on the front side too.

GDDR3 was a lot more relaxed with trace length and routing than GDDR6/7 though; I'm not sure the trace lengths there would fly nowadays.
 
I'm a bit skeptical it's 512-bit, but if it is, then with GDDR7 and the maximum number of DRAM chips, they could go up to 96GiB on professional cards which is more than the original H100... that'd be quite interesting! As long as they don't include NVLink, there's still strong differentiation for B100/GB200, and they'd probably limit the consumer/GeForce version to 32GiB as well (half as many chips, and 16Gbit rather than 24Gbit).
It wouldn't be too shocking; according to some of the rumors, the 5090/GB202 is basically just a consumer/light-business-focused AI GPU more than anything.
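Quick sanity check on that capacity math, as a minimal sketch (the clamshell layout for the pro card and the 16/24 Gbit densities are the assumptions from the post above, not confirmed specs):

```python
# Capacity math: a 512-bit bus is 16 x 32-bit GDDR7 channels (one DRAM package per
# 32 bits of bus width), clamshell mode puts two packages on each channel, and the
# assumed densities are 24 Gbit (pro) and 16 Gbit (consumer) parts.

def capacity_gib(bus_width_bits: int, packages_per_channel: int, density_gbit: int) -> float:
    channels = bus_width_bits // 32               # one package per 32-bit channel
    packages = channels * packages_per_channel    # 2 per channel in clamshell mode
    return packages * density_gbit / 8            # Gbit -> GiB

print(capacity_gib(512, 2, 24))  # professional card, clamshell, 24 Gbit -> 96.0 GiB
print(capacity_gib(512, 1, 16))  # consumer card, single-sided, 16 Gbit  -> 32.0 GiB
```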
 
I wonder what TDP the big AI version and the consumer version will run at?

For AI the obvious choice is 600W, but for consumer? 600W isn't impossible to run under air cooling today, but I'm under the impression that's only possible in highly engineered server cases; 450W was the limit not that long ago. I'll stick with 525W, as that would be the PCIE5 slot (75W) plus 450W of cable draw, and would be on the edge of possible for, say, a quad-slot cooler with one of the fancy new expensive thermal pastes. I think?
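Spelling that arithmetic out (75W slot power is per the PCIe spec; 450W and 600W are standard ratings of the 12VHPWR/12V-2x6 cable, not anything Blackwell-specific):

```python
# Where the 525W figure comes from: PCIe slot power plus a 450W-limited cable.
PCIE_SLOT_W = 75        # max power from the PCIe slot
CABLE_450_W = 450       # a common cable/connector configuration
CABLE_MAX_W = 600       # maximum rating of the 12VHPWR / 12V-2x6 connector

print(PCIE_SLOT_W + CABLE_450_W)   # 525 -> the number used above
print(PCIE_SLOT_W + CABLE_MAX_W)   # 675 -> theoretical ceiling with a 600W cable
```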
 
I'm a bit skeptical it's 512-bit, but if it is, then with GDDR7 and the maximum number of DRAM chips, they could go up to 96GiB on professional cards which is more than the original H100... that'd be quite interesting! As long as they don't include NVLink, there's still strong differentiation for B100/GB200, and they'd probably limit the consumer/GeForce version to 32GiB as well (half as many chips, and 16Gbit rather than 24Gbit).

Rumour is it could be a B200-like MCM of 2 x 256-bit GB203s, so they wouldn't need to design and tape out a 512-bit limited-volume chip. If NV has managed to make it work with B100, it could work for consumer graphics as well, perhaps?
 
Rumour is it could be a B200-like MCM of 2 x 256-bit GB203s, so they wouldn't need to design and tape out a 512-bit limited-volume chip. If NV has managed to make it work with B100, it could work for consumer graphics as well, perhaps?
That's possible, but I'm not 100% sure NVIDIA ever actually solved making graphics work across their split-L2 GPUs. On H100, we know graphics performance is intentionally terrible (it's not clear whether they genuinely managed to save most of the silicon and graphics is only present in a single GPC, or whether they're just doing yield recovery so they don't lose chips when the graphics parts that are still there are defective)... but A100 is a bit more mysterious to me.

The only public benchmark of A100 for Graphics I could find is... *sigh* GFXBench:
- A100: https://gfxbench.com/device.jsp?D=NVIDIA+A100-PCIE-40GB&testgroup=overall
- RTX 3090: https://gfxbench.com/device.jsp?D=NVIDIA+GeForce+RTX+3090&testgroup=overall

Ignore Tessellation/ALU2 because they're just broken on very fast chips afaict, but everything else that I'd half-trust-on-a-good-day is just terrible for a chip that size, e.g. texturing test (ignore absolute numbers, I know exactly how it works, and it's a bit silly - but relative should be fine on the same architecture) and Manhattan 3.1.1 1440p Offscreen.

So I did some quick calculations... A100 40GiB PCIe has 7 out of 8 GPCs active with 108 SMs in total. Let's assume they *don't* support graphics using both sides of the GPU, and they use the side with 4 GPCs active, so performance should be 4/7th of peak, i.e. ~61 SMs out of 108. That compares to 82 SMs on RTX 3090 (each with 2xFP32 though). The claimed boost clocks are ~1.4GHz for A100 and ~1.7GHz for RTX 3090. Now let's look at the texturing tests...

- A100: 263GTexel/s
- RTX 3090: 452GTexel/s

263 * (82 SMs / 61 SMs) * (1.7GHz / 1.4GHz) = 429GTexel/s... which is only ~5% less than what the RTX 3090 actually gets.
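Here's that estimate as a quick script for anyone who wants to poke at the assumptions (the 61-SM figure is the half-GPU guess from above, not official data):

```python
# Back-of-envelope scaling: take A100's GFXBench texturing score and scale it by
# SM count and boost clock to compare against the RTX 3090's measured score.

a100_score = 263.0         # GTexel/s, GFXBench listing above
rtx3090_score = 452.0      # GTexel/s, GFXBench listing above

a100_graphics_sms = 61     # ~4/7 of 108 SMs, assuming only one side does graphics
rtx3090_sms = 82
a100_clock_ghz, rtx3090_clock_ghz = 1.4, 1.7   # approximate boost clocks

scaled = a100_score * (rtx3090_sms / a100_graphics_sms) * (rtx3090_clock_ghz / a100_clock_ghz)
print(f"A100 scaled to 3090 config: {scaled:.0f} GTexel/s "
      f"({scaled / rtx3090_score - 1:+.0%} vs the 3090's measured score)")
# -> ~429 GTexel/s, about 5% below what the RTX 3090 actually gets
```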

So yeah, it's hard to tell, and using random unverified GFXBench scores is even worse than having to rely on GFXBench in the first place (ugh)... but my suspicion is that NVIDIA didn't solve that problem in the A100 timeframe and they are only using half the GPU in Graphics Mode. Of course, even if that's true, they could have solved it for Blackwell... it's not an easy problem to solve efficiently though. That's a LOT more (kinds of) inter-chip communication than just L2 cache coherency.
 
Wouldn't A100's on-chip L2 interconnect be a lot faster than any hypothetical NVLink/MCM setup? If they couldn't make it work for A100, it seems there's little chance it'll work for B20x.
 
Rumor is that it's like GA100, i.e. a split-chip design in all but physical implementation. Which in turn means that they don't need to solve anything with it just yet.

But wouldn't 512 bit GDDR7 be overkill then? If it was still GDDR6X, then the only way to get increased bandwidth would be increased bus width. But given that they're practically on the same node so presumably not a huge increase in chip size/performance, GDDR7 itself should be enough of an increase with the same 384 bit bus.
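To put rough numbers on that (the GDDR7 data rate here is an assumption based on announced first-generation parts, not a leak):

```python
# Rough peak-bandwidth comparison. The 21 Gbps G6X figure is what AD102 (RTX 4090)
# actually ships with; 28 Gbps GDDR7 is an assumed first-gen speed.

def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps

configs = [
    ("AD102: 384-bit GDDR6X @ 21 Gbps", 384, 21.0),
    ("384-bit GDDR7 @ 28 Gbps (assumed)", 384, 28.0),
    ("512-bit GDDR7 @ 28 Gbps (assumed)", 512, 28.0),
]
for name, width, rate in configs:
    print(f"{name}: {peak_bandwidth_gbs(width, rate):.0f} GB/s")
# -> 1008, 1344 and 1792 GB/s: GDDR7 alone is already a ~33% uplift on the same
#    384-bit bus, which is the point being made above.
```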
 
But wouldn't 512 bit GDDR7 be overkill then? If it was still GDDR6X, then the only way to get increased bandwidth would be increased bus width. But given that they're practically on the same node so presumably not a huge increase in chip size/performance, GDDR7 itself should be enough of an increase with the same 384 bit bus.
It's entirely possible that they've decided to dial the L2 cache way back; SRAM scales worse than logic on newer nodes, and if it's another monolithic design, less L2 means more area for other things.
Even AMD reduced the infinity cache size from RDNA2 to RDNA3.
 
It's entirely possible that they've decided to dial the L2 cache way back; SRAM scales worse than logic on newer nodes, and if it's another monolithic design, less L2 means more area for other things.
Even AMD reduced the infinity cache size from RDNA2 to RDNA3.
SRAM scaling indeed has become much worse, and is actually zero from 5nm to 3nm (TSMC N3E). But given that they're most likely still on 4nm, there shouldn't be any change as such. AMD reduced the infinity cache (though supposedly made it faster) when they moved from 7nm to 5nm, which still had some SRAM scaling. Keep in mind analog also does not scale as well on newer nodes, so the additional memory controllers will also take disproportionately more area. Another thing to keep in mind is that SRAM's benefit is not just performance but power as well, since you use more power going off-chip. So it's a fine balance to figure out the best combination of all of these factors.
 
But wouldn't 512 bit GDDR7 be overkill then? If it was still GDDR6X, then the only way to get increased bandwidth would be increased bus width. But given that they're practically on the same node so presumably not a huge increase in chip size/performance, GDDR7 itself should be enough of an increase with the same 384 bit bus.
AD102 is 600mm^2, reticle limit is what, about 830mm^2? It is also presumably a new architecture which may pack more performance per transistor - although that would be reversing the trend of late.
They could lower the cache sizes, they could use "slow" G7, or they could just go with 512 bits simply because the chip is really 2 chips with 256-bit buses.
Considering that the expectations are for GB203 to be on par with or faster than AD102 while using a 256-bit bus and G7 (which should give +/- the same bandwidth as AD102's 384-bit G6X), a "dumb" doubling of that would require a 512-bit bus with the same memory, or it would have a worse flop/byte ratio than GB203.
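A minimal sketch of that ratio argument in relative units (treating GB202 as exactly two GB203s is the rumour's assumption, not anything confirmed):

```python
# Treat GB203's compute and bandwidth as 1.0 each (256-bit GDDR7 ~= AD102's
# 384-bit G6X, per the post above) and model GB202 as two GB203s, i.e. 2x the
# compute. Which bus width keeps the flop/byte ratio from getting worse?

gb203_flops = 1.0
gb203_bandwidth = 1.0                        # 256-bit GDDR7 baseline

gb202_flops = 2 * gb203_flops                # two GB203-sized halves
for bus_bits in (384, 512):
    gb202_bandwidth = gb203_bandwidth * bus_bits / 256   # same GDDR7 speed, wider bus
    ratio = (gb202_flops / gb202_bandwidth) / (gb203_flops / gb203_bandwidth)
    print(f"{bus_bits}-bit GB202: flop/byte ratio is {ratio:.2f}x GB203's")
# -> 1.33x with a 384-bit bus (relatively bandwidth-starved), 1.00x with 512-bit.
```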
 
Will Nvidia be conservative at the start, at least for the first year, given what their competitors are apparently doing? For example, reducing total power output, cutting GPUs down more, etc., then a year later (or whenever they have more competition) rolling out Ti/Super models and essentially double dipping, something like the Kepler gen? Or is it going to be releases as normal with price increases, a $2k+ 5090 for example, then potentially reduced prices later if competition forces them to?
 
I would be surprised if they push power limits up again when AMD will not be competing. 3nm should be enough for a 30-50% increase without increasing power, no?
 
Why 525? The max power of the new power connector is 600W, so if we're talking about what's possible, then that's the maximum.
In the absence of competition, though, there are no reasons to push it that high.


They haven't. 30 and 40 series both top out at 450W.

The 30 series topped out at 350W at launch and only went higher when the 3090 Ti was released. The 40 series was going to do something similar: launch was 450W, but there was a 4090 Ti that was cancelled.

And why not, why stop? They'll have competition next year, and they can charge more $$$ for it regardless. Nvidia doesn't care about your power bill. If it's not the 5090 then it's the "RTX Titan" at 500-600W next year. But as I explained, I'm not sure you can air-cool 600W in a user-controlled case today; it wasn't long ago that 450W was the max for an air-cooled device even in servers. Today servers can do 600W, barely, but that's in specially designed, super-high-end server cases.

But in a user-controlled case? 500-something watts seems plausible, and they've been working on it since the cancelled 4090 Ti (which apparently had problems with the power supply more than the cooling). Watercooling has too many problems to sell to consumers; even a lot of datacenter stuff wants to avoid it (hence the new high-end server chassis that can handle 600W device cooling).
 