Speculation and Rumors: Nvidia Blackwell ...

It's honestly super un-exciting. Per processor, the improvements really aren't that big at all. They must be quite assured in their current lead given this is supposed to be their new flagship for the next two years.

This new era of 'more performance by using more silicon' is gonna kinda suck.
You don't get the goal. And the goal was stock.
 
2X perf/watt/price aren't big at all? What are you even talking about?
When the competition initially switched to an MCM design the increase was even less, though perhaps MCM architectural design accounts for the difference.

Edit: It might be time for an MCM architecture discussion thread.
 
I missed where they said anything about pricing. Care to point that out for me?
Nvidia is not talking about pricing for the B100 or B200 or its HGX B100 system boards, which will plug into existing HGX H100 server designs because they have the same thermals and therefore the same heat sinks. We expect for there to be at least a 25 percent premium on the price of the HGX B100 compared to the HGX H100, which would put it at around $250,000 for around 2.5X the performance, roughly speaking, at the same precision of math. That price could be — almost certainly will be — a lot higher on the street, of course, as happened with the Hopper generation.
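For reference, the arithmetic implied by that quote can be sketched in a few lines. Every input below is the article's estimate (the $250,000 figure, the 25 percent premium, the 2.5X performance), not confirmed Nvidia pricing, and the variable names are just illustrative.

```python
# Back-of-the-envelope check of the figures in the quote above.
# All inputs are the article's estimates, not official Nvidia pricing.
b100_price = 250_000   # estimated HGX B100 price in USD
premium = 0.25         # "at least a 25 percent premium" over the HGX H100
perf_ratio = 2.5       # "around 2.5X the performance" at the same precision

# Price of the HGX H100 that these two estimates imply.
h100_price = b100_price / (1 + premium)

# Performance per dollar relative to the HGX H100.
perf_per_dollar_ratio = perf_ratio / (1 + premium)

print(f"Implied HGX H100 price: ${h100_price:,.0f}")
print(f"Perf per dollar vs HGX H100: {perf_per_dollar_ratio:.1f}x")
```

Under those assumptions the perf-per-dollar gain comes out to roughly 2x. Street pricing would shift the result in either direction.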
And unofficial info is that there won't be any "premium" at all. But the last sentence is certainly true.
 
Interesting quote from Ian Buck in the NextPlatform link above ...

“The transformer engine as it was originally invented with Hopper, what it does is it tracks the accuracy and the dynamic range of every layer of every tensor in the entire neural network as it proceeds in computing,” explained Buck. “And as the model is training over time, we are constantly monitoring the ranges of every layer and adapting to stay within the bounds of the numerical precision to get the best performance. In Hopper, this bookkeeping extends to a 1,000-way history to compute updates and scale factors to allow the entire computation to happen in only eight bits precision.

With Blackwell, we take it a step further. In hardware, we can adjust the scaling on every tensor. Blackwell supports micro tensor scaling: not the entire tensor, which we can still monitor, but now we can look at the individual elements within the tensor. To take this a step further, Blackwell’s second generation Transformer Engine will allow us to take AI computing to FP4, or using only four bits floating point representation, to perform the AI calculation. That’s four zeros and ones for every neuron, every connection – literally the numbers one through 16.

Getting down to that level of fine granularity is a miracle in itself. And the second generation Transformer Engine does that work combined with Blackwell’s micro tensor scaling, and that means we can deliver twice the amount of compute as before, we can double the effective bandwidth because from eight bits to four bits is half the size. And of course, double the model size can fit on an individual GPU.”
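To make the idea of finer-grained scaling concrete, here is a minimal sketch of 4-bit quantization with one scale per tensor versus one scale per small block. It is not Nvidia's implementation. The quote does not say which FP4 encoding or block size Blackwell uses, so the E2M1 value grid and the block size of 8 are assumptions, and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
# ASSUMPTION: the quote does not specify the FP4 encoding; E2M1 is used for illustration.
# 8 magnitudes x 2 signs = 16 bit patterns.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_MAGNITUDES[-1]

def quantize_block(block):
    """Quantize one block with a single shared scale; return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    scale = amax / FP4_MAX if amax > 0 else 1.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable FP4 magnitude.
        mag = min(FP4_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return scale, out

def quantize(values, block_size):
    """Split values into blocks of block_size and quantize each block independently."""
    result = []
    for i in range(0, len(values), block_size):
        _, deq = quantize_block(values[i:i + block_size])
        result.extend(deq)
    return result

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# A toy "tensor" whose first half is small and whose second half is large.
tensor = [0.01 * k for k in range(1, 9)] + [10.0 * k for k in range(1, 9)]

per_tensor = quantize(tensor, block_size=len(tensor))  # one scale for everything
per_block = quantize(tensor, block_size=8)             # one scale per 8 elements (assumed size)

# Compare the error on the small-valued half only.
err_tensor = max_abs_error(tensor[:8], per_tensor[:8])
err_block = max_abs_error(tensor[:8], per_block[:8])
print(f"per-tensor scale error: {err_tensor:.4f}")
print(f"per-block scale error:  {err_block:.4f}")
```

With a single scale the large values dominate and the small half collapses to zero. Giving each block its own scale keeps the small values representable, which is the general motivation for scaling at a finer granularity than the whole tensor.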
 
How does the 10TB/s inter-chip link compare to the fastest TSV interconnects? It certainly seems much faster than all the other interposer based 2D stuff out there right now.
 
It's 1.2 TB/s/dir. A little bit off from nVidia's 5 TB/s/dir:
1.2TB/s east/west, 1.5TB/s north/south.
Per die, 4 die, a bit over 10TB/s aggregate. Pretty good for a 35um pitch thingy.
The thing was built with N3e refresh in mind which, again, not happening.
 
1.5TB/s for the fastest direction on MI300 and Hopper has a larger perimeter for the relevant side, but yeah, if you're looking at "bandwidth per mm" then it looks like it might be ~2x MI300 and other competitors? Arguably what matters more here is power efficiency though which is sadly nearly impossible to measure just for that part as an outsider.
 
if you're looking at "bandwidth per mm" then it looks like it might be ~2x MI300
Not quite, but it does have a denser d2d pitch (25um, same as Apple multidie stuff. it's LSI after all).
Arguably what matters more here is power efficiency though which is sadly nearly impossible to measure just for that part as an outsider.
GPGPU stuff is always you-bench-it, since workloads differ pretty majorly.
 
1.2TB/s east/west, 1.5TB/s north/south.
Per die, 4 die, a bit over 10TB/s aggregate. Pretty good for a 35um pitch thingy.
The thing was built with N3e refresh in mind which, again, not happening.
Based on what you described it seems the bisection bandwidth is 2.4TB/s (which is how you should measure aggregate network bandwidth instead of just adding up all the link capacities), unless I'm misunderstanding the topology.
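A quick sketch of that calculation, assuming the four dies sit in a 2x2 grid with each die linked only to its horizontal and vertical neighbour. The topology is an assumption, as are the die labels. The bandwidths are the per-direction figures quoted above.

```python
from itertools import combinations

# ASSUMED topology: four dies in a 2x2 grid, each linked only to its
# horizontal and vertical neighbour. Bandwidths are the per-direction
# figures quoted in the thread, in TB/s. Die labels are illustrative.
links = {
    ("A", "B"): 1.2,  # top row, east/west
    ("C", "D"): 1.2,  # bottom row, east/west
    ("A", "C"): 1.5,  # left column, north/south
    ("B", "D"): 1.5,  # right column, north/south
}
dies = ["A", "B", "C", "D"]

def cut_bandwidth(half):
    """Total bandwidth of the links crossing between `half` and the remaining dies."""
    return sum(bw for (u, v), bw in links.items() if (u in half) != (v in half))

# Bisection bandwidth: the minimum cut over all splits into two equal halves.
bisection = min(cut_bandwidth(set(h)) for h in combinations(dies, len(dies) // 2))
total = sum(links.values())

print(f"Sum of link capacities: {total:.1f} TB/s per direction")
print(f"Bisection bandwidth:    {bisection:.1f} TB/s per direction")
```

Under this layout the narrowest split cuts the two east/west links, giving 2.4TB/s, while summing all four links gives 5.4TB/s per direction, or 10.8TB/s counting both directions, which lines up with the "a bit over 10TB/s aggregate" figure.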
 
Considering there's apparently a 30% transistor density increase, I do wonder where that puts the rumoured 96SM GB203. If that is actually smaller than AD103, maybe 350mm2, does that mean a 700mm2 GB202? On N4P/X that sounds way more economical than a 600mm2 N3E die.
 
It's ok.
Should be enough for their rather dinky L2 (aka the inevitable happened).

MI300X is 20TB/s bidir for all the USRs. lol

Dinky compared to Hopper or MI300? I’m not sure we’ll ever get a real look at the latencies and bandwidths at scale in real life workloads. What does L2 hit rate look like when your working set spans dozens of GPUs and all the network hops in between?

Another curiosity is Nvidia’s marketing of the 72 GB200’s as a “single” GPU. Even if technically true is that even practical? Presumably it’s easier to keep your working set as local as possible by orchestrating work distribution from the host CPU to individual compute nodes.
 
I’m not sure we’ll ever get a real look at the latencies and bandwidths at scale in real life workloads. What does L2 hit rate look like when your working set spans dozens of GPUs and all the network hops in between?
Some of the at-scale bandwidth and latency info will likely come from the MLPerf benchmark results and analysis.
 
Dinky compared to Hopper or MI300?
Compared to L2s of yore in general.
Even Ponte Vecchio (lol) had higher per-base L2 bandwidth (before their NOC falls apart) than H100.
MI300 is a whole different beast, L2 isn't LLC there, not comparable. Just like RDNA (and mobile GPUs in general) it has more cache levels.
What does L2 hit rate look like when your working set spans dozens of GPUs and all the network hops in between?
Your kernels are sliced to sit inside the L1/shmem slab as much as possible, at least on Hopper. See FlashAttention-2 etc.
Remember, on H200 the far L2 partition will be ~same effective b/w as the HBM pile.
Even if technically true is that even practical?
It's nominally practical BUT there are caveats.
It's just nicer (way nicer actually) than hitting limited allotments of 400G Ethernet.
 