Speculation and Rumors: Nvidia Blackwell ...

Granath · Mar 19, 2024

Seanspeed said:
It's honestly super un-exciting. Per processor, the improvements really aren't that big at all. They must be quite assured in their current lead given this is supposed to be their new flagship for the next two years.

This new era of 'more performance by using more silicon' is gonna kinda suck.

You don't get the goal. And the goal was stock.

Seanspeed · Mar 19, 2024

Granath said:
You don't get the goal. And the goal was stock.

............what?

DegustatoR · Mar 19, 2024

Seanspeed said:
It's honestly super un-exciting. Per processor, the improvements really aren't that big at all.

2X perf/watt/price aren't big at all? What are you even talking about?

Seanspeed · Mar 19, 2024

DegustatoR said:
2X perf/watt/price aren't big at all? What are you even talking about?

I missed where they said anything about pricing. Care to point that out for me?

pharma · Mar 19, 2024

DegustatoR said:
2X perf/watt/price aren't big at all? What are you even talking about?

When the competition initially switched to a mcm design the increase was even less, though perhaps mcm architectural design accounts for the difference.

Edit: It might be time for a MCM architecture discussion thread.

DegustatoR · Mar 19, 2024

Seanspeed said:
I missed where they said anything about pricing. Care to point that out for me?

Nvidia is not talking about pricing for the B100 or B200 or its HGX B100 system boards, which will plug into existing HGX H100 server designs because they have the same thermals and therefore the same heat sinks. We expect for there to be at least a 25 percent premium on the price of the HGX B100 compared to the HGX H100, which would put it at around $250,000 for around 2.5X the performance, roughly speaking, at the same precision of math. That price could be — almost certainly will be — a lot higher on the street, of course, as happened with the Hopper generation.

With Blackwell GPUs, AI Gets Cheaper And Easier, Competing With Nvidia Gets Harder

If you want to take on Nvidia on its home turf of AI processing, then you had better bring more than your A game. You better bring your A++ game, several

www.nextplatform.com

And unofficial info is that there won't be any "premium" at all. But the last sentence is certainly true.
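For reference, the arithmetic behind the quoted estimate can be sketched as below. The three input figures come from the Next Platform quote above; the function and variable names are made up for illustration, and none of this is official Nvidia pricing.

```python
# Figures taken from the quoted Next Platform estimate (not official pricing).
HGX_B100_PRICE_USD = 250_000   # "around $250,000"
PRICE_PREMIUM = 0.25           # "at least a 25 percent premium" over HGX H100
PERF_RATIO = 2.5               # "around 2.5X the performance"


def implied_h100_price(b100_price, premium):
    """HGX H100 price implied by the B100 price and the assumed premium."""
    return b100_price / (1.0 + premium)


def perf_per_dollar_gain(perf_ratio, price_ratio):
    """How much more performance each dollar buys, generation over generation."""
    return perf_ratio / price_ratio


hgx_h100_price = implied_h100_price(HGX_B100_PRICE_USD, PRICE_PREMIUM)
gain_with_premium = perf_per_dollar_gain(PERF_RATIO, 1.0 + PRICE_PREMIUM)
gain_no_premium = perf_per_dollar_gain(PERF_RATIO, 1.0)  # the "no premium" rumour

print(f"implied HGX H100 price: ${hgx_h100_price:,.0f}")
print(f"perf per dollar with 25% premium: {gain_with_premium:.2f}x")
print(f"perf per dollar with no premium:  {gain_no_premium:.2f}x")
```

So the 2X perf/price figure lines up with the 25 percent premium case, and the no-premium rumour would make it 2.5X, before any street markup.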

pharma · Mar 19, 2024

Interesting quote from Ian Buck in the NextPlatform link above ...

“The transformer engine as it was originally invented with Hopper, what it does is it tracks the accuracy and the dynamic range of every layer of every tensor in the entire neural network as it proceeds in computing,” explained Buck. “And as the model is training over time, we are constantly monitoring the ranges of every layer and adapting to stay within the bounds of the numerical precision to get the best performance. In Hopper, this bookkeeping extends to a 1,000-way history to compute updates and scale factors to allow the entire computation to happen in only eight bits precision.

With Blackwell, we take it a step further. In hardware, we can adjust the scaling on every tensor. Blackwell supports micro tensor scaling is not the entire tensor, which we can still monitor, but now we can look at the individual elements within the tensor. To take this a step further, Blackwell’s second generation Transformer Engine will allow us to take AI computing to FP4, or using only four bits floating point representation, to perform the AI calculation. That’s four zeros and ones for every neuron, every connection – literally the numbers one through 16.

Getting down to that level of fine granularity is a miracle in itself. And the second generation Transformer Engine does that work combined with Blackwell’s micro tensor scaling, and that means we can deliver twice the amount of compute as before, we can double the effective bandwidth because from eight bits to four bits is half the size. And of course, double the model size can fit on an individual GPU.”
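A rough sketch of why finer-grained scaling matters at 4 bits, in plain Python. Assumptions that are not from the quote: the value grid is the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), the block size of 4 is arbitrary, and all function names are hypothetical. This is not how the Transformer Engine is implemented, just the general idea of one scale per small block versus one scale per tensor.

```python
# Magnitudes of a 4-bit E2M1 float. With the sign bit that is 16 bit patterns.
# Assumption: used for illustration only; the quote does not name the encoding.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_MAGNITUDES[-1]


def quantize_block(values):
    """Scale a block so its peak maps to FP4_MAX, round each element to the
    nearest grid value, then scale back to the original range."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0.0] * len(values)
    scale = peak / FP4_MAX
    out = []
    for v in values:
        mag = min(FP4_MAGNITUDES, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def quantize(values, block_size):
    """Quantize with one scale factor per `block_size` consecutive elements."""
    out = []
    for i in range(0, len(values), block_size):
        out.extend(quantize_block(values[i:i + block_size]))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# A tensor whose two halves have very different dynamic ranges.
tensor = [0.01, -0.02, 0.03, 0.015, 8.0, -12.0, 6.0, 4.0]

per_tensor = quantize(tensor, block_size=len(tensor))  # one scale for everything
per_block = quantize(tensor, block_size=4)             # one scale per 4 elements

print("per-tensor:", per_tensor, mean_abs_error(tensor, per_tensor))
print("per-block: ", per_block, mean_abs_error(tensor, per_block))
```

With a single scale, the small half of the tensor rounds to zero because the large half sets the range; with a scale per block, both halves land on the grid.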

trinibwoy · Mar 19, 2024

How does the 10TB/s inter-chip link compare to the fastest TSV interconnects? It certainly seems much faster than all the other interposer based 2D stuff out there right now.

Bondrewd · Mar 19, 2024

trinibwoy said:
How does the 10TB/s inter-chip link compare to the fastest TSV interconnects?

It's ok.
Should be enough for their rather dinky L2 (aka the inevitable happened).

trinibwoy said:
It certainly seems much faster than all the other interposer based 2D stuff out there right now.

MI300X is 20TB/s bidir for all the USRs. lol

troyan · Mar 19, 2024

It's 1.2 TB/s/dir. A little bit off from nVidia's 5 TB/s/dir:

Bondrewd · Mar 19, 2024

troyan said:
It's 1.2 TB/s/dir. A little bit off from nVidia's 5 TB/s/dir:

1.2TB/s east/west, 1.5TB/s north/south.
Per die, 4 die, a bit over 10TB/s aggregate. Pretty good for a 35um pitch thingy.
The thing was built with N3e refresh in mind which, again, not happening.

Arun · Mar 19, 2024

1.5TB/s for the fastest direction on MI300 and Hopper has a larger perimeter for the relevant side, but yeah, if you're looking at "bandwidth per mm" then it looks like it might be ~2x MI300 and other competitors? Arguably what matters more here is power efficiency though which is sadly nearly impossible to measure just for that part as an outsider.

Bondrewd · Mar 19, 2024

Arun said:
if you're looking at "bandwidth per mm" then it looks like it might be ~2x MI300

Not quite, but it does have a denser d2d pitch (25um, same as Apple multidie stuff. it's LSI after all).

Arun said:
Arguably what matters more here is power efficiency though which is sadly nearly impossible to measure just for that part as an outsider.

GPGPU stuff is always you-bench-it, since workloads differ pretty majorly.

neckthrough · Mar 19, 2024

Bondrewd said:
1.2TB/s east/west, 1.5TB/s north/south.
Per die, 4 die, a bit over 10TB/s aggregate. Pretty good for a 35um pitch thingy.
The thing was built with N3e refresh in mind which, again, not happening.

Based on what you described it seems the bisection bandwidth is 2.4TB/s (which is how you should measure aggregate network bandwidth instead of just adding up all the link capacities), unless I'm misunderstanding the topology.
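To make the two numbers concrete, here is a small sketch comparing the summed per-die figure with the bisection bandwidth. It assumes the four dies sit in a plain 2x2 grid with one link per adjacent pair, which may well not match the real topology; the link rates are the ones quoted above and the names are hypothetical.

```python
from itertools import combinations

# Per-direction link rates quoted in the thread, in TB/s.
EAST_WEST = 1.2
NORTH_SOUTH = 1.5

# Assumed layout (not confirmed):   0 - 1
#                                   |   |
#                                   2 - 3
LINKS = {
    (0, 1): EAST_WEST,
    (2, 3): EAST_WEST,
    (0, 2): NORTH_SOUTH,
    (1, 3): NORTH_SOUTH,
}


def summed_per_die_bandwidth(links):
    """Add up link capacity as seen from every die (each link counted twice)."""
    return 2 * sum(links.values())


def bisection_bandwidth(links, num_nodes):
    """Smallest capacity crossing any cut that splits the dies into equal halves."""
    best = float("inf")
    for half in combinations(range(num_nodes), num_nodes // 2):
        half = set(half)
        crossing = sum(
            bw for (a, b), bw in links.items() if (a in half) != (b in half)
        )
        best = min(best, crossing)
    return best


aggregate = summed_per_die_bandwidth(LINKS)
bisection = bisection_bandwidth(LINKS, num_nodes=4)
print(f"summed per-die bandwidth: {aggregate:.1f} TB/s")
print(f"bisection bandwidth:      {bisection:.1f} TB/s")
```

Under that assumption the worst cut separates the left pair from the right pair and crosses only the two east/west links, which is where the 2.4TB/s comes from.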

Bondrewd · Mar 19, 2024

neckthrough said:
unless I'm misunderstanding the topology

Yeah the topology there is real fucking weird, same as Genoa sIOD.
Not a mesh, not a ring either.

Dangerman · Mar 19, 2024

Considering there's apparently a 30% transistor density increase, I do wonder: where does that put the rumoured 96SM GB203? Because if that is technically smaller than AD103, like maybe 350mm2, I do wonder if that means a 700mm2 GB202, which on N4P/X sounds way more economical than a 600mm2 N3E die.

Bondrewd · Mar 19, 2024

Dangerman said:
sounds way more economical than a 600mm2 N3E die.

Yeah but perf/power benefits of N3e should not be disregarded like that.

trinibwoy · Mar 19, 2024

Bondrewd said:
It's ok.
Should be enough for their rather dinky L2 (aka the inevitable happened).

MI300X is 20TB/s bidir for all the USRs. lol

Dinky compared to Hopper or MI300? I’m not sure we’ll ever get a real look at the latencies and bandwidths at scale in real life workloads. What does L2 hit rate look like when your working set spans dozens of GPUs and all the network hops in between?

Another curiosity is Nvidia’s marketing of the 72 GB200’s as a “single” GPU. Even if technically true is that even practical? Presumably it’s easier to keep your working set as local as possible by orchestrating work distribution from the host CPU to individual compute nodes.

pharma · Mar 19, 2024

trinibwoy said:
I’m not sure we’ll ever get a real look at the latencies and bandwidths at scale in real life workloads. What does L2 hit rate look like when your working set spans dozens of GPUs and all the network hops in between?

Some of the at-scale bandwidth and latency info will likely be provided by the MLPerf benchmark results and analysis.

Bondrewd · Mar 19, 2024

trinibwoy said:
Dinky compared to Hopper or MI300?

Compared to L2's of yore in general.
Even Ponte Vecchio (lol) had higher per-base L2 bandwidth (before their NOC falls apart) than H100.
MI300 is a whole different beast, L2 isn't LLC there, not comparable. Just like RDNA (and mobile GPUs in general) it has more cache levels.

trinibwoy said:
What does L2 hit rate look like when your working set spans dozens of GPUs and all the network hops in between?

Your kernels are sliced to sit inside the L1/shmem slab as much as possible, at least on Hopper. See Flashattention2 etc.
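The tiling idea can be sketched in plain Python for a single attention row: walk the scores in small tiles and carry only a running max and running sums, so nothing larger than one tile has to be resident at a time. This is a simplified illustration with scalar values and a hypothetical tile size, not FlashAttention-2 code and not a CUDA kernel.

```python
import math


def attention_row_naive(scores, values):
    """Softmax-weighted sum that needs the whole score row at once."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_tiled(scores, values, tile=4):
    """Same result, visiting `tile` elements at a time with O(1) carried state."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # weighted value sum under the same normalisation
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale what was accumulated under the old max (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5, 2.5, -0.5, 4.0, 1.0]
values = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.0, 0.25, 4.0]
naive = attention_row_naive(scores, values)
tiled = attention_row_tiled(scores, values, tile=4)
print(naive, tiled)
```

The two results agree up to floating-point rounding, which is the property that lets the working set stay in the L1/shmem slab.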

Remember, on H200 the far L2 partition will be ~same effective b/w as the HBM pile.

trinibwoy said:
Even if technically true is that even practical?

It's nominally practical BUT there are caveats.
It's just nicer (way nicer actually) than hitting limited allotments of 400G Ethernet.
