This is expensive, low-throughput packaging that's not suitable for mainstream parts.
Agreed, I can't see 3D packaging being cost-effective and high enough volume for an AD104-level product in 2024, so not for Blackwell, but eventually it should come down in price.
It's up.
It's actually up.
N3E wins you on power/perf, but it costs more per xtor, which is stinky.
It's a real issue, and N2 is even worse in that respect (to the point of majorly influencing packaging decisions for Venice-dense).
No, you're still focusing on price per transistor, but I claimed something more subtle & important: "logic transistor times iso-power performance". The "logic" excludes I/O and SRAM transistors which are not scaling. And if you take NVIDIA SMs and increase their clock frequency by 15% at iso-power, you only need 100 SMs to achieve the same performance as 115 SMs (excluding latency tolerance).
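The SM trade-off above can be sketched in a few lines. This is a toy model using only the numbers from the post (115 SMs, +15% clock at iso-power), with the simplifying assumption that throughput scales linearly with SM count and clock:

```python
# Toy model: iso-power clock gains vs. adding more SMs.
# Assumption: performance scales linearly with (SM count * clock).
base_sms = 115
clock_gain = 1.15  # +15% clock at iso-power, figure from the post

# SMs needed at the higher clock to match 115 SMs at the old clock:
sms_needed = base_sms / clock_gain
print(round(sms_needed))  # → 100
```

So the same chip-level performance comes from ~13% fewer SMs (and hence less area), which is the sense in which "logic transistor times iso-power performance" matters more than raw price per transistor.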
N3E logic area is supposedly 0.625x of N5 and iso-power perf is +18%, so "(1/0.625)*1.18 ≈ 1.89" = +89% *logic-only* performance for the same area (old numbers from TSMC, I think it might actually be slightly better now for some FinFlex variants, I'm not entirely sure). Even if you believe N3E really is 20K/wafer and 25% more expensive than N5, that's still ~51% higher perf/$ (1.89/1.25 ≈ 1.51, assuming comparable yields). Unfortunately, that's only logic transistors, so without 2.5D/3D packaging to get rid of all the I/O and much of the SRAM, the overall perf/$ isn't good.
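Spelling out that arithmetic, using only the figures quoted above (the 1.25x wafer cost ratio is the assumption being entertained, not a confirmed number):

```python
# N3E vs N5, logic-only perf/area and perf/$ (figures quoted above).
area_scaling = 0.625   # N3E logic area relative to N5
iso_power_perf = 1.18  # +18% perf at iso-power
cost_ratio = 1.25      # assumed N3E wafer cost vs N5 (20K vs ~16K)

perf_per_area = (1 / area_scaling) * iso_power_perf
print(f"{perf_per_area:.2f}x")    # ≈ 1.89x, i.e. +89% logic-only perf per area

perf_per_dollar = perf_per_area / cost_ratio
print(f"{perf_per_dollar:.2f}x")  # ≈ 1.51x logic-only perf per dollar
```

Note that both numbers apply only to the logic portion of the die; un-scaling I/O and SRAM drag the whole-die figure well below this, which is the point about packaging.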
NV has no experience or any real pathfinding in doing very fancy 3D stuff; it's all Intel and AMD (all things MCM in general have been CPU land for decades).
This isn't really their job: it's TSMC's job. AMD and other fabless companies can and do contribute significantly, but fabless customers don't typically lead or own that kind of research AFAIK. So being a fast follower shouldn't be an issue, even without doing as much of that kind of work yourself. That's also partly why Intel seems so far ahead: they are marketing what their fabs/foundries are capable of far in advance of product availability. Finally, NVIDIA and other fabless companies are under no obligation to publish any of their internal testing or research (unless they are required/encouraged to do so under contracts with the DoE/DARPA/etc.)
Both B100 and N100 are very simple, straightforward products (a big reticle-sized die × n on CoWoS-L).
Are you saying multiple identical dies in a 2.5D package à la Apple or AMD MI250X (but working as a single GPU à la MI300X)?
If we're only talking about the AI flagship, and it's 2.5D rather than 3D, then sure, maybe. That sounds like ~1500W of power to me though (compared to Hopper, which is 700W on 4N for a single reticle-sized die), which isn't realistic. So it'd have to be significantly undervolted, with lower clocks than H100. I stumbled upon this article which claims 1000W for B100:
https://asia.nikkei.com/Spotlight/Supply-Chain/AI-boom-drives-demand-for-server-cooling-technology
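The back-of-envelope power reasoning can be sketched with the usual crude dynamic-power model, P ∝ V²·f. The 10% downclock below is purely illustrative (not a leaked spec), just to show the kind of undervolt a 1000W budget would imply:

```python
# Crude dynamic-power model: P ∝ V^2 * f. Illustrative numbers only.
hopper_power = 700.0  # W, single reticle-sized die on 4N (from the post)
dies = 2

naive_power = hopper_power * dies
print(naive_power)  # → 1400.0 W, roughly the ~1500W concern above

# To hit the article's 1000W, each die must drop to ~71% of H100 power.
target_per_die = 1000.0 / dies
power_scale = target_per_die / hopper_power  # ≈ 0.71
# Assuming (illustratively) a 10% downclock, the voltage scale needed:
freq_scale = 0.9
volt_scale = (power_scale / freq_scale) ** 0.5  # from P ∝ V^2 * f
print(f"{volt_scale:.2f}")  # ≈ 0.89, i.e. roughly an 11% undervolt
```

That's a big enough voltage/frequency cut that per-die performance would land visibly below H100's clocks, which is the point about it having to be "significantly undervolted".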
You'll have to forgive me if I don't share your confidence and don't really believe you, but I have been around this circus enough times in the last 20+ years of major GPU architecture releases (e.g. NV30/NV40/G80/Fermi/Volta/etc. and R420/R600/GCN/etc., which were much bigger steps than the ones in-between), and the rumours around those big jumps are nearly always completely wrong, even more so than usual. So I'm not going to take anything too seriously at this stage (and you definitely shouldn't take anything I say too seriously either!)