Speculation and Rumors: Nvidia Blackwell ...

So not exactly "zero" apparently.
That's not 3D. They also did a paper on basic 2D/2.5D signaling about two years ago (the former became GH200, the latter B100).
Nothing at the scale of Intel Labs' efforts, or of AMD Research publishing the paper that became MI300A back in the ~2015 round of the DOE FastForward2 program.
 
Also this paper is from 2021.
It's from very late 2021, so it's effectively two years old.
The paper itself is mostly 2.5D; they don't really focus much on fancy 3D APU design proposals.
Compare it to AMD's FastForward2 EHP paper and...
Either way, Intel has technically been ahead of everyone in the 2.5D/3D race for a while now; it's just that their attempts at it in GPGPU failed hard for unrelated reasons (RIP ATS-2T/4T, and holy shit, PVC... just no).

Packaging is CPU land's expertise; Intel in particular was the first mover in many things (FCBGA, ABF laminates, active interposers with LKF/Lakefield, etc.).
 
This is expensive, low-throughput packaging that's not suitable for mainstream parts.
Agreed, I can't see 3D packaging being cost-effective and high enough volume for an AD104-level product in 2024, so not for Blackwell, but eventually it should come down in price.

It's up.
It's actually up.
N3E wins you power/perf, but it costs more per transistor, which is stinky.
It's a real issue, and N2 is even worse in that regard (to the point of majorly influencing packaging decisions for Venice-Dense).
No, you're still focusing on price per transistor, but I claimed something more subtle & important: "logic transistor times iso-power performance". The "logic" excludes I/O and SRAM transistors which are not scaling. And if you take NVIDIA SMs and increase their clock frequency by 15% at iso-power, you only need 100 SMs to achieve the same performance as 115 SMs (excluding latency tolerance).

N3E logic area is supposedly 0.625x of N5 and iso-power perf is +18% so "(1/0.625)*1.18 = 1.89" = +89% *logic-only* performance for the same area (old numbers from TSMC, I think it might actually be slightly better now for some FinFlex variants, I'm not entirely sure). Even if you believe N3E really is 20K/wafer and 25% more expensive than N5, that's still ~51% higher perf/$ (1.89/1.25 ≈ 1.51, assuming comparable yields). Unfortunately, that's only logic transistors, so without 2.5D/3D packaging to get rid of all the I/O and much of the SRAM, the overall perf/$ isn't good.
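For what it's worth, here's a quick back-of-the-envelope sketch of that arithmetic; the scaling and cost numbers are the assumed TSMC-quoted figures from above, not confirmed ones:

```python
# Back-of-the-envelope N3E vs N5 logic perf/$ math, using the figures above.
logic_area_scale = 0.625  # N3E logic area vs N5 (quoted TSMC number)
iso_power_perf = 1.18     # +18% performance at iso-power

# Logic-only performance for the same die area:
perf_per_area = (1 / logic_area_scale) * iso_power_perf
print(f"logic perf per area: {perf_per_area:.2f}x")  # ~1.89x, i.e. +89%

# Hypothetical 25% wafer-cost premium over N5:
wafer_cost_premium = 1.25
perf_per_dollar = perf_per_area / wafer_cost_premium
print(f"logic perf per $: {perf_per_dollar:.2f}x")   # ~1.51x, i.e. ~+51%
```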
NV has no experience with, or any real pathfinding into, very fancy 3D stuff; it's all Intel and AMD (all things MCM have in general been CPU land for many decades now).
This isn't really their job: it's TSMC's job. AMD and other fabless companies can and do contribute significantly but fabless customers don't typically lead or own that kind of research AFAIK. So it shouldn't be an issue being a fast follower without doing as much of that kind of work yourself. That's also partly why Intel seems so far ahead: they are marketing what their fabs/foundries are capable of far in advance of product availability. Finally, NVIDIA or other fabless companies are under no obligation to publish any of their internal testing or research (unless they are required/encouraged to do so for contracts with the DoE/DARPA/etc.)

Both B100 and N100 are very simple, straightforward products (a big reticle-sized die times n on CoWoS-L).
Are you saying multiple identical dies in a 2.5D package à la Apple or AMD MI250X (but working as a single GPU à la MI300X)?

If we're only talking about the AI flagship, and it's 2.5D rather than 3D, then sure, maybe. That sounds like ~1500W of power to me though (compared to Hopper, which is 700W on 4N for a single reticle-sized die), which isn't realistic. So it'd have to be significantly undervolted with lower clocks than H100. I stumbled upon this article which claims 1000W for B100: https://asia.nikkei.com/Spotlight/Supply-Chain/AI-boom-drives-demand-for-server-cooling-technology

You'll have to forgive me if I don't share your confidence and don't really believe you, but I have been around this circus enough times over the last 20+ years of major GPU architecture releases (e.g. NV30/NV40/G80/Fermi/Volta/etc. and R420/R600/GCN/etc., which were much bigger steps than the ones in-between), and the rumours are nearly always completely wrong, even more so than usual. So I'm not going to take anything too seriously at this stage (and you definitely shouldn't take anything I say too seriously either!)
 
but eventually it should come down in price.
It's not really that expensive; it's just a throughput problem.
D2W hybrid bonding is a pretty slow process.
And if you take NVIDIA SMs and increase their clock frequency by 15% at iso-power, you only need 100 SMs to achieve the same performance as 115 SMs (excluding latency tolerance).
Yeah, but that's borderline whatever: an incremental bump not a single soul on the planet cares about.
People bitch about 10% perf/$ bumps as is; spacing them over distinct full-node shrinks is only gonna make the situation worse.
N3E logic area is supposedly 0.625x of N5 and iso-power perf is +18% so "(1/0.625)*1.18 = 1.89" = +89% *logic-only* performance for the same area (old numbers from TSMC, I think it might actually be slightly better now for some FinFlex variants, I'm not entirely sure)
GPUs are like half SRAM.
It just doesn't work out in the end at all.
Even if you believe N3E really is 20K/wafer and 25% more expensive than N5
It's $22K per wafer versus $16-17K for N5 derivatives.
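Plugging those wafer prices into the same formula from above (taking ~$16.5K as the N5 midpoint, which is my pick within the quoted range):

```python
# Same logic perf/$ formula, with the wafer prices quoted here.
perf_per_area = (1 / 0.625) * 1.18  # ~1.89x, as computed above
cost_premium = 22_000 / 16_500      # ~1.33x ($22K vs the N5 midpoint)
print(f"logic perf per $: {perf_per_area / cost_premium:.2f}x")  # ~1.42x
```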
and much of the SRAM
You can't; the SMs/CUs themselves contain the bulk of a GPU's SRAM.
This isn't really their job
No, it's definitely theirs. TSMC can only do the actual W2W/D2W bonding; to make use of it you need to design the system around it, and that's the hard part.
Please consult the MI300 ISSCC slideware; I assume you have an IEEE membership for that.
AMD and other fabless companies can and do contribute significantly but fabless customers don't typically lead or own that kind of research AFAIK
Oh no, they do a lot.
"Fabless" is a borderline misnomer these days given how involved customers are in both design and pathfinding for bleeding-edge nodes.
(unless they are required/encouraged to do so for contracts with the DoE/DARPA/etc.)
Yeah, that's how we get stuff done: DOE programs.
Are you saying multiple identical dies in a 2.5D package à la Apple or AMD MI250X (but working as a single GPU à la MI300X)?
Yes and maybe. Probably not. Possible Superchip™ hackery.
That sounds like ~1500W of power to me though
Downclock it, et voilà.
Also remember that B100 is not 100% more HBM, just 33% more (8 stacks vs. 6), and obviously the I/O is mostly the same even at higher signaling rates.
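As a toy illustration of how far the downclock would need to go, assuming a crude dynamic-power model where P scales with f·V² and V tracks f (so P ∝ f³). That model is my simplification and ignores static power, voltage floors, and HBM power entirely:

```python
# Toy DVFS sketch: P ~ f * V^2 with V tracking f, so P ~ f^3.
# Crude assumption; ignores static power, voltage floors, and memory power.
h100_tdp_w = 700               # single reticle-sized Hopper die
naive_dual_w = 2 * h100_tdp_w  # 1400 W for two dies at full H100 clocks
target_w = 1000                # the figure claimed for B100 above

clock_scale = (target_w / naive_dual_w) ** (1 / 3)
print(f"clock scale: {clock_scale:.2f}")                 # ~0.89, an ~11% downclock
print(f"rough perf vs 2x H100: {2 * clock_scale:.2f}x")  # ~1.79x at 1000 W
```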
I stumbled upon this article which claims 1000W for B100:
That's correct.
You'll have to forgive me if I don't share your confidence and don't really believe you, but I have been around this circus enough times over the last 20+ years of major GPU architecture releases (e.g. NV30/NV40/G80/Fermi/Volta/etc. and R420/R600/GCN/etc., which were much bigger steps than the ones in-between), and the rumours are nearly always completely wrong, even more so than usual.
I'm every bit as old as you are so no worries.
 
Superchip is multiple packages on the same PCB, right?
I think it applies to anything NUCA with more than one die.
The branding is weird anyway, since Grace-Grace is just a standard 2P system on a stick.

B100 is two dies on one CoWoS-L interposer (that's organic "goo" with 25 µm pitch silicon bridges).
 
B100 is two dies on one CoWoS-L interposer (that's organic "goo" with 25 µm pitch silicon bridges).

4 HBM stacks per die. I don't think they can manufacture a single die with enough PHY space for 8 HBM stacks.

Also much higher FP8 performance, and native support for even smaller FP data types.
 
Warning. Stay on technical topics and improve the SNR. Warring is not acceptable.
Did you know that Hopper has just 50% more transistors than Ampere and can be up to 3x faster? Now compare that to MI300X, which has 3x more transistors for barely better performance than H100.

Nvidia has an architecture which is cutting edge. They don't need "fancy" chiplets. Blackwell will instantly kill MI300X. AMD is so desperate that they have even announced a switch to HBM3e as soon as possible, only months after the release of MI300X...
 
4 HBM stacks per die. I don't think they can manufacture a single die with enough PHY space for 8 HBM stacks.

Also much higher FP8 performance, and native support for even smaller FP data types.

Why not 6 stacks + NVLink like H100? Too much power? Nvidia is vulnerable long-term on memory capacity, as generative models can't get enough. 12 x 12-high HBM3e stacks would help hold the fort. AMD will almost certainly be at 8 x 12-high in the nearish future.
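Rough capacity math for those configurations, assuming 24 Gb (3 GB) HBM3e dies (my assumption, not a confirmed spec):

```python
# HBM capacity sketch for the stack configurations mentioned above.
GB_PER_DIE = 3  # assumes 24 Gb HBM3e dies

def hbm_capacity_gb(stacks: int, dies_high: int) -> int:
    """Total on-package HBM capacity in GB."""
    return stacks * dies_high * GB_PER_DIE

print(hbm_capacity_gb(8, 12))   # 288 GB for 8 x 12-high
print(hbm_capacity_gb(12, 12))  # 432 GB for 12 x 12-high
```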

As much as SRAM doesn't scale, I still wouldn't bet on Nvidia moving to a small L2 and a big off-die L3 with Blackwell. If that's where they want to go eventually, it would make more sense to work out the L3 kinks on-die first, similar to AMD.
 
Did you know that Hopper has just 50% more transistors than Ampere and can be up to 3x faster? Now compare that to MI300X, which has 3x more transistors for barely better performance than H100.

Nvidia has an architecture which is cutting edge. They don't need "fancy" chiplets. Blackwell will instantly kill MI300X. AMD is so desperate that they have even announced a switch to HBM3e as soon as possible, only months after the release of MI300X...
I think we need to see how well GH200 performs in the upcoming MLPerf tests against the MI300 series. Then we can better speculate on the architectural uplifts needed to maintain pace with Blackwell and the MI400 counterpart. The architecture reveal should happen March 21 at GTC 2024.
 
Did you know that Hopper has just 50% more transistors than Ampere and can be up to 3x faster?
"up to" isn't a metric, and Hopper has horrible vibes wrt getting any perf out of the thing.
for barely better performance than H100.
proofs?
Blackwell will instantly kill MI300X.
"MI400A will instantly kill B100".
So what?
AMD is so desperate that they have even announced a switch to HBM3e as soon as possible, only months after the release of MI300X...
Bro, that's cope; the refresh was already planned back in EOY 2022.
Why not 6 stacks + NVLink like H100?
Need a 4x-reticle interposer (well, CoWoS-L) for that, which isn't due until 2025 IIRC.
 
"up to" isn't a metric, and Hopper has horrible vibes wrt getting any perf out of the thing.

proofs?

"MI400A will instantly kill B100".
So what?

Bro, that's cope; the refresh was already planned back in EOY 2022.

Need a 4x-reticle interposer (well, CoWoS-L) for that, which isn't due until 2025 IIRC.
MI400 will be a year later than B100
 
Folks, let's steer the conversation back to Blackwell and wait for benchmarks to corroborate some of this information.
GTC 2024 and MLPerf are just around the corner and should bring plenty of performance data on the current offerings and on Blackwell performance expectations.
 
You're unlikely to have MI300X numbers under MS Azure Early Access, so why even talk.

Well that’s the point, there aren’t any 3rd party numbers so nothing to get excited about yet.

DGX is 8 GPUs.

I was thinking of this guy.

 
Not like B100 is anything interesting.
You get two reticle-sized dies, 4 HBM stacks each. And more math. And more watts. All very by-the-numbers stuff that's been happening ever since V100 (it's usually double the ideal-workload perf at 30-40% more power).

That floor plan seems reasonable and probably doesn't need to be any more interesting in order to achieve its goals. Nvidia's most interesting changes have been in the internals and software for a while now. The high-level SM structure hasn't changed that much.

Dell already spilled the beans that they're a kilowatt, so you're welcome.

Funny, wonder if Dell got in trouble for casually leaking specs like that.

H100 80GB is at 700W. 1000W for a dual-big-chip B100 192GB would likely be a significant improvement in perf/watt.
 