Speculation and Rumors: Nvidia Blackwell ...

  • Thread starter Deleted member 2197
Considering there's apparently a 30% transistor density increase, I do wonder: where does that put the rumoured 96SM GB203? If it ends up smaller than AD103, maybe around 350mm2, does that mean a 700mm2 GB202? On N4P/X that sounds way more economical than a 600mm2 N3E die.

I’ll believe it when I see it. In a chiplet setup where do you put the “gigathread scheduler”. AFAIK graphics APIs don’t expose multiple graphics contexts and you wouldn’t want applications to attempt load balancing Direct3D commands across chiplets anyway.

My wild ass guess is that the graphics manifestation of chiplets will have a scheduler/IO chiplet surrounded by compute chiplets with the scheduler being a single point of entry for graphics work.
 
AFAIK graphics APIs don’t expose multiple graphics contexts and you wouldn’t want applications to attempt load balancing Direct3D commands across chiplets anyway.
Yes that's the hard part of graphics chiplets.
You have one queue and one context only.
Hello 1995.
My wild ass guess is that the graphics manifestation of chiplets will have a scheduler/IO chiplet surrounded by compute chiplets with the scheduler being a single point of entry for graphics work.
AMD's doing weird distributed scheduling hacks for Navi40 (ded)/50(alive).
We'll see if it works lmao.
 
Yeah but perf/power benefits of N3e should not be disregarded like that.
True but if N4X is being used for GB202 there's probably not much difference anyways. And with 128MB L2 Cache & 512-bit bus there's probably not much die space saveable having it on N3E.
I’ll believe it when I see it. In a chiplet setup where do you put the “gigathread scheduler”. AFAIK graphics APIs don’t expose multiple graphics contexts and you wouldn’t want applications to attempt load balancing Direct3D commands across chiplets anyway.

My wild ass guess is that the graphics manifestation of chiplets will have a scheduler/IO chiplet surrounded by compute chiplets with the scheduler being a single point of entry for graphics work.
Well I think it'll be like A100, something that looks like a proto-MCM die, which Bryan tweeted about earlier; two halves that communicate with each other, basically being a "prototype".

Also @xpea, how big is GB203? if it's what I suspect with an uneducated guess, is it 350mm2?
 
True but if N4X is being used for GB202 there's probably not much difference anyways.
N3e is appreciably lower power at GPU voltage envelopes (you're not dumping 1.4v innit are you?).
that looks like a proto-MCM die, which Bryan tweeted about earlier; two halves that communicate with each other, basically being a "prototype".
That's the easy part.
You need to maintain weirdly serial graphics context across multiple dies.
It might break (ok it WILL break in funny ways, wonder how many hacks will be necessary to cope with that).
 
Compared to L2's of yore in general.
Even Ponte Vecchio (lol) had higher per-base L2 bandwidth (before their NOC falls apart) than H100.
MI300 is a whole different beast, L2 isn't LLC there, not comparable. Just like RDNA (and mobile GPUs in general) it has more cache levels.

Your kernels are sliced to sit inside the L1/shmem slab as much as possible, at least on Hopper. See Flashattention2 etc.
View attachment 11028
Remember, on H200 the far L2 partition will be ~same effective b/w as the HBM pile.

It's nominally practical BUT there are caveats.
It's just nicer (way nicer actually) than hitting limited allotments of 400G Ethernet.
For reference the diagram was taken from the Chips and Cheese cdna 3 architectural paper.
 
From the PVC article actually.
PVC is still the bane of my existence. I can't let it go.
Figures. Chips and Cheese almost confirms what you've stated. Didn't AMD provide their MI300X white paper yet?
Assuming what you are saying is true it's pretty certain AMD would have already spilled the beans (for sales prospects) unless things did not turn out as expected.
 
Didn't AMD provide their MI300X white paper yet?
shader ISA manual yes (funnily, they gutted FP6/4 section out of it recently).
Whitepaper too but it's fluff.
They had an ISSCC session on MI300 but it has no NOC topology details, same as their Genoa/Bergamo ones (and it's also rewarmed marketing slides + USR detailed breakdown).
 
shader ISA manual yes (funnily, they gutted FP6/4 section out of it recently).
Whitepaper too but it's fluff.
They had an ISSCC session on MI300 but it has no NOC topology details, same as their Genoa/Bergamo ones.
Well I doubt AMD is any different than Nvidia in terms of providing "fluff" to generate sales. It does seem to be pretty important information not to include in their marketing material.
 
cute and chungus.
Spacing is yeah, it's definitely LSI.
What happened to the narrative that Nvidia was years behind AMD and that they have zero experience in Chiplets?
The truth is what I said: Nvidia has test chips of every single TSMC packaging tech and their iteration. A journey that started more than a decade ago. And BOOM! B100 proves it with a more effective chiplet design than weird extremist MI300X.
Don't get me wrong, MI300X is a marvel of packaging but it's too complex and too early. It's like AMD engineers wanted to show off and prove to the world that they are the best. One Nvidia engineer told me about MI300X packaging "they want to run before knowing how to walk". Nvidia did the right thing with B100. Split the cache on A100, optimize it with H100 and start with only 2 Chiplets on B100 with enough LSI BW to avoid any performance loss. Incremental innovation that WORKS.

That being said, R100 has some crazy stuff going on with its IOD that is hard to understand on the surface, even with the packaging diagram in front of you...
 
What happened to the narrative that Nvidia was years behind AMD and that they have zero experience in Chiplets?
from that POV Apple was ahead of the universe since it shipped LSI in 2021.
I mean, AMD shipped FOEB in 2021 too.
Just not for d2d between MI200s, though they floated the idea.
So they still are.
Congrats.
optimize it with H100
They didn't, H100 is notorious for dogshit compute/L2 bandwidth ratio and don't you dare hitting the far L2.
but it's too complex and too early
It's neither complex nor early.
B100 proves it with a more effective chiplet design than weird extremist MI300X.
It's neither weird nor extremist outside of XCD being maybe a bit too tiny.
Which they're gonna converge in MI400 anyway.
Incremental innovation that WORKS.
yea no shit, MALL is 3 years old, and AMD SoIC implementations are 2 years old.
You're welcome, it's all in MI300 ISSCC slide deck.
That being said, R100 has some crazy stuff going on with its IOD that is hard to understand on the surface, even with the packaging diagram in front of you...
N100 is next, it's quad-die, 830mm^2 each.
You're welcome.
 
Also @xpea, how big is GB203? if it's what I suspect with an uneducated guess, is it 350mm2?
I don't know yet which design is the final one, but N5/N4 has exceptional yields so 2 versions were on the table. One in your ballpark and one much bigger. The latter is mainly for the pro client market but can also address gaming if needed. Will know soon when I go back to Taiwan...
 
from that POV Apple was ahead of the universe since it shipped LSI in 2021.
I mean, AMD shipped FOEB in 2021 too.
Just not for d2d between MI200s, though they floated the idea.
So they still are.
Congrats.

They didn't, H100 is notorious for dogshit compute/L2 bandwidth ratio and don't you dare hitting the far L2.

It's neither complex nor early.

It's neither weird nor extremist outside of XCD being maybe a bit too tiny.
Which they're gonna converge in MI400 anyway.

yea no shit, MALL is 3 years old, and AMD SoIC implementations are 2 years old.
You're welcome, it's all in MI300 ISSCC slide deck.

N100 is next, it's quad-die, 830mm^2 each.
You're welcome.

For those of us in the back, what are R100 and N100?
 
from that POV Apple was ahead of the universe since it shipped LSI in 2021.
I mean, AMD shipped FOEB in 2021 too.
Just not for d2d between MI200s, though they floated the idea.
So they still are.
Congrats.

They didn't, H100 is notorious for dogshit compute/L2 bandwidth ratio and don't you dare hitting the far L2.

It's neither complex nor early.

It's neither weird nor extremist outside of XCD being maybe a bit too tiny.
Which they're gonna converge in MI400 anyway.

yea no shit, MALL is 3 years old, and AMD SoIC implementations are 2 years old.
You're welcome, it's all in MI300 ISSCC slide deck.

N100 is next, it's quad-die, 830mm^2 each.
You're welcome.
Oh no again 😂
I don't have time today to debunk your usual BS so quickly before I go to work:
1. Timing matters. In 2021 TSMC did not have enough packaging capacity to sustain Nvidia's needs. We are not talking about a few thousand units and peanuts AMD money here. It's real big boy business.

2. Ask yourself why H100 doesn't have the best L2 perf and you will have your answer. It's a logical step on Nvidia's road to Chiplets.

3. MI300X inter die BW is not enough and not optimal. They should have solved the BW problem with 2 dies first before jumping to more complex topology. Same mistake as PVC disaster.

4. LOL thanks captain obvious for telling us that a chip is designed a few years before hitting the market. Oh, the same logic applies to everybody. What's next? Water is wet?

5. You are wrong but I obviously can't say more.

You're welcome too and have a nice day
 
For those of us in the back, what are R100 and N100?
N100 is the B100 successor.
Even more Si spam, 4 830mm^2 in quadrants.
What's further no idea since that's where NV packaging commitments end for now.
In 2021 TSMC did not have enough packaging capacity to sustain Nvidia's needs.
Yea they did.
AMD wasn't even using TSMC, it was an ASE FOEB part (MI200, that is).
Ask yourself why H100 doesn't have the best L2 perf and you will have your answer
The primary partition L2 was/is also slow.
You pipe into L1/shmem slab or you die.
LOL thanks captain obvious for telling us that a chip is designed a few years before hitting the market
?
AMD shipped MALL (the chungus SLC) in N21, exactly a bit over 3 years ago in december'20.
SoIC was H1'22 with 5800X3D.
MI300X inter die BW is not enough and not optimal
Like twice of what B100 has.
It's enough.
Weird cope but w/ever.
 
There is no 'unofficial info'. You're literally just posting pure, blind speculation. smh
Sure. Pure, blind speculation.


Do note that this blurb on the part of WCCFTech:
marking a considerable bump from the Hopper generation
is just more misinformation. 30-40K is inside the pricing range of the Hopper generation of products. B200 will probably get to the same price at the high end eventually.

For those of us in the back, what are R100 and N100?
R100 is Rubin, the next DC μarch after Blackwell.
N100 was shown in some leaked slide IIRC but I think that was just a placeholder for R100 as in "Next-100".
 
I don't know yet which design is the final one, but N5/N4 has exceptional yields so 2 versions were on the table. One in your ballpark and one much bigger. The latter is mainly for the pro client market but can also address gaming if needed. Will know soon when I go back to Taiwan...
Was the 2nd version the one with more than 96 SMs (116? 108? SMs, was it)? And the 1st the one in my ballpark w/ 96 SMs? I presume that with Blackwell's 30% higher transistor density a 96SM GB203 would be slightly smaller than AD103, maybe 350mm2.
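
For anyone who wants to check the napkin math behind that 350mm2 guess, here is a rough sketch. It assumes the whole die scales linearly with SM count and inversely with transistor density, which is a big simplification (cache, memory PHYs and IO do not shrink or grow that way). The AD103 figures are the approximate public ones (about 379mm2, 80 SMs on the full die); the 96 SM count and the 30% density gain are just the rumours from this thread, and the function name is made up for illustration.

```python
# Back-of-envelope die-area scaling. A sketch under stated assumptions,
# not a real floorplan estimate.
AD103_AREA_MM2 = 379.0   # approximate public AD103 die size
AD103_SMS = 80           # SM count of the full AD103 die
GB203_SMS = 96           # rumoured, per this thread
DENSITY_GAIN = 1.30      # rumoured 30% transistor density increase


def scaled_area(base_area_mm2, base_sms, new_sms, density_gain):
    """Naive estimate: area grows with SM count and shrinks with density.

    Assumes every block on the die (SMs, L2, PHYs, IO) scales the same way,
    which real designs do not do.
    """
    return base_area_mm2 * (new_sms / base_sms) / density_gain


estimate = scaled_area(AD103_AREA_MM2, AD103_SMS, GB203_SMS, DENSITY_GAIN)
print(f"Naive GB203 estimate: {estimate:.0f} mm^2")
```

Under those assumptions it lands right around 350mm2, so a bit smaller than AD103. If the memory bus or L2 grows relative to AD103 the real number would come out higher.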
 