Speculation and Rumors: Nvidia Blackwell ...

trinibwoy · Mar 19, 2024

Dangerman said:
Considering there's apparently a 30% transistor density increase, I do wonder: where does that put the rumoured 96SM GB203? Because if that is technically smaller than AD103, maybe 350mm2, does that mean a 700mm2 GB202, which on N4P/X sounds way more economical than a 600mm2 N3E die?

I’ll believe it when I see it. In a chiplet setup where do you put the “gigathread scheduler”? AFAIK graphics APIs don’t expose multiple graphics contexts and you wouldn’t want applications to attempt load balancing Direct3D commands across chiplets anyway.

My wild ass guess is that the graphics manifestation of chiplets will have a scheduler/IO chiplet surrounded by compute chiplets with the scheduler being a single point of entry for graphics work.

Bondrewd · Mar 19, 2024

trinibwoy said:
AFAIK graphics APIs don’t expose multiple graphics contexts and you wouldn’t want applications to attempt load balancing Direct3D commands across chiplets anyway.

Yes that's the hard part of graphics chiplets.
You have one queue and one context only.
Hello 1995.

trinibwoy said:
My wild ass guess is that the graphics manifestation of chiplets will have a scheduler/IO chiplet surrounded by compute chiplets with the scheduler being a single point of entry for graphics work.

AMD's doing weird distributed scheduling hacks for Navi40 (ded)/50(alive).
We'll see if it works lmao.

Dangerman · Mar 19, 2024

Bondrewd said:
Yeah but perf/power benefits of N3e should not be disregarded like that.

True but if N4X is being used for GB202 there's probably not much difference anyways. And with 128MB L2 Cache & 512-bit bus there's probably not much die space saveable having it on N3E.

trinibwoy said:
I’ll believe it when I see it. In a chiplet setup where do you put the “gigathread scheduler”? AFAIK graphics APIs don’t expose multiple graphics contexts and you wouldn’t want applications to attempt load balancing Direct3D commands across chiplets anyway.

My wild ass guess is that the graphics manifestation of chiplets will have a scheduler/IO chiplet surrounded by compute chiplets with the scheduler being a single point of entry for graphics work.

Well I think it'll be like A100, something that looks like a proto-MCM die which Bryan tweeted about earlier; two halves that communicate with each other, basically being a "prototype".

Also @xpea, how big is GB203? If it's what I suspect with an uneducated guess, is it 350mm2?

Bondrewd · Mar 19, 2024

Dangerman said:
True but if N4X is being used for GB202 there's probably not much difference anyways.

N3e is appreciably lower power at GPU voltage envelopes (you're not dumping 1.4v innit are you?).

Dangerman said:
that looks like a proto-MCM die which Bryan tweeted about earlier; two halves that communicate with each other, basically being a "prototype".

That's the easy part.
You need to maintain weirdly serial graphics context across multiple dies.
It might break (ok it WILL break in funny ways, wonder how many hacks will be necessary to cope with that).

Bondrewd · Mar 19, 2024

https://twitter.com/i/web/status/1770137268533002571

cute and chungus.
Spacing is yeah, it's definitely LSI.

pharma · Mar 19, 2024

Bondrewd said:
Compared to L2's of yore in general.
Even Ponte Vecchio (lol) had higher per-base L2 bandwidth (before their NOC falls apart) than H100.
MI300 is a whole different beast, L2 isn't LLC there, not comparable. Just like RDNA (and mobile GPUs in general) it has more cache levels.

Your kernels are sliced to sit inside the L1/shmem slab as much as possible, at least on Hopper. See Flashattention2 etc.
View attachment 11028
Remember, on H200 the far L2 partition will be ~same effective b/w as the HBM pile.

It's nominally practical BUT there are caveats.
It's just nicer (way nicer actually) than hitting limited allotments of 400G Ethernet.

For reference, the diagram was taken from the Chips and Cheese CDNA 3 architecture article.

Bondrewd · Mar 19, 2024

pharma said:
For reference, the diagram was taken from the Chips and Cheese CDNA 3 architecture article.

From the PVC article actually.
PVC is still the bane of my existence. I can't let it go.

pharma · Mar 19, 2024

Bondrewd said:
From the PVC article actually.
PVC is still the bane of my existence. I can't let it go.

Figures. Chips and Cheese almost confirms what you've stated. Didn't AMD provide their MI300X white paper yet?
Assuming what you are saying is true it's pretty certain AMD would have already spilled the beans (for sales prospects) unless things did not turn out as expected.

Bondrewd · Mar 19, 2024

pharma said:
Didn't AMD provide their MI300X white paper yet?

shader ISA manual yes (funnily, they gutted FP6/4 section out of it recently).
Whitepaper too but it's fluff.
They had an ISSCC session on MI300 but it has no NOC topology details, same as their Genoa/Bergamo ones (and it's also rewarmed marketing slides + USR detailed breakdown).

pharma · Mar 19, 2024

Bondrewd said:
shader ISA manual yes (funnily, they gutted FP6/4 section out of it recently).
Whitepaper too but it's fluff.
They had an ISSCC session on MI300 but it has no NOC topology details, same as their Genoa/Bergamo ones.

Well I doubt AMD is any different than Nvidia in terms of providing "fluff" to generate sales. It does seem to be pretty important information not to include in their marketing material.

Seanspeed · Mar 19, 2024

DegustatoR said:
With Blackwell GPUs, AI Gets Cheaper And Easier, Competing With Nvidia Gets Harder

If you want to take on Nvidia on its home turf of AI processing, then you had better bring more than your A game. You better bring your A++ game, several

www.nextplatform.com

And unofficial info is that there won't be any "premium" at all. But the last sentence is certainly true.

There is no 'unofficial info'. You're literally just posting pure, blind speculation. smh

xpea · Mar 19, 2024

Bondrewd said:
cute and chungus.
Spacing is yeah, it's definitely LSI.

What happened to the narrative that Nvidia was years behind AMD and that they have zero experience in Chiplets?
The truth is what I said: Nvidia has test chips of every single TSMC packaging tech and its iterations. A journey that started more than a decade ago. And BOOM! B100 proves it with a more effective chiplet design than weird extremist MI300X.
Don't get me wrong, MI300X is a marvel of packaging but it's too complex and too early. It's like AMD engineers wanted to show off and prove to the world that they are the best. One Nvidia engineer told me about MI300X packaging "they want to run before knowing how to walk". Nvidia did the right thing with B100. Split the cache on A100, optimize it with H100 and start with only 2 Chiplets on B100 with enough LSI BW to avoid any performance loss. Incremental innovation that WORKS.

That being said, R100 has some crazy stuff going on with its IOD that is hard to understand on the surface, even with the packaging diagram in front of you...

Bondrewd · Mar 19, 2024

xpea said:
What happened to the narrative that Nvidia was years behind AMD and that they have zero experience in Chiplets?

from that POV Apple was ahead of the universe since it shipped LSI in 2021.
I mean, AMD shipped FOEB in 2021 too.
Just not for d2d between MI200s, though they floated the idea.
So they still are.
Congrats.

xpea said:
optimize it with H100

They didn't, H100 is notorious for dogshit compute/L2 bandwidth ratio and don't you dare hit the far L2.

xpea said:
but it's too complex and too early

It's neither complex nor early.

xpea said:
B100 proves it with a more effective chiplet design than weird extremist MI300X.

It's neither weird nor extremist outside of XCD being maybe a bit too tiny.
Which they're gonna converge in MI400 anyway.

xpea said:
Incremental innovation that WORKS.

yea no shit, MALL is 3 years old, and AMD SoIC implementations are 2 years old.
You're welcome, it's all in MI300 ISSCC slide deck.

xpea said:
That being said, R100 has some crazy stuff going on with its IOD that is hard to understand on the surface, even with the packaging diagram in front of you...

N100 is next, it's quad-die, 830mm^2 each.
You're welcome.

xpea · Mar 19, 2024

Dangerman said:
Also @xpea, how big is GB203? If it's what I suspect with an uneducated guess, is it 350mm2?

I don't know yet which design is final, but N5/N4 has exceptional yields so 2 versions were on the table. One in your ballpark and one much bigger. The latter is mainly for the pro client market and can also address gaming if needed. Will know soon when I go back to Taiwan...

trinibwoy · Mar 19, 2024

Bondrewd said:
from that POV Apple was ahead of the universe since it shipped LSI in 2021.
I mean, AMD shipped FOEB in 2021 too.
Just not for d2d between MI200s, though they floated the idea.
So they still are.
Congrats.

They didn't, H100 is notorious for dogshit compute/L2 bandwidth ratio and don't you dare hit the far L2.

It's neither complex nor early.

It's neither weird nor extremist outside of XCD being maybe a bit too tiny.
Which they're gonna converge in MI400 anyway.

yea no shit, MALL is 3 years old, and AMD SoIC implementations are 2 years old.
You're welcome, it's all in MI300 ISSCC slide deck.

N100 is next, it's quad-die, 830mm^2 each.
You're welcome.

For those of us in the back, what are R100 and N100?

xpea · Mar 19, 2024

Bondrewd said:
from that POV Apple was ahead of the universe since it shipped LSI in 2021.
I mean, AMD shipped FOEB in 2021 too.
Just not for d2d between MI200s, though they floated the idea.
So they still are.
Congrats.

They didn't, H100 is notorious for dogshit compute/L2 bandwidth ratio and don't you dare hit the far L2.

It's neither complex nor early.

It's neither weird nor extremist outside of XCD being maybe a bit too tiny.
Which they're gonna converge in MI400 anyway.

yea no shit, MALL is 3 years old, and AMD SoIC implementations are 2 years old.
You're welcome, it's all in MI300 ISSCC slide deck.

N100 is next, it's quad-die, 830mm^2 each.
You're welcome.

Oh no again

I don't have time today to debunk your usual BS so quickly before I go to work:
1. Timing matters. In 2021 TSMC did not have enough packaging capacity to sustain Nvidia's needs. We are not talking about a few thousand units and peanuts AMD money here. It's real big boy business.

2. Ask yourself why H100 doesn't have the best L2 perf, then you will have your answer. It's logical in Nvidia's road to chiplets.

3. MI300X inter die BW is not enough and not optimal. They should have solved the BW problem with 2 dies first before jumping to a more complex topology. Same mistake as the PVC disaster.

4. LOL thanks, captain obvious, for telling us that a chip is designed a few years before hitting the market. Oh, the same logic applies to everybody. What's next? Water is wet?

5. You are wrong but I obviously can't say more.

You're welcome too and have a nice day

Bondrewd · Mar 19, 2024

trinibwoy said:
For those of us in the back, what are R100 and N100?

N100 is the B100 successor.
Even more Si spam, 4x 830mm^2 dies in quadrants.
No idea what comes after that, since that's where NV packaging commitments end for now.

xpea said:
In 2021 TSMC did not have enough packaging capacity to sustain Nvidia's needs.

Yea they did.
AMD wasn't even using TSMC, it was an ASE FOEB part (MI200, that is).

xpea said:
Ask yourself why H100 doesn't have the best L2 perf, then you will have your answer

The primary partition L2 was/is also slow.
You pipe into L1/shmem slab or you die.

xpea said:
LOL thanks, captain obvious, for telling us that a chip is designed a few years before hitting the market

?
AMD shipped MALL (the chungus SLC) in N21, a bit over 3 years ago in December '20.
SoIC was H1'22 with 5800X3D.

xpea said:
MI300X inter die BW is not enough and not optimal

Like twice of what B100 has.
It's enough.
Weird cope but w/ever.

DegustatoR · Mar 19, 2024

Seanspeed said:
There is no 'unofficial info'. You're literally just posting pure, blind speculation. smh

Sure. Pure, blind speculation.

NVIDIA Blackwell GPUs Cost Around $30K-$40K, $10 Billion Development Cost For The Fastest AI Chip On The Planet

NVIDIA's Blackwell will cost a hefty price for potential buyers, as the firm is estimated to pour several billion dollars in the project.

wccftech.com

Do note that this blurb on the part of WCCFTech:

marking a considerable bump from the Hopper generation

Is just more misinformation. $30-40K is inside the pricing range of the Hopper generation of products. B200 will probably get to the same price at the high end eventually.

trinibwoy said:
For those of us in the back, what are R100 and N100?

R100 is Rubin, the next DC μarch after Blackwell.
N100 was shown in some leaked slide IIRC but I think that was just a placeholder for R100 as in "Next-100".

Dangerman · Mar 19, 2024

xpea said:
I don't know yet which design is final, but N5/N4 has exceptional yields so 2 versions were on the table. One in your ballpark and one much bigger. The latter is mainly for the pro client market and can also address gaming if needed. Will know soon when I go back to Taiwan...

Was the 2nd version the one with more than 96 SMs (116? 108? SMs, was it)? And the 1st being the one in my ballpark w/ 96 SMs? I presume with 30% higher transistor density with Blackwell a 96SM GB203 would be slightly smaller than AD103, at maybe 350mm2.
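
A rough sketch of the arithmetic behind that 350mm2 guess, under loudly stated assumptions: AD103 is taken as roughly 378.6mm2 with 80 SMs (reference figures that are not from this thread), die area is assumed to scale linearly with SM count, and the rumoured 30% density gain is applied to the whole die. Real dies do not scale that cleanly (memory PHYs, L2 and IO shrink differently from logic), so the helper `estimate_die_area_mm2` is a hypothetical back-of-envelope tool, not a prediction.

```python
def estimate_die_area_mm2(base_area_mm2, base_sms, target_sms, density_gain):
    """Naive estimate: scale area by SM count, then divide by the density gain."""
    if base_sms <= 0 or density_gain <= 0:
        raise ValueError("base_sms and density_gain must be positive")
    return base_area_mm2 * (target_sms / base_sms) / density_gain


# Assumed reference point (not from the thread): AD103 at ~378.6 mm^2 with 80 SMs.
AD103_AREA_MM2 = 378.6
AD103_SMS = 80

# Rumoured inputs from the thread: 96 SMs and a 30% transistor density increase.
gb203_guess = estimate_die_area_mm2(AD103_AREA_MM2, AD103_SMS, 96, 1.30)
print(f"GB203 guess: {gb203_guess:.0f} mm^2")
```

With those inputs the estimate lands just under 350mm2, and doubling it gives roughly the 700mm2 figure floated for GB202. Changing `density_gain` or the SM count shows how sensitive the guess is to each rumour.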

Qesa · Mar 19, 2024

Bondrewd said:
Like twice of what B100 has.

MI300 has 4.8 TB/s bidirectional bisection bandwidth, that's less than half of B100's.
