Nvidia Blackwell Architecture Speculation

For example, if people wanted to see more architectural improvements with Blackwell, what do they mean specifically, and what do they hope that would have translated to in terms of the end product?

We’ve typically distilled architectural efficiency into perf/W or perf/mm², but both of those metrics are heavily influenced by process tech. I like to think of it in terms of “can we do better with the same number of transistors on the same node?”. It’s an academic thought at best since nobody evolves architectures that way. Until now :D

I’d like to think though that if Nvidia was forced to design a reticle limited chip on 4N from scratch they could do better than GB202. One problem with that idea is that the software side of things likely puts a limit on how much actual performance you can wring out of any hardware design. Better software and APIs alongside a complementary clean sheet hardware design will almost certainly put GB202 to shame. I don’t think we’re anywhere near peak efficiency.
 
I wonder if the rumors of a 600W TDP RTX 4090 were actually true, but Nvidia decided to dial it down because they already knew Blackwell wouldn't have any performance-per-watt gain over Lovelace and they needed headroom for a TDP-driven performance boost with the 5090.

A 4090 doesn’t power scale as well.
 
It’s an academic thought at best since nobody evolves architectures that way.
I’m not sure where that notion is coming from but it’s wide of the mark. Improving PPA iso-node is always a prime consideration. It’s especially true for GPUs since they tend to be N or N-1 leading node and expensive to manufacture, and it’s not guaranteed you’ll always want to or be able to use the next node.
 
That's fair, I missed that part... However, I wasn't shitposting at all, I was just being honest. If you go back a decade plus, we haven't seen uplifts this bad. To release a generation with almost zero IPC improvement? I can't recall the last time Nvidia released such a product that didn't come along with a huge price cut. I don't even want to delve into the price discussion as it's a massive distraction. Personally speaking, for me there's no justifiable reason for a new generation if you can't deliver a meaningful increase in IPC. The only exception to this may be node shrinks. As node shrinks get more difficult, more emphasis will be placed on the architecture to seek out performance gains. If this is the type of performance we can expect from Nvidia, then I'm not too hopeful at all. How many real node shrinks do we have left? Two to four, and then it's cooked? Each of those nodes will be extremely expensive.
I haven't seen any per-clock, per-SM comparisons of Ada vs Ampere; however, on a per-FLOP basis performance seems to be very close. Meanwhile, RDNA 3 only delivered a ~5% raster boost on a per-clock, per-CU basis (https://www.computerbase.de/artikel...nitt_rdna_3_kann_auch_deutlich_schneller_sein) and RDNA 2 actually saw a performance regression vs RDNA 1 under the same testing (https://www.computerbase.de/artikel...dna_2_vs_rdna_sind_die_cus_schneller_geworden).

Though even the 5% boost in RDNA 3 involved additional transistor investment, while Blackwell seems to be basically flat vs Ada, and still delivers both extra performance and features.

Edit: For comparison, the 7800 XT had ~5% more performance than the 6800 XT, for 5% more transistors, and that's being favourable, as the 6800 XT was using a cut-down chip.
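
For anyone who wants to reproduce that kind of per-clock, per-CU normalisation themselves, here's a minimal sketch; the clock, CU count and FPS values in it are hypothetical placeholders, not measured data.

```python
# Minimal sketch of a per-clock, per-CU normalisation like the ComputerBase
# RDNA comparisons linked above. All numbers below are hypothetical
# placeholders, not measured results.

def perf_per_clock_per_cu(avg_fps: float, clock_mhz: float, cus: int) -> float:
    """Throughput normalised to one CU running at 1 MHz (arbitrary units)."""
    return avg_fps / (clock_mhz * cus)

# Hypothetical: two parts locked to the same clock, with different CU counts.
old_gen = perf_per_clock_per_cu(avg_fps=100.0, clock_mhz=2000.0, cus=72)
new_gen = perf_per_clock_per_cu(avg_fps=90.0, clock_mhz=2000.0, cus=60)
print(f"per-clock, per-CU uplift: {new_gen / old_gen:.2f}x")
```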
 
I’m not sure where that notion is coming from but it’s wide of the mark. Improving PPA iso-node is always a prime consideration. It’s especially true for GPUs since they tend to be N or N-1 leading node and expensive to manufacture, and it’s not guaranteed you’ll always want to or be able to use the next node.

The notion is coming from the products that come to market. I’m sure you’re right about how it works internally, but that’s not what we see as consumers of gaming chips. How often have we seen a new generation of chips on the same node as the previous one?
 
So I made this table to look at how much raw performance was gained gen on gen relative to the spend on transistors. Hopefully it gives some perspective on the architectural gains per gen and where they come from.

GPU | Process | Memory | Transistors (B) | Transistor Count Gain | Performance Gain | Performance/Transistor Count Gain | Notes
GTX 770/GK104 | 28nm | GDDR5 | 3.54 | - | - | - | -
GTX 980/GM204 | 28nm | GDDR5 | 5.2 | 1.468927 | 1.52 | 1.034769 | cache increase
GTX 1080/GP104 | 16nm | GDDR5X | 7.2 | 1.384615 | 1.67 | 1.206111 | -
RTX 2070/TU106 | 12nm (16nm+) | GDDR6 | 10.8 | 1.5 | 1.15 | 0.766667 | Tensor RT
RTX 3070 Ti/GA104 | 8nm (Samsung 12nm+) | GDDR6X | 17.4 | 1.611111 | 1.64 | 1.017931 | -
RTX 4080 Super/AD103 | 5nm | GDDR6X | 45.9 | 2.637931 | 1.79 | 0.678562 | cache increase
RTX 5080/GB203 | 5nm | GDDR7 | 45.6 | 0.993464 | 1.1 | 1.107237 | -

The above table is just a rough look at how transistor spending paid off. I tried to pick the fullest implementation possible for each of the GPU dies mentioned and aimed for memory-width consistency.

The performance numbers were taken from TPU's review of each product, using the aggregate 1440p numbers over the previous generation. I've mentioned the process and memory technologies as those should be factored in. I've also put notes on the two gens where a lot of transistors were likely spent on cache and new functionality.

You can take from it what you will, but I'm guessing it will depend on what you feel the reality is in terms of the actual cost per transistor today and going forward, as well as the state of process node gains going forward.

Edit: Just to add, since this was asked: I picked those 04/03 dies because they had fully enabled consumer products. Since we are looking at transistor counts, this controls for utilisation, unlike say going with the 02 dies, which are all cut down to varying degrees.
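
If anyone wants to sanity-check the ratio columns, here's a minimal sketch that recomputes them from the transistor counts and the TPU gen-on-gen performance multipliers quoted in the table; only values already listed above go in, nothing else is assumed.

```python
# Recompute the table's derived columns from the quoted raw values:
# transistor counts (billions) and TPU 1440p gen-on-gen performance multipliers.

gpus = [
    ("GTX 770/GK104",         3.54, None),
    ("GTX 980/GM204",         5.2,  1.52),
    ("GTX 1080/GP104",        7.2,  1.67),
    ("RTX 2070/TU106",       10.8,  1.15),
    ("RTX 3070 Ti/GA104",    17.4,  1.64),
    ("RTX 4080 Super/AD103", 45.9,  1.79),
    ("RTX 5080/GB203",       45.6,  1.10),
]

for prev, cur in zip(gpus, gpus[1:]):
    name, xtors, perf_gain = cur
    xtor_gain = xtors / prev[1]          # transistor count gain vs previous gen
    print(f"{name}: {xtor_gain:.2f}x transistors, {perf_gain:.2f}x perf, "
          f"{perf_gain / xtor_gain:.2f}x perf per transistor spent")
```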
You should add power consumption figures. I recall that the impressive thing between Kepler and Maxwell was not only performance but also a big jump in power efficiency. A GTX 750 Ti 4GB, which didn't need an external power connector, was nearly as fast as the 650 Ti Boost (a salvaged GTX 660 that was less than 10% slower) while consuming half the energy. This wasn't due to more transistors, since GM107 had 1.87 billion transistors vs GK106's 2.5 billion! It was due to an architecture change to use tile-based rendering, AFAIK. Plus this was extended to all the other tiers, since the GTX 980 was slightly faster than the GTX 780 Ti while using 1/3 less power.
For this reason your table is actually wrong, since you didn't include GK110. That part has 7 billion transistors, while GM204 beats it with only 5.2 billion! The GK104 you are comparing against was not used for the *80 tier of the previous generation, which invalidates your analysis a bit since you are not really comparing the same performance tiers, just the chip codenames. The last few years have shown us that codenames mean nothing in terms of product positioning, so I don't think using them for this sort of comparison is viable anymore. There is too much nuance and you end up ignoring facts like the above.
Nvidia increased efficiency in an incredible way while keeping the same node in the transition to Maxwell, something we definitely do not see with Blackwell at all, outside of MFG.
 
I'm assuming by PPW you mean performance per watt? If so how would you isolate this from the process node?

If we want to look at historical examples of new architectures on the same node (or close to it), using TPU's energy efficiency numbers:

GTX 680->980, at 50%, is the only one with a significant PPW increase in gaming that you can, I guess, isolate purely to the architecture changes themselves (Maxwell did spend on cache as well).

GTX 1080->2070 (and this was still 16nm->12nm) had none.

RTX 4080 Super-> RTX 5080 has 10% (newer memory).

Outside of Nvidia, AMD with the 6600XT ->7600XT (7nm->6nm) had none.

5700XT->6900XT had 12% (likely power savings from cache).

This one is tricky because Tonga was only ever released as salvaged dies for DIY (from what I remember the best ones went to Apple), with no reference card. But going from either the 7850 or 7870 to the R9 285 actually shows a PPW regression, with the older architecture being 30% more efficient.

If we still want to use this comparison, going from the R9 290X reference to the Fury Nano (HBM, but both still 28nm) also showed 50% better PPW.
Yes I meant perf per watt. You would need two GPUs on the same node to have a proper comparison.

1080ti->2080ti 35%.

5700xt->6700xt 35%

Comparisons should be made with GPUs targeting as similar power envelopes as possible to be the most representative.

We don't have many examples due to the frequent silicon progress we have previously enjoyed. Now though, it seems architectural progress will have to be relied upon as manufacturing improvements are nearly at a dead end.
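
To make the perf/W bookkeeping in these comparisons explicit, here's a small sketch; the relative-performance and board-power figures are hypothetical placeholders rather than TPU measurements.

```python
# Sketch of the iso-node perf/W comparison being discussed: two GPUs on the
# same node, ideally in a similar power envelope. Numbers are hypothetical.

def perf_per_watt(relative_perf: float, board_power_w: float) -> float:
    return relative_perf / board_power_w

# Hypothetical: same-node successor at the same board power.
older = perf_per_watt(relative_perf=100.0, board_power_w=320.0)
newer = perf_per_watt(relative_perf=110.0, board_power_w=320.0)
print(f"iso-node perf/W gain: {newer / older - 1:.0%}")  # -> 10%
```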
 
Maxwell was a real banger. Forgot that was the same node. Makes Blackwell look even more suspect in comparison.
Not really. You can't expect designs to improve by as much with each generation since Maxwell was very much a one off thing which was never repeated again to the same degree by any vendor.
 
Not really. You can't expect designs to improve by as much with each generation since Maxwell was very much a one off thing which was never repeated again to the same degree by any vendor.
If you really believe that then you either believe that there is no other way to scale up now but through AI, or you believe that the only way to do that is by increasing power budgets massively (of which the 5090 is an example). Are we really at the NetBurst age of GPUs?
 
If you really believe that then you either believe that there is no other way to scale up now but through AI, or you believe that the only way to do that is by increasing power budgets massively (of which the 5090 is an example). Are we really at the NetBurst age of GPUs?
Scaling up has little to do with architectural improvements and is mostly a result of production process improvements. It is pretty clear that scaling is very much dying, if not dead already, so the natural expectation is that we'll have to wait longer for sizeable perf/price improvements. Don't see how this can be avoided right now. Also don't see any relation between Blackwell and NetBurst.

AI is mainstream now, everyone is looking into using it for graphical improvements. So that's one option - but you'll still need to scale up the AI part of the h/w for these improvements to be sizeable and this is limited by the same realities as shading. There are some obvious gains there (upscaling, etc) but beyond that it will likely hit the same wall.

The way I see it there are two possibilities: 1) Nvidia don't care about their business and thus won't put any effort into improving their GPU architectures. You tell me if that sounds correct to you.
2) Nvidia with all its R&D doesn't see any obvious way of improving their shading h/w beyond what they already have. Note that they are making improvements in scheduling, data format support, RT h/w, memory controllers, even data fetching - these come at low cost and don't show up in an apparent way in older s/w though. So saying that there are no improvements would basically be lying - it's just that they aren't bringing as high an impact as they used to. But that's the general way things are for the whole semi industry right now; Nvidia isn't any different to any other silicon maker in this.
 
Scaling up has little to do with architectural improvements and is mostly a result of production process improvements. It is pretty clear that scaling is very much dying, if not dead already, so the natural expectation is that we'll have to wait longer for sizeable perf/price improvements. Don't see how this can be avoided right now. Also don't see any relation between Blackwell and NetBurst.

AI is mainstream now, everyone is looking into using it for graphical improvements. So that's one option - but you'll still need to scale up the AI part of the h/w for these improvements to be sizeable and this is limited by the same realities as shading. There are some obvious gains there (upscaling, etc) but beyond that it will likely hit the same wall.

The way I see it there are two possibilities: 1) Nvidia don't care about their business and thus won't put any effort into improving their GPU architectures. You tell me if that sounds correct to you.
2) Nvidia with all its R&D doesn't see any obvious way of improving their shading h/w beyond what they already have. Note that they are making improvements in scheduling, data format support, RT h/w, memory controllers, even data fetching - these come at low cost and don't show up in an apparent way in older s/w though. So saying that there are no improvements would basically be lying - it's just that they aren't bringing as high an impact as they used to. But that's the general way things are for the whole semi industry right now; Nvidia isn't any different to any other silicon maker in this.
I don't think those are the only possibilities.

3 - With no competition they don't think they need to.
4 - It would be more expensive/cut into their margins.
5 - May require too much work on the driver front.
6 - Opposing tech ideas from different engineers resulting in a stalemate.
 
Scaling up has little to do with architectural improvements and is mostly a result of production process improvements. It is pretty clear that scaling is very much dying, if not dead already, so the natural expectation is that we'll have to wait longer for sizeable perf/price improvements.

We must have different ideas about what counts as architectural scaling, because the history of GPUs is full of examples. Going from dedicated pixel and vertex shader units to the unified shader model, which improved GPU utilisation by avoiding stalled units while also ushering in the age of GPU compute, is one of them. You could have avoided all that, continued with task-specific units and just used process improvements to increase performance. I doubt we would have arrived where we are today though. So I can't really understand your point of view that scaling up has little to do with architectural improvements. It's like you are denying the creativity and problem-solving work of thousands of engineers. After all, if only process improvements matter, why does Nvidia spend so much on R&D? By your assessment they just had to copy-paste an architecture, multiply it and adapt it to the new node. Job done. Clearly that's not how it works.

Don't see how this can be avoided right now. Also don't see any relation between Blackwell and NetBurst.
The relation is that, like you said below, we are probably hitting a wall. Intel thought their architecture could just scale up (ironically, much as you are implying above by saying architecture matters little) and they fell flat on their face. We never got the 10GHz CPUs they expected, and they had to change architecture towards a multi-core approach. AMD beat them to it, but later on they came out with a brand new architecture to overcome "the wall". Do you see now the parallel I was drawing? Technically they are very different things of course, but the situation seems similar.

AI is mainstream now, everyone is looking into using it for graphical improvements. So that's one option - but you'll still need to scale up the AI part of the h/w for these improvements to be sizeable and this is limited by the same realities as shading. There are some obvious gains there (upscaling, etc) but beyond that it will likely hit the same wall.
And yet you claimed above that scaling has little to do with architecture. Using AI frame generation to save on traditional graphics rendering hardware is an architectural solution, and one that seems more efficient than adding more traditional graphics hardware, as long as it works well.

The way I see it there are two possibilities: 1) Nvidia don't care about their business and thus won't put any effort into improving their GPU architectures. You tell me if that sounds correct to you.
2) Nvidia with all its R&D doesn't see any obvious way of improving their shading h/w beyond what they already have. Note that they are making improvements in scheduling, data format support, RT h/w, memory controllers, even data fetching - these come at low cost and don't show up in an apparent way in older s/w though. So saying that there are no improvements would basically be lying - it's just that they aren't bringing as high an impact as they used to. But that's the general way things are for the whole semi industry right now; Nvidia isn't any different to any other silicon maker in this.
I think you need to read my post again. You were the one saying that we can't expect big architectural improvements, not me. This bit of your answer seems targeted at your own comment, not mine. I never said I don't expect architecture improvements, I clearly do!
 
Any idea that there’s little “architectural” performance scaling left in the arithmetic core of the GPU, to the point where it’s dying or dead and we’ve hit a scaling wall solvable only by the manufacturing process alone, is completely and utterly false. Not demonstrating any is a choice, not the result of some fundamental limit.
 
Not really. You can't expect designs to improve by as much with each generation since Maxwell was very much a one off thing which was never repeated again to the same degree by any vendor.

Well it hasn’t happened again because new generations usually get new nodes. Blackwell was a rare opportunity for Nvidia to flex their architecture chops on an old node and they didn’t.
 
MOD MODE: No, we aren't doing personal swipes. If what you have to say is about the person, then don't say it. I'm handing out infractions for those who do. Discuss the topic at hand, not the other responders.
 
A 4090 doesn’t power scale as well.
The actual 4090 that was released doesn't scale as well through overclocking alone, but I think it's plausible that at some point Nvidia considered a higher TDP 4090 or 4090 Ti that not only had higher clocks, but also a fully-enabled AD102 die, and maybe even extra-fast 24Gbps GDDR6X.
 
Well it hasn’t happened again because new generations usually get new nodes. Blackwell was a rare opportunity for Nvidia to flex their architecture chops on an old node and they didn’t.
From a purely "chip and/or transistor architecture" level, I'm on board with this take. I feel it's still worth reminding folks that the 5090 outpaces the 4090 by more than the increase in transistor and power budget when targeting extreme workloads; consumers running 1080p and 1440p might as well skip this generation's halo purchase. And if you couldn't care less about ray tracing, AI, or raw GPU compute, it's probably not worth the purchase either. I think all of us understand the market for these is pretty small, precisely because the card isn't of much use outside those narrow use cases.

Quite separately, I hope we can all agree NVIDIA's cooling and PCB engineering teams pulled off a pretty solid flex this go-round. That might be a tacit recognition that the chip design itself didn't wow the critics, and it might also be a nod to a future where they will need to burn even more power to keep performance moving in an ever-upwards direction.
 