Nvidia Blackwell Architecture Speculation

I wonder if the rumors of a 600W TDP RTX 4090 were actually true, but Nvidia decided to dial it down because they already knew Blackwell wouldn't have any performance-per-watt gain over Lovelace and they needed room to boost the 5090's performance through a higher TDP.
 
So I made this table to look at how much raw performance was gained gen on gen relative to the spend on transistors. Hopefully it gives some perspective on the architectural gains per generation and where they come from.

| GPU | Process | Memory | Transistors (Billions) | Transistor Count Gain | Performance Gain | Performance/Transistor Count Gain | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GTX 770/GK104 | 28nm | GDDR5 | 3.54 | | | | |
| GTX 980/GM204 | 28nm | GDDR5 | 5.2 | 1.468927 | 1.52 | 1.034769 | cache increase |
| GTX 1080/GP104 | 16nm | GDDR5X | 7.2 | 1.384615 | 1.67 | 1.206111 | |
| RTX 2070/TU106 | 12nm (16nm+) | GDDR6 | 10.8 | 1.5 | 1.15 | 0.766667 | Tensor RT |
| RTX 3070 Ti/GA104 | 8nm (Samsung 12nm+) | GDDR6X | 17.4 | 1.611111 | 1.64 | 1.017931 | |
| RTX 4080 Super/AD103 | 5nm | GDDR6X | 45.9 | 2.637931 | 1.79 | 0.678562 | cache increase |
| RTX 5080/GB203 | 5nm | GDDR7 | 45.6 | 0.993464 | 1.1 | 1.107237 | |

The above table is just a rough look at the impact of transistor spending. I tried to pick the fullest implementation available for each of the GPU dies mentioned and tried to keep the memory widths consistent.

The performance numbers were taken from TPU's review of each product, using the aggregate 1440p numbers relative to the previous generation. I've listed the process and memory technologies since those should be factored in. I've also put notes on the two generations where a lot of transistors were likely spent on cache and new functionality.
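In case anyone wants to check or extend the ratio columns, here's a minimal sketch of how they fall out of the raw numbers; it only uses the transistor counts and the TPU 1440p multipliers quoted in the table above, nothing else.

```python
# Sketch: derive the gain columns from the transistor counts (billions)
# and the TPU 1440p gen-on-gen performance multipliers quoted above.
gpus = [
    ("GTX 770/GK104",         3.54, None),
    ("GTX 980/GM204",         5.2,  1.52),
    ("GTX 1080/GP104",        7.2,  1.67),
    ("RTX 2070/TU106",       10.8,  1.15),
    ("RTX 3070 Ti/GA104",    17.4,  1.64),
    ("RTX 4080 Super/AD103", 45.9,  1.79),
    ("RTX 5080/GB203",       45.6,  1.10),
]

prev_transistors = None
for name, transistors, perf_gain in gpus:
    if prev_transistors is not None and perf_gain is not None:
        t_gain = transistors / prev_transistors   # transistor count gain
        ratio = perf_gain / t_gain                # performance per transistor spent
        print(f"{name:22s}  t_gain={t_gain:.3f}  perf={perf_gain:.2f}  perf/t={ratio:.3f}")
    prev_transistors = transistors
```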

You can take from it what you will, but I'm guessing that will depend on what you feel the reality is in terms of the actual cost per transistor today and going forward, as well as the state of process node gains going forward.

Edit: Just to add, since this was asked: I picked those 04/03 dies because they had fully enabled consumer products. Since we are looking at transistor counts, this isolates for utilization, unlike going with the 02 dies, which are all cut down to varying ratios.
 
But would you have been happy with higher prices but more performance on the same node?

I went into this in a previous post, but I think people have the wrong historical impression of improvements on the same node. If we look at the previous two times Nvidia stayed on the same node, Turing and Maxwell, transistor counts still increased significantly in both cases. With Blackwell, the only die that looks like it will increase transistor count to any meaningful degree is GB202. The rest of the stack seems to stay the same, and being able to extract 10% (plus some other improvements) does seem fairly in line with what happened with Kepler -> Maxwell.

It's worth mentioning that better performance per transistor spent isn't a given either. There are examples of regressions.
It's fascinating how much better Maxwell was than Kepler on the same node. The GTX 980 (GM204) is ~70% faster than the GTX 680 (GK104), and the 980 Ti (GM200) is like 60% better than the 780 (GK110). And I think the 980 used less power than the 680 😲 and GM204 was only 30% bigger than GK104.
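To put rough numbers on that last point, here's a quick back-of-the-envelope using only the approximate figures quoted above (~70% faster, ~30% bigger die), not measured die data:

```python
# Back-of-the-envelope: Maxwell vs Kepler performance per unit of die area,
# using the approximate figures quoted above (~70% faster, GM204 ~30% bigger).
perf_gain = 1.70   # GTX 980 vs GTX 680, approximate
area_gain = 1.30   # GM204 vs GK104 die area, approximate
print(f"Perf per mm^2 gain: {perf_gain / area_gain:.2f}x")   # ~1.31x
```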
 
While improved, it still has noticeable artifacts, ghosting, etc. I tested it in Cyberpunk and it took me less than 10 seconds to spot some issues.
Again, native TAA will give you these same artifacts, only doubled, and at worse performance. You also could be testing this thing wrong.
By the time this is relevant and has achieved mass adoption, RTX 7000 series or 8000 series will be out.
Disagree, these are seeing initial availability from the first month. I expect accelerated adoption for these two in particular.
DirectX means long lead times/uptake. Until all GPU vendors can support this, it's going nowhere fast.
DXR has proven to be the exception to that rule. Games and engines started adopting it with the support of only one vendor, and with the introduction of a fresh architecture with zero install base. Contrast that with today: all of the new features introduced with Blackwell (including Neural Rendering) will benefit from a huge install base of all the RTX GPUs out there.

Also, Neural Rendering has many sub-elements already deployed in the field, like Neural Radiance Cache and Neural Skin.
So I made this table to look at how much raw performance was gained gen on gen relative to the spend on transistors.
Why not make it with the biggest dies?
 
But would you have been happy with higher prices but more performance on the same node?

For me personally I don't know about happy but I would be “happier”. I'm less sensitive to price than I am to performance & features & experience. I'm not looking to get ripped off, but at the end of the day the purchase is a one-time transaction while ownership is an extended experience.

So a $2500 5090 that’s 50% faster than the 4090 is more exciting to me than a $2000 5090 that’s only 30% faster.

I went into this in a previous post, but I think people have the wrong historical impression of improvements on the same node.

Yeah it was definitely wishful thinking hoping for an architecture renaissance that would bear fruit on the same node. Maybe the hardware is there and software will catch up one day to justify Blackwell’s existence.
 
Why not make it with the biggest dies?

Those dies I picked all had full implementations in terms of consumer products.

The problem with comparing the largest ones is that, in terms of consumer products, they were mostly cut down, and to varying ratios. This makes comparison tricky if we are looking at transistor counts, due to what would effectively be varying utilization rates.

Now, admittedly, that table doesn't account for power usage, and normalizing for that would be problematic as well. But at the least it tries to isolate as much as possible.
 
For me personally I don't know about happy but I would be “happier”. I'm less sensitive to price than I am to performance & features & experience. I'm not looking to get ripped off, but at the end of the day the purchase is a one-time transaction while ownership is an extended experience.

So a $2500 5090 that’s 50% faster than the 4090 is more exciting to me than a $2000 5090 that’s only 30% faster.



Yeah it was definitely wishful thinking hoping for an architecture renaissance that would bear fruit on the same node. Maybe the hardware is there and software will catch up one day to justify Blackwell’s existence.

I was referring to GB203 and the RTX 5080. Hypothetically, let's say you give it another 15% to match the RTX 4090, plus 24GB, but price it at $1400.

GB202 is likely up against the reticle limit. I'd assume the only ways to significantly improve performance would have been to either move to a new node or maybe go chiplet via advanced packaging, and I can only imagine how much of either would be passed on in terms of price.
 
While improved, it still has noticeable artifacts, ghosting, etc. I tested it in Cyberpunk and it took me less than 10 seconds to spot some issues.

By the time this is relevant and has achieved mass adoption, RTX 7000 series or 8000 series will be out.

DirectX means long lead times/uptake. Until all GPU vendors can support this, it's going nowhere fast.

It's a very disappointing generation, and I say this as someone who was initially hyped to potentially upgrade to the 5090. After I saw the combination of $2000 USD, 575 watts, and a 33% performance improvement over the 4090, I was out. I'll just hold on to my 4080 Super and wait till the next node shrink.
It’s a good thing too because the stock is shockingly bad - like about a tenth of the 4090 at launch based on the MC numbers.
 
Also:
So a $2500 5090 that’s 50% faster than the 4090 is more exciting to me than a $2000 5090 that’s only 30% faster.
The 5090 is effectively $2500, or close to it. Nearly all the cards are priced well above the FE, which is always hard to get (and for this launch virtually nothing is available).
 
WRT the discussion on architectural improvements, is the opinion of some here that this is an essentially solved problem?
Well, they still have the best architecture on the market, and they have improved it in several ways while maintaining or improving perf/watt.
The thing also seems to clock pretty high when given the power, which bodes well for any possible node change down to N3 and N2 in the future.
There is no indication that this can be improved any further on the architectural level; if anything, Blackwell suggests that it can't.
 
WRT the discussion on architectural improvements, is the opinion of some here that this is an essentially solved problem?

In terms of my perspective on this issue I'm just going to clarify.

The feeling I'm getting when architecture is brought up is that some people might be under the impression that those improvements are free, or even pay for themselves in terms of unit cost, and only require what is essentially fixed R&D. This is why I looked back at the numbers, and historically I'm not seeing anything that supports that impression. If anything, Blackwell does seem to have those types of architectural improvements, as you're getting a better product, including more raw performance, at slightly better transistor efficiency.

With that, I guess what I'm saying is that people might need to be specific about what they mean by architectural improvements, and that it shouldn't be assumed that getting, say, more performance per SM per clock comes for "free" outside of R&D.

For example, if people wanted to see more architectural improvements with Blackwell, what do they mean specifically, and what do they hope that would have translated to in terms of the end product?

In terms of whether or not it's a solved problem: no, in the sense that GPUs are going to keep improving the more transistors are allocated to them in a design.
 
I would guess that a lot of the architectural gains to be made now come from optimizing scheduling and different bottlenecks like register pressure. I know Apple has that Dynamic Caching feature that can adjust the size of the register file as needed while shaders are running to try to alleviate register pressure. I know on the CPU side a lot of it is micro-op scheduling and changing how ISA instructions are broken down into micro-ops, or cache changes. The wizardry of the actual layouts to minimize power and improve signaling I don't know anything about, even at the level of a CPU, so I'm not sure what to really expect there. I would actually really like to see some detailed architectural breakdowns of Blackwell vs the competitors to see if they're all converging toward the same ideas, or if there are really significant differences left. I imagine Nvidia will have to go chiplet at some point as well.
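As a rough illustration of the register-pressure point, here's a toy model; the per-SM limits below are placeholder values picked to look typical, not any specific GPU's specs:

```python
# Toy model of register pressure: with a fixed register file per SM,
# heavier shaders leave fewer warps resident to hide memory latency.
# Limits are placeholders chosen to look typical, not a real GPU's specs.
REGISTERS_PER_SM = 65536   # assumed 32-bit registers per SM
MAX_WARPS_PER_SM = 48      # assumed scheduler limit
WARP_SIZE = 32

def resident_warps(regs_per_thread: int) -> int:
    """Warps that fit before the register file runs out."""
    warps_by_registers = REGISTERS_PER_SM // (regs_per_thread * WARP_SIZE)
    return min(warps_by_registers, MAX_WARPS_PER_SM)

for regs in (32, 64, 128, 255):
    print(f"{regs:3d} regs/thread -> {resident_warps(regs):2d} resident warps")
```

A feature like the Dynamic Caching mentioned above would, in principle, try to avoid budgeting for the shader's worst-case register count the whole time it runs.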
 
Again, native TAA will give you these same artifacts, only doubled, and at worse performance. You also could be testing this thing wrong.
Ask around, I’m a first ballot hall of fame TAA hater. The new DLSS model being better than TAA is not an accomplishment at all to me. I automatically have a negative bias against TAA and all its derivatives. Keep that in mind when discussing DLSS or any other TAA derivatives with me.
Disagree, these are seeing initial availability from the first month. I expect accelerated adoption for these two in particular.
DXR has proven to be the exception to that rule. Games and engines started adopting it with the support of only one vendor, and with the introduction of a fresh architecture with zero install base. Contrast that with today: all of the new features introduced with Blackwell (including Neural Rendering) will benefit from a huge install base of all the RTX GPUs out there.

Also, Neural Rendering has many sub-elements already deployed in the field, like Neural Radiance Cache and Neural Skin.
What you expect and what will actually happen are clearly not the same thing. If history is to be believed, there will be no mass adoption of neural rendering or any of these other features in the near future. You can save this comment. Keep in mind that I define mass adoption to be >50% of all new games implementing these….
 
In terms of my perspective on this issue I'm just going to clarify.

The feeling I'm getting when architecture is brought up is that some people might be under the impression that those improvements are free, or even pay for themselves in terms of unit cost, and only require what is essentially fixed R&D. This is why I looked back at the numbers, and historically I'm not seeing anything that supports that impression. If anything, Blackwell does seem to have those types of architectural improvements, as you're getting a better product, including more raw performance, at slightly better transistor efficiency.

With that, I guess what I'm saying is that people might need to be specific about what they mean by architectural improvements, and that it shouldn't be assumed that getting, say, more performance per SM per clock comes for "free" outside of R&D.

For example, if people wanted to see more architectural improvements with Blackwell, what do they mean specifically, and what do they hope that would have translated to in terms of the end product?

In terms of whether or not it's a solved problem: no, in the sense that GPUs are going to keep improving the more transistors are allocated to them in a design.

There are certainly many ways to approach what constitutes architectural improvements, but we could distill it down to PPW for the purposes of this discussion. I would say that has been the metric by which we have judged it for over a decade now.

I don’t know how much scope there is for improvement at this point, and it is possible that Nvidia already has the optimal approach. But there are legitimate reasons why things could be very far from optimal and Nvidia still chooses to stick with the current approach.
 
But there are legitimate reasons why things could be very far from optimal and Nvidia still chooses to stick with the current approach.
The 5080 is at the top of the perf/watt charts though. If they hadn't tried to hit as high a performance as possible with the 5090 and had just fit it into the same 450W as the 4090, it would likely be there too. So by that metric, Blackwell actually shows more advances than it does on pure performance.
 
There are certainly many ways to approach what constitutes architectural improvements, but we could distill it down to PPW for the purposes of this discussion. I would say that has been the metric by which we have judged it for over a decade now.

I'm assuming by PPW you mean performance per watt? If so, how would you isolate this from the process node?

If we want to look at new architectures on the same node (or close to it) from history, using TPU's energy efficiency numbers:

GTX 680 -> 980, at 50%, is the only one that had a significant PPW increase in gaming purely from what you can isolate to the architecture changes themselves (Maxwell did spend on cache as well).

GTX 1080 -> 2070 (and this was still 16nm -> 12nm) had none.

RTX 4080 Super -> RTX 5080 has 10% (newer memory).

Outside of Nvidia, AMD with the 6600 XT -> 7600 XT (7nm -> 6nm) had none.

5700 XT -> 6900 XT had 12% (likely power savings from cache).

Tonga is tricky because it was only ever released as salvaged dies for DIY (from what I remember, the best ones went to Apple), with no reference card. But going from either the 7850 or 7870 to the R9 285, it actually shows a PPW regression, with the older architecture being 30% more efficient.

If we want to make this kind of comparison with reference cards instead, the R9 290X reference to the Fury Nano (HBM, but both still 28nm) also had 50% more PPW.
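For clarity, the normalization behind those percentages is just relative performance divided by relative power draw; here's a minimal sketch with placeholder numbers (not TPU's measured data):

```python
# Sketch: generational perf/watt gain as (relative performance) / (relative power).
# Example numbers are placeholders, not measured figures from any review.
def ppw_gain(perf_ratio: float, power_ratio: float) -> float:
    """Perf/watt multiplier given perf and power ratios vs. the previous part."""
    return perf_ratio / power_ratio

# e.g. a hypothetical part that is 50% faster while drawing 10% more power
print(f"PPW gain: {ppw_gain(1.50, 1.10):.2f}x")   # ~1.36x
```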
 
I was referring to GB203 and the RTX 5080. Hypothetically, let's say you give it another 15% to match the RTX 4090, plus 24GB, but price it at $1400.

I think that would be more interesting and not surprising to be honest as that’s what people seemed to have been expecting in terms of price & performance. There weren’t too many bets on $1000.

The 5090 is effectively $2500, or close to it. Nearly all the cards are priced well above the FE, which is always hard to get (and for this launch virtually nothing is available).

And a $2500 MSRP would mean actual pricing well north of that.
 