Nvidia Blackwell Architecture Speculation

It’s wild that Counter-Strike saw the biggest increase though. So not bad but not amazing given the price increase. More than enough for an upgrade from my 3090.

This is probably not that surprising, because games with relatively simple shaders like Counter-Strike are the most likely to be memory-bandwidth limited, while games with more complex shaders are more likely to be compute limited.
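To put that in roofline terms, here's a minimal sketch of the idea, with made-up per-shader intensities and approximate published 4090 specs (purely illustrative, not measured):

```python
# Minimal roofline-style check: a pass is bandwidth-bound when its arithmetic
# intensity (FLOPs per byte moved) is below the GPU's compute/bandwidth ratio,
# and compute-bound above it.

def bound_by(flops_per_byte, peak_tflops, bandwidth_gbs):
    ridge = (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)  # FLOPs/byte at the ridge point
    return "bandwidth-bound" if flops_per_byte < ridge else "compute-bound"

# Approximate published RTX 4090 figures: ~82.6 TFLOPS FP32, ~1008 GB/s.
# The shader intensities below are invented purely for illustration.
simple_pass  = 10    # hypothetical: little math per byte fetched (CS2-style shading)
complex_pass = 200   # hypothetical: heavy math per byte (path-traced lighting)

print(bound_by(simple_pass, 82.6, 1008))   # -> bandwidth-bound
print(bound_by(complex_pass, 82.6, 1008))  # -> compute-bound
```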

We already know the 5090 only has ~30% more compute, so this is within expectations. Obviously the 5090 has more tensor-core throughput (mostly thanks to FP4 support), so it should be faster at running AI workloads. So, as I said before, if you already have a 4090 it's probably not worth upgrading to a 5090 unless you can sell the 4090 for a good price or have another computer that can use it. On the other hand, if you want to bet on the "AI future" of rendering then maybe a 5090 is not a bad idea, but since these things all take time to develop, I'd guess it will be a few years before those techniques are relatively common in games, and there will be better GPUs released by then.

As for performance per watt, it's also not surprising because the process nodes are basically the same (the 5090 uses a slightly better process, but the difference is probably within 10%). We'll get better processes in the future, of course, but the advances will be less pronounced, so it's probably not wise to expect too much, and that's why we'll likely see more AI rendering in the future.
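For reference, a quick back-of-envelope using the commonly listed specs (ballpark figures, not measurements):

```python
# Back-of-envelope on the "~30% more compute" figure, using commonly listed specs.
# Treat the numbers as approximate, not gospel.
specs = {
    "RTX 4090": {"fp32_tflops": 82.6, "bandwidth_gbs": 1008, "board_power_w": 450},
    "RTX 5090": {"fp32_tflops": 104.8, "bandwidth_gbs": 1792, "board_power_w": 575},
}

old, new = specs["RTX 4090"], specs["RTX 5090"]
for key in old:
    print(f"{key}: +{(new[key] / old[key] - 1) * 100:.0f}%")
# fp32_tflops: +27%, bandwidth_gbs: +78%, board_power_w: +28%
```

Compute and power move by roughly the same factor on roughly the same node, which is exactly why the perf/watt needle barely moves.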
 
As I wrote some time ago, this is troubling for the future, if it wasn't already clear. It costs 25% (if you can even get the FE editions, which aren't easily available) to 35%+ more than a 4090 while consuming ~25% more watts and having ~25% more cores.

And of course, all of that for ~25-35% more performance at 4K.

There is no efficiency uplift from the architecture itself, pretty much no perf/watt advancement. Nvidia has hit a wall. And that means the low end will continue to suck while the high end gets minor advancements that consumers will pay higher and higher prices for.

Outside of node shrinks (and those may be coming to an end), the future is bleak.
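Quick sanity check on those ratios (round numbers taken from the figures above, not measured data):

```python
# Rough efficiency math from the ~25% / ~30% figures quoted above.
perf  = 1.30   # ~30% more 4K performance than the 4090 (midpoint of 25-35%)
power = 1.25   # ~25% more power draw
price = 1.25   # +25% at FE MSRP; street pricing pushes this toward 1.35+

print(f"perf/watt vs 4090:   {perf / power:.2f}x")   # ~1.04x, i.e. essentially flat
print(f"perf/dollar vs 4090: {perf / price:.2f}x")   # ~1.04x at MSRP, below 1x at street prices
```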
Sadly, the advances in the efficiency area have been small. I love efficiency, and Nvidia is the best at it: if you compare an RTX 4060 to an RX 7600 and a B570, the Nvidia card consumes 40-something W less than the RX 7600 and 10-20 W less than the B570 while remaining very competitive in benchmarks.
 
The performance uplift is definitely "mid", but DLSS Transformer (TNN) is a great win.

NVIDIA is introducing a "Transformers" upscaling model for DLSS, which is a major improvement over the previous "CNN" model. I tested Transformers and I'm in love. The image quality is so good, "Quality" looks like native, sometimes better.
There is no more flickering or low-res, smeared-out textures on the horizon. Thin wires are crystal clear, even at sub-4K resolutions! You really have to see it for yourself to appreciate it; it's almost like magic.
The best thing? DLSS Transformers is available not only on GeForce 50, but on all GeForce RTX cards with Tensor Cores! While it comes with a roughly 10% performance hit compared to CNN, I would never go back to CNN.

 
TPU has a nice comparison table showing the performance of CNN and TNN models for the 5090 and 4090. As expected, the 4090 experiences a more significant performance hit when switching from CNN to TNN on DLSS Q/B/P presets.

If I'm reading this correctly, the perf hit for the new model is substantial on the 4090. 4090 goes from 78FPS to 67FPS at DLSS Quality 😲

My 4070 hopes they will get the cost down over time. But IDK if that's feasible.
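For what it's worth, expressed as frame time rather than FPS, the TPU numbers above work out to a roughly fixed per-frame cost; a minimal conversion:

```python
# Converting the drop above to frame time: 78 -> 67 FPS at DLSS Quality on the 4090.
def overhead_ms(fps_cnn, fps_tnn):
    return 1000 / fps_tnn - 1000 / fps_cnn

print(f"4090, DLSS Quality: +{overhead_ms(78, 67):.1f} ms per frame")  # ~ +2.1 ms
```

Assuming that cost stays roughly fixed per frame at a given output resolution, the percentage hit looks worse the higher your base frame rate is, even though the absolute cost is the same couple of milliseconds.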
 
TPU shows a 35% average uplift at 4K.

Let's not mislead here.

Yep, which doesn’t contradict anything I said. Wrong tree.

No improvement in performance per watt, and the impact of enabling RT on frame rates is similar to the 4090; no wonder they're pushing multi-frame generation so hard.

Yeah, none of the benchmarks so far show any improvement in RT efficiency. DF said A Plague Tale actually scaled better with RT off.

Seems like Nvidia can be really glad that AMD skipped this round's enthusiast-level chips.

This will be interesting to watch. If RDNA 4 is a significant improvement in performance/mm^2 this may be an exciting generation after all. The Turing derived architectures seem to be tapped out. What’s unclear is where the main bottlenecks lie. If it’s on the hardware side then there’s an opportunity here for AMD. If the issue is fundamentally inefficient APIs and software then AMD will run into the same challenges with a hypothetical big chip.
 
The Turing derived architectures seem to be tapped out.
Dunno about that. Aside from some really weird results, which prompt a look into the engines to see what's going on there, the 5090 seems to scale as you'd expect from a GPU with +30% FLOPs and +80% bandwidth. If anything it disproves the voices saying that the 4090 was memory-bandwidth limited. The only part which is a bit unexpected is the relatively weaker scaling in ray-traced titles. With this one we'll have to see if the new features will be of (bigger) help on Blackwell, though.
 
If I'm reading this correctly, the perf hit for the new model is substantial on the 4090. 4090 goes from 78FPS to 67FPS at DLSS Quality 😲

My 4070 hopes they will get the cost down over time. But IDK if that's feasible.
I think it might be better to use TNN B rather than CNN Q, not just for overall performance but also for IQ. The relative performance hit should be more significant on smaller Ada cards.
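A rough way to see why TNN Balanced can be a wash against CNN Quality on cost: the Balanced preset renders noticeably fewer pixels, which buys back some of the transformer overhead. Standard DLSS per-axis scale factors, with 4K output assumed for illustration:

```python
# Standard DLSS per-axis scale factors; 4K output assumed for this example.
presets = {"Quality": 2 / 3, "Balanced": 0.58, "Performance": 0.50}
out_w, out_h = 3840, 2160

for name, s in presets.items():
    w, h = round(out_w * s), round(out_h * s)
    share = w * h / (out_w * out_h) * 100
    print(f"{name}: {w}x{h} internal ({share:.0f}% of the output pixels)")
# Quality: 2560x1440 (44%), Balanced: 2227x1253 (34%), Performance: 1920x1080 (25%)
```

So TNN Balanced shades roughly a third fewer pixels than CNN Quality, which is where the extra model cost can come out in the wash; whether the IQ still holds up is the part worth checking per game.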
 
Dunno about that. Aside from some really weird results, which prompt a look into the engines to see what's going on there, the 5090 seems to scale as you'd expect from a GPU with +30% FLOPs and +80% bandwidth. If anything it disproves the voices saying that the 4090 was memory-bandwidth limited.
I guess it can be seen both ways. The way I read @trinibwoy 's comment is that Turing isn't getting more efficient; it's the same perf it always was, and all we're left with is adding more of it. And that seems to jibe with @DegustatoR 's comment about adding 30% more flops (or cores) and getting 30% more performance out. Which means Turing isn't tapped out in terms of its ability to scale; it looks like we can continue adding more Turing-based cores and get a linear speed improvement out of them...

I think what we were hoping for is a more efficient evolution of Turing, where somehow we could get more FLOPS per transistor or more frames per watt or something. It looks like power scaling is still good, at least according to the link @Man from Atlantis put up. The additional transistors seem to provide a double-digit increase in performance at the same power consumption level. I think this also lends credence to @DegustatoR 's point about the Turing arch still doing pretty darn well in terms of scalability.
 
I think it might be better to use TNN B rather than CNN Q, not just for overall performance but also for IQ. The relative performance hit should be more significant on smaller Ada cards.
I mentioned this in the dedicated Cyberpunk 2077 thread; I noticed I got the 2.2 update this morning auto-magically (thanks, Steam!) and I also picked up the newest drivers from the NVIDIA app. I now have the ability to select TNN vs CNN DLSS modes, so I'm gonna play with those at lunch today (in an hour) and I'll report back with my personal, qualitative stance on how they stack up together.
 
I guess it can be seen both ways. The way I read @trinibwoy 's comment is that Turing isn't getting more efficient; it's the same perf it always was, and all we're left with is adding more of it. And that seems to jibe with @DegustatoR 's comment about adding 30% more flops (or cores) and getting 30% more performance out. Which means Turing isn't tapped out in terms of its ability to scale; it looks like we can continue adding more Turing-based cores and get a linear speed improvement out of them...

I think what we were hoping for is a more efficient evolution of Turing, where somehow we could get more FLOPS per transistor or more frames per watt or something. It looks like power scaling is still good, at least according to the link @Man from Atlantis put up. The additional transistors seem to provide a double-digit increase in performance at the same power consumption level. I think this also lends credence to @DegustatoR 's point about the Turing arch still doing pretty darn well in terms of scalability.

Yeah, tapped out in terms of efficiency, not absolute performance. The limited impact of 80% more bandwidth is a bit of a red flag.
 
The performance drop on my trusty old 2060 laptop in Cyberpunk with the transformer model is nothing to sneeze at: from 88 FPS to 70 FPS in 1440p Performance mode. That is one of the worst RTX cards, though, and an old one, so that's probably the worst-case scenario. It looks very good though, and I'm thankful to have the option.
 
I guess it can be seen both ways. The way I read @trinibwoy 's comment is that Turing isn't getting more efficient; it's the same perf it always was, and all we're left with is adding more of it. And that seems to jibe with @DegustatoR 's comment about adding 30% more flops (or cores) and getting 30% more performance out. Which means Turing isn't tapped out in terms of its ability to scale; it looks like we can continue adding more Turing-based cores and get a linear speed improvement out of them...
Turing has gotten "more efficient" twice already though. Ampere is more efficient than Turing thanks to its expansion of FP32 math. And Lovelace is more efficient than Ampere thanks to a somewhat insane production process improvement.
With Blackwell we don't get either: both the process and the architecture are essentially the same as in Lovelace (some new features are added, but they are not automatic and have to be coded for). So not seeing any changes in efficiency here is to be expected. Well, I guess it is a bit unexpected that the architecture essentially remains the same. Also, there's clearly zero actual real-world gain from having twice the INT throughput per SM.
 
The performance drop on my trusty old 2060 laptop in Cyberpunk with the transformer model is nothing to sneeze at: from 88 FPS to 70 FPS in 1440p Performance mode. That is one of the worst RTX cards, though, and an old one, so that's probably the worst-case scenario. It looks very good though, and I'm thankful to have the option.

I'm hoping the DLL shows up on TechPowerUp soon and that SpecialK adds support for switching between CNN and transformer. I want to try it in a few games.
 
Turing has gotten "more efficient" twice already though. Ampere is more efficient than Turing thanks to its expansion of FP32 math. And Lovelace is more efficient than Ampere thanks to a somewhat insane production process improvement.
With Blackwell we don't get either: both the process and the architecture are essentially the same as in Lovelace (some new features are added, but they are not automatic and have to be coded for). So not seeing any changes in efficiency here is to be expected. Well, I guess it is a bit unexpected that the architecture essentially remains the same. Also, there's clearly zero actual real-world gain from having twice the INT throughput per SM.

There are probably very selective gains that might be reliant on new features like mega geometry. Not sure what changed in the RT cores other than being able to do intersections on clusters, but that requires the game to support mega geometry. That and things like being able to schedule neural shaders more efficiently, which we won't see until a game comes out using neural shaders to actually compare 40 series to 50 series.
 
There are probably very selective gains that might be reliant on new features like mega geometry. Not sure what changed in the RT cores other than being able to do intersections on clusters, but that requires the game to support mega geometry.
Yes, and thankfully the AW2 update with that should be coming soon, so we'll see whether Blackwell actually gets more out of it.

That and things like being able to schedule neural shaders more efficiently, which we won't see until a game comes out using neural shaders to actually compare 40 series to 50 series.
This one is likely far off though; I wouldn't expect that feature to be used until well into the 60 series or even the 70.
 