Nvidia Blackwell Architecture Speculation

When we asked how DLSS 4 multi frame generation works and whether it was still interpolating, Jensen boldly proclaimed that DLSS 4 "predicts the future" rather than "interpolating the past." That drastically changes how it works, what it requires in terms of hardware capabilities, and what we can expect in terms of latency.
Extrapolation?
 
That’s an interesting article. It also suggests only 50 series can run neural shaders, which would be a death knell for feature adoption
It's why I'm strongly considering not getting a 5090 and instead waiting for the 60 series. It will be years before these features are broadly adopted in games (if they ever are), and the 4090 and 5090 will remain overkill on VRAM for the rest of this generation, so there's no real necessity.
 
That’s an interesting article. It also suggests only 50 series can run neural shaders, which would be a death knell for feature adoption
That makes sense though. Neural shaders with forward shading mean you need a proper interface to the tensor cores, one suitable for calls from groups of just 4 threads rather than full warps.

Older Nvidia generations would not be able to use the tensor cores properly from a fragment shader.

Funny though, RDNA should not have any issues with neural shaders at all; if anything, Blackwell should behave much closer to it now.
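For context on the warp-granularity point: the standard pre-Blackwell tensor core path in CUDA (the WMMA API) is a warp-collective operation, so all 32 threads must reach the MMA together, which is a poor fit for fragment shaders that naturally run in 2x2 quads. A minimal sketch of that full-warp contract (the 16x16x16 tile shape is just the standard example, nothing Blackwell-specific):

```cuda
#include <mma.h>
using namespace nvcuda;

// All 32 lanes of a warp must execute these calls together: WMMA fragments
// are collectively owned by the whole warp, so there is no way to issue an
// MMA from just a 2x2 quad of fragment-shader threads on this hardware.
__global__ void warp_mma(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);           // zero the output tile
    wmma::load_matrix_sync(fa, a, 16);       // warp-wide collective load
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);          // warp-wide collective MMA
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

The `_sync` suffix is the point: behaviour is only defined when the whole warp is converged, which is exactly the mismatch with quad-granularity pixel work.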
 
Hopefully there's a decent whitepaper at some point, but I'm looking forward to seeing the launch details that media outlets are able to get their hands on.
 
Yeah, well, they could've done shading and tensor operations "at the same time" since Ampere. It's usually impossible due to register file and memory bandwidth limits though, not because of how the SM is designed. So it's unclear what has changed in Blackwell. Maybe they've moved the tensor ALUs inside the main shading SIMDs and they're all controlled by the same logic now? That would be wild, as it seems to be how AMD has implemented its AI h/w in RDNA4, and it would actually be a step backwards from how tensor h/w has been built into Nvidia GPUs since Volta.
H100 introduced asynchronous warpgroup MMA. A single thread within a warpgroup (= 4 warps, 128 threads) submits a batch of tensor operations. The tensor cores then pull data out of shared memory (and, optionally, registers) and write the result to registers across the whole warpgroup. This happens asynchronously: threads can go on doing other work until they require the MMA results, while the tensor cores use any "spare" shared memory and register bandwidth to do their work. I would expect Blackwell to adopt (in fact, to expand) this model.
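A rough skeleton of that flow in sm_90a PTX terms, as I understand it (the actual wgmma.mma_async issue and its shared-memory descriptor setup are omitted, since they depend on tile shape and swizzle; only the fence/commit/wait protocol is shown):

```cuda
// Sketch of the Hopper warpgroup-MMA submission protocol (sm_90a assumed).
__device__ void warpgroup_mma_sketch() {
    // 1. Make prior register/shared-memory writes visible to the tensor cores.
    asm volatile("wgmma.fence.sync.aligned;" ::: "memory");

    // 2. The async MMAs themselves would be issued here, e.g.
    //    wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 ...
    //    Operands come from shared-memory descriptors; results land in
    //    registers spread across all 128 threads of the warpgroup.

    // 3. Close the batch of issued MMAs into one group.
    asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");

    // ...threads are free to do unrelated ALU work here while the
    // tensor cores chew through the group in the background...

    // 4. Block only when the results are needed (0 = drain all groups).
    asm volatile("wgmma.wait_group.sync.aligned 0;" ::: "memory");
}
```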
 
That makes sense though. Neural shaders with forward shading mean you need a proper interface to the tensor cores, one suitable for calls from groups of just 4 threads rather than full warps.

Older Nvidia generations would not be able to use the tensor cores properly from a fragment shader.

Funny though, RDNA should not have any issues with neural shaders at all; if anything, Blackwell should behave much closer to it now.
Well, strictly speaking, nothing stops Nvidia from running such workloads the same way RDNA does right now; it would probably just be too slow to be usable. The convergence between Blackwell and RDNA is happening with RDNA4 specifically, which isn't exactly generic "RDNA" and means the exact same thing for "neural shaders" compatibility on AMD as it does for Nvidia.
 
The more you look into the specs, the more obvious it is that Nvidia is trying to pull a fast one. Comparing the 4070 Super to the 5070: the 4070 Super has less bandwidth but 17% more CUDA cores and 17% more RT cores. Other than MFG, which isn't looking too hot at the moment with all the artifacts it has, will the 5070 even beat the 4070 Super? I'm not so sure. The 5080 also looks to be barely faster than a 4080 Super if we just look at the specs, and the same appears to be true when comparing the 5070 Ti to the 4070 Ti Super. This might just be the worst generational increase I've ever seen, and it's no surprise that it comes as a result of no competition. We'll see what the real-world non-MFG performance looks like shortly. I personally can't wait.
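For what it's worth, here's the arithmetic behind those percentages, using the announced spec-sheet numbers (core counts and bandwidth figures are my own look-up, so double-check them):

```cuda
#include <cstdio>

// Announced specs (assumed from public listings; verify):
//   RTX 4070 Super: 7168 CUDA cores / 56 RT cores, 504 GB/s (192-bit GDDR6X)
//   RTX 5070:       6144 CUDA cores / 48 RT cores, 672 GB/s (192-bit GDDR7)
int main() {
    const double cores_4070s = 7168.0, cores_5070 = 6144.0;
    const double bw_4070s    = 504.0,  bw_5070    = 672.0;  // GB/s

    printf("4070S: +%.0f%% CUDA/RT cores vs 5070\n",
           (cores_4070s / cores_5070 - 1.0) * 100.0);       // ~17%
    printf("5070:  +%.0f%% bandwidth vs 4070S\n",
           (bw_5070 / bw_4070s - 1.0) * 100.0);             // ~33%
    return 0;
}
```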
 
In these Far Cry 6 benchmarks from nVidia, Blackwell is ~33% faster than the non-Super Lovelace cards. A 5070 Ti would be ~10% slower than a 4090.
 
That 27% and 40-something percent uplift for Far Cry and A Plague Tale tells me that scaling outside of node shrinks is dead and buried.

The 5090 consumes like 25% more energy and has 70% more memory bandwidth, and that's the percentage improvement?
 
The 5070 is beating the 4070 Ti in Nvidia-provided benchmarks.


Which is why nobody should look at the specs to figure out the performance.
Yea, let's wait for independent benchmarks. The Nvidia-provided benchmarks are filled with caveats and often compare unlike things; even Intel provides better benchmarks. For those of us not interested in MFG, there's certainly reason for concern, especially as it relates to real performance gains in RT and raster compared to the Super line.

That 27% and 40-something percent uplift for Far Cry and A Plague Tale tells me that scaling outside of node shrinks is dead and buried.

The 5090 consumes like 25% more energy and has 70% more memory bandwidth, and that's the percentage improvement?
That's for the 5090, and it has significantly more CUDA cores, more bandwidth, more RT cores, etc. For the other GPUs, when compared to the Super line, there is barely any improvement in base specs, clock speeds, etc. Very suspect benchmarks released by Nvidia... Very suspect.
 
nVidia claims that their notebook versions are 40% more efficient, and they use basically the same configuration outside of GDDR7.
"For GeForce RTX 50 Series laptops, new Max-Q technologies such as Advanced Power Gating, Low Latency Sleep, and Accelerated Frequency Switching increases battery life by up to 40%, compared to the previous generation."

Up to 40%. I don't have much hope for energy efficiency; I wouldn't be surprised if, at the same core count and the same frequency, they end up very close.
 
The 5090 consumes like 25% more energy and has 70% percent more memory bandwidth and that's the percentage improvement?
FC6 is most definitely CPU-limited on the 5090, since it shows a higher gain for the 5080 vs the 4080, which makes zero sense otherwise.
APTR is a more GPU-limited game, so +40% is the more likely average result for the 5090 vs the 4090. Considering that we're looking at a +30% or so FP32 change between the 4090 and 5090, this seems like a solid enough gain really.
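Putting rough numbers on that (boost clocks, bandwidth, and TDPs below are the announced figures as I understand them, so treat them as assumptions):

```cuda
#include <cstdio>

// Announced specs (assumed; verify against the shipping products):
//   RTX 4090: 16384 FP32 lanes @ ~2.52 GHz boost, 1008 GB/s, 450 W
//   RTX 5090: 21760 FP32 lanes @ ~2.41 GHz boost, 1792 GB/s, 575 W
int main() {
    const double fp32_4090 = 16384 * 2.52e9 * 2.0;  // 2 FLOPs per FMA
    const double fp32_5090 = 21760 * 2.41e9 * 2.0;

    printf("FP32:      +%.0f%% (%.1f -> %.1f TFLOPS)\n",
           (fp32_5090 / fp32_4090 - 1.0) * 100.0,
           fp32_4090 / 1e12, fp32_5090 / 1e12);                       // ~ +27%
    printf("Bandwidth: +%.0f%%\n", (1792.0 / 1008.0 - 1.0) * 100.0);  // ~ +78%
    printf("Power:     +%.0f%%\n", (575.0 / 450.0 - 1.0) * 100.0);    // ~ +28%
    return 0;
}
```

So the quoted "70%" bandwidth and "25%" power deltas are in the right ballpark, and raw FP32 comes out around +27%.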
 
GB202 in the RTX 5090 uses 16x 2 GB Samsung GDDR7 modules. The GPU entered production in the week of September 17, 2024.
The Nvidia Drive AGX Thor uses a Blackwell GPU with an Arm Neoverse V3AE CPU and Micron memory.


Also, we might learn more about architectural details after tonight's deep dive.

 