Nvidia Blackwell Architecture Speculation

Standouts:

In direct comparisons with previous-generation models, the RTX 5090 shows a 30-40% performance increase without DLSS 4
With DLSS on, gamers enjoy not only improved frame rates but also reduced latency in absolute milliseconds
Usage statistics indicate that over 80% of RTX players enable DLSS during gameplay, with a cumulative total of 3 billion hours of DLSS-enabled gaming
Double RT throughput on ray tracing units
Shader Execution Reordering (SER), introduced with Ada Lovelace, is said to run twice as fast on Blackwell as on its predecessor (see the OptiX sketch after this list)
The RT cores include a triangle cluster intersection engine designed specifically for handling Mega Geometry
Ray Tracing on Blackwell is expected to require 25 percent less graphics card memory than Ada Lovelace
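
For anyone who hasn't looked at how SER is actually exposed: on the developer side it shows up in OptiX 8's hit-object API (and via vendor extensions in D3D12/Vulkan) as a split traverse/reorder/invoke sequence. A minimal raygen sketch is below; the ray setup and payload layout are made up for illustration and the host-side pipeline/SBT setup is omitted, so treat it as shape-of-the-API only. The Blackwell claim is that the hardware performs the reorder faster, not that the API changes.

```cpp
// Minimal OptiX 8 raygen sketch of Shader Execution Reordering (SER):
// traverse first, reorder threads by hit coherence, then invoke shading.
// Ray setup and payload layout are invented for illustration; the host-side
// pipeline/SBT setup a real program needs is omitted.
#include <optix.h>

struct Params
{
    OptixTraversableHandle handle;
};
extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__ser_sketch()
{
    const uint3 idx = optixGetLaunchIndex();

    // Placeholder primary ray; a real renderer derives this from its camera.
    float3 origin    = { 0.0f, 0.0f, -1.0f };
    float3 direction = { idx.x * 1e-3f, idx.y * 1e-3f, 1.0f };

    unsigned int p0 = 0, p1 = 0;  // two payload registers

    // Find the hit, but do not run the closest-hit/miss program yet.
    optixTraverse(params.handle, origin, direction,
                  0.0f, 1e16f, 0.0f,                        // tmin, tmax, ray time
                  OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
                  0, 1, 0,                                   // SBT offset, stride, miss index
                  p0, p1);

    // Let the GPU regroup threads so rays with similar hits shade together.
    optixReorder();

    // Run the closest-hit or miss program recorded in the hit object.
    optixInvoke(p0, p1);
}
```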
 
Can't say that this "tech deep dive" is very "tech" or "deep". A proper whitepaper would be nice.
Also +30-40% seem to be confirmed now. 5080 will likely end up being slower than 4090 in direct comparisons - unless the new DLSS model would run a lot better on Blackwell leading to some unexpected gains there.
 
Do we know if Blackwell has Level 5 RT now?

Level 5 – Coherent BVH Processing with Scene Hierarchy Generator in Hardware.

 
Can't say that this "tech deep dive" is very "tech" or "deep". A proper whitepaper would be nice.
Also +30-40% seem to be confirmed now. 5080 will likely end up being slower than 4090 in direct comparisons - unless the new DLSS model would run a lot better on Blackwell leading to some unexpected gains there.
I don't think we'll get one. Just look at datacenter Blackwell and the terrible presentation at Hot Chips. It's just a basic architecture overview. Nvidia stopped showing real architecture details with Blackwell. It seems like arrogance is starting to win at Nvidia everywhere.
 
@Man from Atlantis That's a pretty big change for the SM. Really looking forward to reading this. Curious if the INT32 increase is going to mean anything interesting for performance. My expectation is it won't.

Well Nvidia made a big deal out of saying that INT32 isn’t as important as FP32 to graphics performance. No idea why they’re reversing course now. Maybe it helps more with AI workloads?
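
Either way, the INT32 that shows up even in FP-heavy kernels is mostly addressing, loop counters, and bit packing rather than standalone integer math. A toy CUDA kernel, entirely hypothetical and not anything Nvidia has published, just to show the kind of instruction mix being discussed:

```cpp
// Hypothetical CUDA kernel illustrating how FP32 math and INT32 work
// (index math, loop counters, bit packing) interleave in practice.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mixedIntFp(const float* __restrict__ in,
                           unsigned int* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // INT32: index math
    if (i >= n) return;

    float x = in[i];
    float acc = 0.0f;
    for (int k = 0; k < 16; ++k) {                   // INT32: loop counter
        acc = fmaf(x, acc, 1.0f);                    // FP32: fused multiply-add
        x  *= 0.5f;                                  // FP32
    }

    // INT32: pack the result and a per-thread tag into one word (bit ops).
    unsigned int bits = __float_as_uint(acc);
    out[i] = (bits & 0xFFFFFF00u) | (static_cast<unsigned int>(i) & 0xFFu);
}

int main()
{
    const int n = 1 << 20;
    float* dIn;
    unsigned int* dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(unsigned int));

    mixedIntFp<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaDeviceSynchronize();
    printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

If all the Blackwell CUDA cores really can issue either FP32 or INT32, mixes like this are exactly where it would help; whether that's measurable in games is another question.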
 
Can't say that this "tech deep dive" is very "tech" or "deep". A proper whitepaper would be nice.
Also +30-40% seem to be confirmed now. 5080 will likely end up being slower than 4090 in direct comparisons - unless the new DLSS model would run a lot better on Blackwell leading to some unexpected gains there.

Nvidia’s white papers don’t go into much more detail than we’re seeing on these slides. Their CUDA programming guides are slightly better but they haven’t released one for Blackwell.

Anyone else find it strange that there’s zero mention of L2 cache size and capability after they hyped up Ada L2? This really feels like a coasting generation hardware wise. All the fun stuff is in software.
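
Not that it answers the slide-ware question, but once cards are out the L2 size is at least directly visible from the CUDA runtime (it's the same value the deviceQuery sample prints). Minimal host-side sketch:

```cpp
// Query the L2 cache size the CUDA runtime reports for each visible GPU.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        int l2Bytes = 0;
        cudaDeviceGetAttribute(&l2Bytes, cudaDevAttrL2CacheSize, dev);

        printf("%s: L2 cache = %.1f MiB\n", prop.name, l2Bytes / (1024.0 * 1024.0));
    }
    return 0;
}
```

The "capability" part of the question is the bit only a whitepaper or an updated CUDA programming guide would answer.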
 
Well Nvidia made a big deal out of saying that INT32 isn’t as important as FP32 to graphics performance. No idea why they’re reversing course now. Maybe it helps more with AI workloads?

Kind of wondering if it has something to do with simplifying the SM to improve scheduling of work. You have shader execution reordering and the AMP processor to help schedule AI workloads, and maybe it's just easier to do all of this if the compute resources are all the same.
 
Nvidia’s white papers don’t go into much more detail than we’re seeing on these slides. Their CUDA programming guides are slightly better but they haven’t released one for Blackwell.

Anyone else find it strange that there’s zero mention of L2 cache size and capability after they hyped up Ada L2? This really feels like a coasting generation hardware wise. All the fun stuff is in software.
Judging by the die sizes of GB203/205 in comparison to AD103/104, it seems Nvidia has kept the same L2 cache size.
 
Judging by the die sizes of GB203/205 in comparison to AD103/104, it seems Nvidia has kept the same L2 cache size.

If those die sizes are accurate, then Nvidia somehow managed to cram fatter SMs and four more of them into the same area. I would be surprised if they didn't sacrifice L2 to make room and lean more heavily on GDDR7.
 