Nvidia Blackwell Architecture Speculation

DegustatoR · Jan 10, 2025

CarstenS said:
I don't know of any new docs. FP32 does not need to double - and I am not aware of such a thing. But in Ada, only one of each SIMD16 pairs was combined INT32/FP32, the other one was FP32-only.

Makes sense I guess.

Scott_Arm · Jan 10, 2025

I would guess they’re disabled as part of the manufacturing process so they can’t be repaired or enabled.

Scott_Arm · Jan 10, 2025

orangpelupa said:
That's just the usual pass thru cooling right? A bliss for cases like fractal ridge, but a nightmare for sandwich cases like nr200p

Both fans are pass through, not just the one at the end. The pcb sits between the fans and there are daughter boards for the pcie connector and the video connectors that are designed around the fans.

homerdog · Jan 10, 2025

That is a tiny PCB, it's amazing they can cool 575W from such a thing. Really impressive design. I guess making the traces shorter helps it go fast as well.

DegustatoR · Jan 10, 2025

homerdog said:
it's amazing they can cool 575W from such a thing

On the opposite it may well be a requirement to cool it with a small 2-slot cooler as it allows both fans to blow through the fin stacks.
AIB models so far seem to be your usual 3-4 slot monstrosities.

trinibwoy · Jan 10, 2025

arandomguy said:
I'm guessing the evolution is -

Pascal-> FP32/INT32

Turing -> FP32 + INT32

Ampere (Ada)-> FP32 + FP32/INT32

Blackwell-> FP32/INT32 + FP32/INT32

So Pascal got it right the first time around? Not sure what difference there is between 1 32-wide SIMD that can do FP32/INT32 and 2 16-wide SIMDs that can do FP32/INT32.

DegustatoR said:
On the opposite it may well be a requirement to cool it with a small 2-slot cooler as it allows both fans to blow through the fin stacks.
AIB models so far seem to be your usual 3-4 slot monstrosities.

If the FE fans are quiet I will be extremely impressed. It's going to depend on case ventilation. It'll need a lot of cool air.

homerdog · Jan 10, 2025

DegustatoR said:
On the opposite it may well be a requirement to cool it with a small 2-slot cooler as it allows both fans to blow through the fin stacks.
AIB models so far seem to be your usual 3-4 slot monstrosities.

Do you think the AIB models will be more effective? TBH I don't know if it's safe for the card to draw >600W unless it has 2 power connectors, so assuming NVIDIA's cooler is sufficient for 575W I don't see what purpose making it huge would serve. I get freaked out having such a heavy thing hanging off my PCIe slot, especially when transporting my PC.

DegustatoR · Jan 10, 2025

homerdog said:
Do you think the AIB models will be more effective?

They might be. We'll have to wait for reviews to know that.

pcchen · Jan 10, 2025

trinibwoy said:
So Pascal got it right the first time around? Not sure what difference there is between 1 32-wide SIMD that can do FP32/INT32 and 2 16-wide SIMDs that can do FP32/INT32.

I'm not sure how it's actually arranged internally, but probably in a VLIW fashion (i.e. an instruction word contains two instructions). Longer SIMD requires more global parallelism, while shorter SIMD requires more local paralellism. If the cost of a second SIMD unit is low enough that's likely a win.

Scott_Arm · Jan 10, 2025

homerdog said:
Do you think the AIB models will be more effective? TBH I don't know if it's safe for the card to draw >600W unless it has 2 power connectors, so assuming NVIDIA's cooler is sufficient for 575W I don't see what purpose making it huge would serve. I get freaked out having such a heavy thing hanging off my PCIe slot, especially when transporting my PC.

I can't remember what the 12v connector is called, but I think it handles 600W on its own, and then you have the pcie5.0 power delivery. Honestly, I think dual flow-through should probably cool a lot better. It's basically designed like a 240mm radiator, but with a small gpu pcb stuck directly in the middle. We'll have to wait and see.

There are a lot of new cases that have fans that pull in air from the bottom, which is perfect for this flow through gpu design. I'd stick an aio rad for the cpu in the front of the case as an intake, have bottom intake fans for the gpu and then have exhaust fans in the back and top.

Arun · Jan 10, 2025

arandomguy said:
I'm guessing the evolution is -

Pascal-> FP32/INT32

Turing -> FP32 + INT32

Ampere (Ada)-> FP32 + FP32/INT32

Blackwell-> FP32/INT32 + FP32/INT32

Unfortunately, the FP32 / INT32 split is an oversimplification of how the HW actually works. There are a bunch of execution ports and on GA102 & newer HW, the typical operations are really split into 3 pipelines: FMAHeavy (FP32 FMA/FMUL/FADD and 32x32 integer multiply-add), FMALite (FP32 FMA/FMUL/FADD only), and INT (integer add, logic ops, *and* floating point comparison including FP32 min/max etc...). Each of the 4 SM sub-cores can only issue 1 instruction per clock (with each of those execution pipelines taking 2 cycles minimum to execute any instruction, you can never issue to the same pipeline two cycles in a row given the 16-wide ALUs with 32-wide warps) so it is impossible to saturate all 3 pipelines simultaneously.

The HW is limited by schedule/fetch/decode/issue logic of 1/clk before even considering that this 1/clk also includes all control/branching/special function/memory instructions... of course, in practice, you might be limited by register file bandwidth or other bottlenecks before being limited by the 1/clk rate, but I believe that rate is indeed a fairly common bottleneck.

It's impossible to say exactly what NVIDIA has done here before we get access to the SASS disassembly & can write microbenchmarks, but for example there's no good reason for them to double the 32x32 multiply-add pipeline; actually now that they have the TMA, even 1/2 rate 32x32 IMAD feels very excessive to me (AMD is 1/4 rate iirc) but maybe it still makes sense to keep it for the majority of workloads that do not use TMA...

Doubling the "integer" pipe (which really isn't just integer - e.g. FP32 min/max) would be nice, but what I really want to see is an increase to the instruction fetch/decode rate, and ideally something bettere than their extremely low density 128-bit fixed length ISA ... Hopper has proven low code density doesn't really matter for matrix multiplication & deep learning especially if you just use much longer asynchronous TMA & WGMMA instructions (instruction decoder is actually idle a lot of the time for Hopper matrix multiplication kernels!) but it's still an obvious weakness of the architecture, including for things like raytracing where instruction cache misses are common and NVIDIA's very aggressive prefetching won't work as well (or might even have to be disabled completely for certain workloads afaik).

GhostofWar · Jan 10, 2025

Maybe Gamers Nexus will get a new interview with that Nvidia guy they did the 2 videos with for the last cooler. They were some fascinating videos and he was very open about it all.

troyan · Jan 10, 2025

Maybe someone wants to measure the die size of GB203:

https://twitter.com/x/status/1876802378650775733

Man from Atlantis · Jan 10, 2025

troyan said:
Maybe someone wants to measure the die size of GB203:

https://twitter.com/x/status/1876802378650775733

quick autocad says 395-422mm2

techuse · Jan 10, 2025

Arun said:
Unfortunately, the FP32 / INT32 split is an oversimplification of how the HW actually works. There are a bunch of execution ports and on GA102 & newer HW, the typical operations are really split into 3 pipelines: FMAHeavy (FP32 FMA/FMUL/FADD and 32x32 integer multiply-add), FMALite (FP32 FMA/FMUL/FADD only), and INT (integer add, logic ops, *and* floating point comparison including FP32 min/max etc...). Each of the 4 SM sub-cores can only issue 1 instruction per clock (with each of those execution pipelines taking 2 cycles minimum to execute any instruction, you can never issue to the same pipeline two cycles in a row given the 16-wide ALUs with 32-wide warps) so it is impossible to saturate all 3 pipelines simultaneously.

The HW is limited by schedule/fetch/decode/issue logic of 1/clk before even considering that this 1/clk also includes all control/branching/special function/memory instructions... of course, in practice, you might be limited by register file bandwidth or other bottlenecks before being limited by the 1/clk rate, but I believe that rate is indeed a fairly common bottleneck.

It's impossible to say exactly what NVIDIA has done here before we get access to the SASS disassembly & can write microbenchmarks, but for example there's no good reason for them to double the 32x32 multiply-add pipeline; actually now that they have the TMA, even 1/2 rate 32x32 IMAD feels very excessive to me (AMD is 1/4 rate iirc) but maybe it still makes sense to keep it for the majority of workloads that do not use TMA...

Doubling the "integer" pipe (which really isn't just integer - e.g. FP32 min/max) would be nice, but what I really want to see is an increase to the instruction fetch/decode rate, and ideally something bettere than their extremely low density 128-bit fixed length ISA ... Hopper has proven low code density doesn't really matter for matrix multiplication & deep learning especially if you just use much longer asynchronous TMA & WGMMA instructions (instruction decoder is actually idle a lot of the time for Hopper matrix multiplication kernels!) but it's still an obvious weakness of the architecture, including for things like raytracing where instruction cache misses are common and NVIDIA's very aggressive prefetching won't work as well (or might even have to be disabled completely for certain workloads afaik).

When was the last architecture in which instruction issue and the ISA was substantially improved? Turing?

Arun · Jan 11, 2025

techuse said:
When was the last architecture in which instruction issue and the ISA was substantially improved? Turing?

Volta was the last "completely new ISA" - every generation since then has mostly "just" been (significant!) additions on top of the same baseline ISA, e.g. Turing uniform registers is a pretty big change but the existing Volta instructions didn't change much. NVIDIA doesn't explain what any SASS instruction actually does, but they do list them in their entirety(?) on this page, and you can see how Hopper is roughly a superset of Volta: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

NVIDIA only created a new ISA from ~scratch 3 times in the CUDA era: G80, Fermi, and Volta. But Kepler & Maxwell were major ISA changes too (arguably bigger than Turing/Hopper although it's hard to tell from the outside how much of the microarchitecture was redesigned for any of those). There's this great page with the raw ISA encodings for H100 and AD102 which is interesting as well: https://github.com/kuterd/nv_isa_solver?tab=readme-ov-file

I talk (a little bit) about this in my GPU MODE lecture on SASS & GPU Microarchitecture:

Malo · Jan 11, 2025

Scott_Arm said:
I can't remember what the 12v connector is called, but I think it handles 600W on its own, and then you have the pcie5.0 power delivery. Honestly, I think dual flow-through should probably cool a lot better. It's basically designed like a 240mm radiator, but with a small gpu pcb stuck directly in the middle. We'll have to wait and see.

There are a lot of new cases that have fans that pull in air from the bottom, which is perfect for this flow through gpu design. I'd stick an aio rad for the cpu in the front of the case as an intake, have bottom intake fans for the gpu and then have exhaust fans in the back and top.

View attachment 12826

Aren't the PSU shrouds usually on the bottom? Where is the PSU going with this flow? Moved up front or really wide case with side mount?

Scott_Arm · Jan 11, 2025

Malo said:
Aren't the PSU shrouds usually on the bottom? Where is the PSU going with this flow? Moved up front or really wide case with side mount?

These dual chamber cases have become the most popular style and you can normally put fans in the bottom.

H6 Flow | Gaming PC Cases | NZXT

NZXT Compact Dual-Chamber Mid-Tower Airflow Case

nzxt.com

Albuquerque · Jan 11, 2025

Yeah, my newest case is this same dual chamber design. It seems like a bit of wasted space to me buuuuut it does keep the airflow cleaner.

DavidGraham · Jan 11, 2025

Live comparison between DLSS CNN model vs DLSS Transformer model.

Nvidia Blackwell Architecture Speculation

DegustatoR

Scott_Arm

Scott_Arm

homerdog

donator of the year

DegustatoR

trinibwoy

Meh

homerdog

donator of the year

DegustatoR

pcchen

Moderator

Scott_Arm

Arun

Unknown.

GhostofWar

troyan

Man from Atlantis

idk

techuse

Arun

Unknown.

Malo

Yak Mechanicum

Scott_Arm

H6 Flow | Gaming PC Cases | NZXT

Albuquerque

Red-headed step child

DavidGraham