Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
That's just the usual pass-through cooling, right? Bliss for cases like the Fractal Ridge, but a nightmare for sandwich cases like the NR200P.

Both fans are pass-through, not just the one at the end. The PCB sits between the fans, and there are daughterboards for the PCIe connector and the video connectors that are designed around the fans.
 
That is a tiny PCB, it's amazing they can cool 575W from such a thing. Really impressive design. I guess making the traces shorter helps it go fast as well.
 
I'm guessing the evolution is -

Pascal -> FP32/INT32

Turing -> FP32 + INT32

Ampere (Ada) -> FP32 + FP32/INT32

Blackwell -> FP32/INT32 + FP32/INT32

So Pascal got it right the first time around? Not sure what difference there is between 1 32-wide SIMD that can do FP32/INT32 and 2 16-wide SIMDs that can do FP32/INT32.

On the contrary, it may well be a requirement for cooling it with a small 2-slot cooler, as it allows both fans to blow through the fin stacks.
AIB models so far seem to be your usual 3-4 slot monstrosities.

If the FE fans are quiet I will be extremely impressed. It's going to depend on case ventilation. It'll need a lot of cool air.
 
Do you think the AIB models will be more effective? TBH I don't know if it's safe for the card to draw >600W unless it has 2 power connectors, so assuming NVIDIA's cooler is sufficient for 575W I don't see what purpose making it huge would serve. I get freaked out having such a heavy thing hanging off my PCIe slot, especially when transporting my PC.
 
I'm not sure how it's actually arranged internally, but probably in a VLIW fashion (i.e. an instruction word contains two instructions). Longer SIMD requires more global parallelism, while shorter SIMD requires more local parallelism. If the cost of a second SIMD unit is low enough, that's likely a win.
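
A toy cycle count can make that trade-off concrete. The sketch below is only an illustration under stated assumptions, not a description of how NVIDIA's hardware is actually arranged: it assumes 32-thread warps, that a unit narrower than the warp needs several passes per instruction, and that dependent instructions cannot overlap at all (real pipelines overlap some of this, and other warps can fill the gaps). The function name and its parameters are hypothetical.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def cycles_to_execute(num_instructions, simd_width, num_units, independent):
    """Toy cycle count for one warp's instruction stream (illustrative only)."""
    # A unit narrower than the warp needs several passes per instruction.
    passes = math.ceil(WARP_SIZE / simd_width)
    if independent:
        # Independent instructions can be spread across all units.
        return math.ceil(num_instructions / num_units) * passes
    # A fully dependent chain runs one instruction at a time on one unit.
    return num_instructions * passes


for label, width, units in (("1 x 32-wide", 32, 1), ("2 x 16-wide", 16, 2)):
    indep = cycles_to_execute(8, width, units, independent=True)
    dep = cycles_to_execute(8, width, units, independent=False)
    print(f"{label}: {indep} cycles independent, {dep} cycles dependent")
```

In this model the two layouts tie whenever there are independent instructions to pair up, and the 2 x 16 layout only falls behind on a dependent chain, which is one way to read "shorter SIMD requires more local parallelism".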
 
I can't remember what the 12V connector is called, but I think it handles 600W on its own, and then you have the PCIe 5.0 power delivery. Honestly, I think dual flow-through should probably cool a lot better. It's basically designed like a 240mm radiator, but with a small GPU PCB stuck directly in the middle. We'll have to wait and see.
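
For the power side, the arithmetic is short. A rough sketch, assuming the connector in question is the 16-pin 12VHPWR / 12V-2x6 type rated for up to 600 W and that the PCIe x16 slot can add up to 75 W; the 575 W figure is the one quoted in this thread, and the constant and function names are made up for illustration.

```python
CONNECTOR_LIMIT_W = 600  # maximum rating of the 16-pin 12VHPWR / 12V-2x6 connector
SLOT_LIMIT_W = 75        # maximum power a PCIe x16 slot may supply
CARD_POWER_W = 575       # board power figure quoted in this thread


def power_headroom(card_w, connector_w=CONNECTOR_LIMIT_W, slot_w=SLOT_LIMIT_W):
    """Watts left between the card's draw and the combined delivery limits."""
    return connector_w + slot_w - card_w


print(power_headroom(CARD_POWER_W))            # 100
print(power_headroom(CARD_POWER_W, slot_w=0))  # 25
```

So even if the slot contributed nothing, a single connector would still sit slightly above the quoted 575 W under these assumptions.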

There are a lot of new cases that have fans that pull in air from the bottom, which is perfect for this flow-through GPU design. I'd stick an AIO rad for the CPU in the front of the case as an intake, have bottom intake fans for the GPU, and then have exhaust fans in the back and top.

[Attached image: 1736486325541.png]
 
Unfortunately, the FP32 / INT32 split is an oversimplification of how the HW actually works. There are a bunch of execution ports and on GA102 & newer HW, the typical operations are really split into 3 pipelines: FMAHeavy (FP32 FMA/FMUL/FADD and 32x32 integer multiply-add), FMALite (FP32 FMA/FMUL/FADD only), and INT (integer add, logic ops, *and* floating point comparison including FP32 min/max etc...). Each of the 4 SM sub-cores can only issue 1 instruction per clock (with each of those execution pipelines taking 2 cycles minimum to execute any instruction, you can never issue to the same pipeline two cycles in a row given the 16-wide ALUs with 32-wide warps) so it is impossible to saturate all 3 pipelines simultaneously.

The HW is limited by schedule/fetch/decode/issue logic of 1/clk before even considering that this 1/clk also includes all control/branching/special function/memory instructions... of course, in practice, you might be limited by register file bandwidth or other bottlenecks before being limited by the 1/clk rate, but I believe that rate is indeed a fairly common bottleneck.
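
The issue-rate argument can be checked with a small simulation. This is a toy model under the assumptions stated in the post (three pipelines, each occupied for 2 cycles per warp instruction, one issue per clock per sub-core), not NVIDIA's actual scheduler; the function name, its parameters, and the round-robin policy are hypothetical.

```python
def simulate_issue(num_cycles, pipelines=("FMAHeavy", "FMALite", "INT"),
                   occupancy=2, issue_per_clk=1):
    """Return the fraction of cycles each pipeline is busy in a toy model."""
    free_at = {p: 0 for p in pipelines}  # first cycle each pipeline is free
    busy = {p: 0 for p in pipelines}     # busy cycles inside the window
    for clk in range(num_cycles):
        issued = 0
        # Rotate the starting pipeline so the work is spread evenly.
        for i in range(len(pipelines)):
            if issued == issue_per_clk:
                break
            p = pipelines[(clk + i) % len(pipelines)]
            if free_at[p] <= clk:
                free_at[p] = clk + occupancy
                # Only count the part of the occupancy inside the window.
                busy[p] += min(occupancy, num_cycles - clk)
                issued += 1
    return {p: busy[p] / num_cycles for p in pipelines}


utilization = simulate_issue(3000)
for name, value in utilization.items():
    print(f"{name}: {value:.1%} busy")
```

With one issue per clock, each issued instruction supplies 2 pipeline-cycles of work while three pipelines offer 3 per clock, so average utilization cannot exceed about two thirds however the work is distributed. Calling `simulate_issue(3000, issue_per_clk=2)` in the same model keeps all three pipelines essentially fully busy, which is purely a what-if and not a claim about any shipping hardware.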

It's impossible to say exactly what NVIDIA has done here before we get access to the SASS disassembly & can write microbenchmarks, but for example there's no good reason for them to double the 32x32 multiply-add pipeline; actually now that they have the TMA, even 1/2 rate 32x32 IMAD feels very excessive to me (AMD is 1/4 rate iirc) but maybe it still makes sense to keep it for the majority of workloads that do not use TMA...

Doubling the "integer" pipe (which really isn't just integer - e.g. FP32 min/max) would be nice, but what I really want to see is an increase to the instruction fetch/decode rate, and ideally something better than their extremely low density 128-bit fixed length ISA ... Hopper has proven low code density doesn't really matter for matrix multiplication & deep learning especially if you just use much longer asynchronous TMA & WGMMA instructions (instruction decoder is actually idle a lot of the time for Hopper matrix multiplication kernels!) but it's still an obvious weakness of the architecture, including for things like raytracing where instruction cache misses are common and NVIDIA's very aggressive prefetching won't work as well (or might even have to be disabled completely for certain workloads afaik).
 
Maybe Gamers Nexus will get a new interview with that Nvidia guy they did the 2 videos with for the last cooler. They were some fascinating videos and he was very open about it all.
 
Maybe someone wants to measure the die size of GB203:

[Attached image: Ggu9yQjaEAAKTke]

 
When was the last architecture in which instruction issue and the ISA were substantially improved? Turing?
 
Volta was the last "completely new ISA" - every generation since then has mostly "just" been (significant!) additions on top of the same baseline ISA, e.g. Turing's uniform registers are a pretty big change but the existing Volta instructions didn't change much. NVIDIA doesn't explain what any SASS instruction actually does, but they do list them in their entirety(?) on this page, and you can see how Hopper is roughly a superset of Volta: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html

NVIDIA only created a new ISA from ~scratch 3 times in the CUDA era: G80, Fermi, and Volta. But Kepler & Maxwell were major ISA changes too (arguably bigger than Turing/Hopper although it's hard to tell from the outside how much of the microarchitecture was redesigned for any of those). There's this great page with the raw ISA encodings for H100 and AD102 which is interesting as well: https://github.com/kuterd/nv_isa_solver?tab=readme-ov-file

I talk (a little bit) about this in my GPU MODE lecture on SASS & GPU Microarchitecture:
 
Aren't the PSU shrouds usually on the bottom? Where is the PSU going with this flow? Moved up front or really wide case with side mount?
 
Yeah, my newest case is this same dual chamber design. It seems like a bit of wasted space to me buuuuut it does keep the airflow cleaner.
 