Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
In the original Lovelace leaks, the AD102 was going to include 144 SMs. And while I think we discovered that's technically true for a die with zero execution units fused off, the actual shipping AD102 product only exposes 128 SMs. Thus, with 128 CUDA cores per SM, that provides the 16,384 total CUDA cores of the consumer RTX 4090.

This said, I wonder if the rumored 192 SMs on GB202 is the total units on a 100% functional die, or the expected usable units on a production-release die... 192 SMs would deliver a 50% increase over the 4090's 128, for 24,576 CUDA cores. This is roughly in line with, albeit a little short of, the jump in core counts from the 3090 to the 4090...
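For anyone who wants to check the core-count arithmetic, here is a minimal sketch. It assumes Blackwell keeps 128 CUDA cores per SM like Lovelace, which is the post's premise rather than anything confirmed, and the function name is just made up for illustration.

```python
# CUDA cores = enabled SMs x cores per SM.
# ASSUMPTION: 128 cores per SM carries over to Blackwell (unconfirmed).
CORES_PER_SM = 128


def cuda_cores(sm_count: int, cores_per_sm: int = CORES_PER_SM) -> int:
    """Total CUDA cores for a given number of enabled SMs."""
    return sm_count * cores_per_sm


ad102_full = cuda_cores(144)     # fully enabled AD102 die
rtx_4090 = cuda_cores(128)       # shipping RTX 4090
gb202_rumored = cuda_cores(192)  # rumored GB202 SM count

print(ad102_full, rtx_4090, gb202_rumored)
print(f"vs 4090:       +{gb202_rumored / rtx_4090 - 1:.0%}")
print(f"vs full AD102: +{gb202_rumored / ad102_full - 1:.0%}")
```

Note the uplift shrinks from 50% to about 33% if the 192 figure is a full die and you compare it against the full 144 SM AD102 instead of the 4090.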
I mean, Nvidia was planning a 142SM, 2.95GHz boost 4090 Ti according to XpeaGPU on Twitter, held in reserve against AMD in case the 7900 XTX could dethrone the 4090 even in raster. And Blackwell apparently has at least a doubling of ROPs per GPC, plus whatever architectural changes are planned for Blackwell (I mean, Lovelace is basically a souped-up Ampere on TSMC Custom N5). So I can see a full 5090 Ti doing like 70-ish% faster raster or so.
 
We can judge by arch and available bandwidth: 384-bit bus for power/heat island reasons. 32/21 = 1.52. 32Gbps is the fastest "standard", and thus likely, GDDR7 available anytime soon, though Samsung's "we can totally achieve" (deliver, though?) 36Gbps is indeed 71% faster than the 4090's 21Gbps.

So 50% is likely, with 70% as an outside chance. There's not much room to get around the bandwidth that's available by early next year, so that feels like a pretty good prediction.
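The bandwidth ratios above can be reproduced with a quick sketch. It assumes a 384-bit bus on both cards and uses the 21, 32 and 36 Gbps per-pin rates from the post; the helper name is hypothetical.

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8."""
    return bus_width_bits * data_rate_gbps / 8


# ASSUMPTION: 384-bit bus in every case, per-pin rates as quoted in the post.
rtx_4090 = bandwidth_gb_s(384, 21)  # GDDR6X at 21 Gbps
gddr7_32 = bandwidth_gb_s(384, 32)  # "standard" GDDR7
gddr7_36 = bandwidth_gb_s(384, 36)  # Samsung's claimed top speed

for label, bw in [("32 Gbps", gddr7_32), ("36 Gbps", gddr7_36)]:
    print(f"{label}: {bw:.0f} GB/s, {bw / rtx_4090:.2f}x the 4090")
```

Since the bus width is the same on both sides, the ratio is just the per-pin rate ratio, which is why 32/21 is all you need.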

Either way I'll go with "Blackwell is a massively scaled up/monolithic arch, again, because all Nvidia cares about at the moment is AI so that's where 99% of the engineering goes". Meaning that ludicrous 128MB SRAM cache leak might be true.
 
I mean, Nvidia was planning a 142SM, 2.95GHz boost 4090 Ti according to XpeaGPU on Twitter, held in reserve against AMD in case the 7900 XTX could dethrone the 4090 even in raster. And Blackwell apparently has at least a doubling of ROPs per GPC, plus whatever architectural changes are planned for Blackwell (I mean, Lovelace is basically a souped-up Ampere on TSMC Custom N5). So I can see a full 5090 Ti doing like 70-ish% faster raster or so.
What would be the need for a doubling of ROPs?
 
We really don't know much about Blackwell yet. Any SM numbers are thus completely pointless to suggest because we don't know the extent of changes inside these SMs.
 
  • GB207 - 28SM, 96 bit, GDDR7, 3x GPC, PCIe 5.0 x4.
  • GB206 - 44SM, 128 bit, GDDR7, 3x GPC, PCIe 5.0 x8.
  • GB205 - 72SM, 192 bit, GDDR7, 3x GPC, PCIe 5.0 x8.
  • GB203 - 108SM, 256 bit, GDDR7, 3x GPC, PCIe 5.0 x16.
  • GB202 - 192SM, 384 bit, GDDR7, 3x GPC, PCIe 5.0 x16.
Quick comparison:
AD107 - 24SM, 128 bit | +4 SM, +17%
AD106 - 36SM, 128 bit | +8 SM, +22%
AD104 - 60SM, 192 bit | +12 SM, +20%
AD103 - 80SM, 256 bit | +28 SM, +35%
AD102 - 144SM, 384 bit | +48 SM, +33%

They seem like reasonable guesstimates at first glance
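The deltas in that comparison check out. A small sketch to recompute them, using the full-die Ada SM counts and the rumored Blackwell counts from the list above (all rumor, and the function name is hypothetical):

```python
def sm_delta(old_sms: int, new_sms: int) -> tuple[int, int]:
    """Return (absolute SM increase, percentage increase rounded to nearest integer)."""
    diff = new_sms - old_sms
    return diff, round(diff * 100 / old_sms)


# (Ada full-die SMs, rumored Blackwell SMs); the Blackwell figures are unconfirmed.
pairs = {
    "AD107 -> GB207": (24, 28),
    "AD106 -> GB206": (36, 44),
    "AD104 -> GB205": (60, 72),
    "AD103 -> GB203": (80, 108),
    "AD102 -> GB202": (144, 192),
}

for name, (old, new) in pairs.items():
    diff, pct = sm_delta(old, new)
    print(f"{name}: +{diff} SM, +{pct}%")
```

Keep in mind these are full-die to full-die numbers, so the AD102 row uses 144 SMs rather than the 128 the 4090 actually ships with.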
 
Depends if it's really PCIe 5.0 x8 only. Samsung did just release an SSD that does PCIe 5.0 x2, or x4 at lower speeds.
So it would be 5.0 x8 or 4.0 x16? To do all the work to connect 16 lanes only to disable half of them on most systems would be odd.
 
So 50% is likely, with 70% as an outside chance. There's not much room to get around the bandwidth that's available by early next year, so that feels like a pretty good prediction.
Well, that's a full GB202 being 70% faster in raster than a 4090. I definitely don't see a full-die 5090 Ti having 70+ percent faster raster than the canned 4090 Ti.
What would be the need for a doubling of ROPs?
Increase raster performance. I get the impression Blackwell is kinda similar to Maxwell in some respects, potentially focusing on "real world" performance over just raw compute. Compare GTX 980 vs GTX 680 (basically GM204 vs GK104) and you see quite a large architecture-to-architecture leap on the same process (70% raster jump according to the TechPowerUp GTX 680 page: https://www.techpowerup.com/gpu-specs/geforce-gtx-680.c342). And since GB202 & 203 are likely to be on 3nm, which is a significant shrink, I can see a pretty good performance jump even if RT & PT get more of a focus. Like if RGT is right and die to die is basically 60% from AD102/103 to GB202/203 in raster with RT & PT way higher, then that's a great jump (hell, even 50% raster with 100% PT would be great apples to apples, i.e. a 5080 Ti/Super with full GB203 die over the 4080 Super).
Quick comparison:
AD107 - 24SM, 128 bit | +4 SM, +17%
AD106 - 36SM, 128 bit | +8 SM, +22%
AD104 - 60SM, 192 bit | +12 SM, +20%
AD103 - 80SM, 256 bit | +28 SM, +35%
AD102 - 144SM, 384 bit | +48 SM, +33%

They seem like reasonable guesstimates at first glance
Yeah, which makes me think 202 & 203 use N3E while the other three use N4P, which despite being a small shrink from N5 is probably good enough. Like if a full 18GB G7 (3GB modules) GB205 config is on par with a 4080 Super with less power consumption, like 250W, then that's good enough if priced appropriately.
 
Increase raster performance. I get the impression Blackwell is kinda similar to Maxwell in some respects, potentially focusing on "real world" performance over just raw compute. Compare GTX 980 vs GTX 680 (basically GM204 vs GK104) and you see quite a large architecture-to-architecture leap on the same process (70% raster jump according to the TechPowerUp GTX 680 page: https://www.techpowerup.com/gpu-specs/geforce-gtx-680.c342). And since GB202 & 203 are likely to be on 3nm, which is a significant shrink, I can see a pretty good performance jump even if RT & PT get more of a focus. Like if RGT is right and die to die is basically 60% from AD102/103 to GB202/203 in raster with RT & PT way higher, then that's a great jump (hell, even 50% raster with 100% PT would be great apples to apples, i.e. a 5080 Ti/Super with full GB203 die over the 4080 Super).
I thought ROPs were of little use now and never a bottleneck in practice.
 
Either way I'll go with "Blackwell is a massively scaled up/monolithic arch, again, because all Nvidia cares about at the moment is AI so that's where 99% of the engineering goes". Meaning that ludicrous 128MB SRAM cache leak might be true.

It’s almost guaranteed to be monolithic given chiplets haven’t been shown to provide any advantages for gaming GPUs. Not yet at least.

Bigger caches would be nice. Bigger register files too. Higher clocks would make it a trifecta.
 
It’s almost guaranteed to be monolithic given chiplets haven’t been shown to provide any advantages for gaming GPUs. Not yet at least.

Bigger caches would be nice. Bigger register files too. Higher clocks would make it a trifecta.

RDNA3 costs less to build than RDNA2, chiplets are all about lowering cost, there's no performance advantage and never will be.
 
RDNA3 costs less to build than RDNA2, chiplets are all about lowering cost, there's no performance advantage and never will be.
Well, at some point it's going to be less expensive to build chiplets versus a monolithic die with the same capacities. AD102 is already roughly 76 billion transistors, which is along the same lines as a full Threadripper 7995WX, which is 96 hardware cores, 192 hardware threads, 486MB of total cache, an octal-channel memory controller, and a 128-lane PCIe Gen5 root complex. Sure, we have a little more room to continue cramming hojillions of transistors into a single die, but we're getting awfully close to the end of the road appearing on the horizon.
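To put a rough shape on the cost argument, here is a minimal sketch using the textbook Poisson yield model, where yield = exp(-area x defect density). Everything numeric in it is an assumption for illustration only: the 600 mm² die, the 0.1 defects/cm² figure, and the even two-way split are made up, and it ignores packaging cost, wafer edge losses, and selling partially defective dies with units fused off.

```python
import math


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free dies under the simple Poisson yield model."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def silicon_per_good_product(die_area_mm2: float, dies_per_product: int,
                             defects_per_cm2: float) -> float:
    """Wafer area (mm^2) consumed per working product, counting scrapped dies."""
    good_fraction = poisson_yield(die_area_mm2, defects_per_cm2)
    return dies_per_product * die_area_mm2 / good_fraction


D0 = 0.1  # HYPOTHETICAL defect density in defects/cm^2, not a foundry figure

monolithic = silicon_per_good_product(600, 1, D0)  # one 600 mm^2 die
chiplets = silicon_per_good_product(300, 2, D0)    # two 300 mm^2 dies

print(f"monolithic: {monolithic:.0f} mm^2 of wafer per good GPU")
print(f"chiplets:   {chiplets:.0f} mm^2 of wafer per good GPU")
```

Under these assumptions the split design burns less silicon per working product, and the gap widens as the die gets bigger or the defect density goes up. Whether that outweighs packaging and interconnect overhead is exactly what the thread is arguing about.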
 
Well, at some point it's going to be less expensive to build chiplets versus a monolithic die with the same capacities. AD102 is already roughly 76 billion transistors, which is along the same lines as a full Threadripper 7995WX, which is 96 hardware cores, 192 hardware threads, 486MB of total cache, an octal-channel memory controller, and a 128-lane PCIe Gen5 root complex. Sure, we have a little more room to continue cramming hojillions of transistors into a single die, but we're getting awfully close to the end of the road appearing on the horizon.
It already is cheaper according to the post you quoted? In fact the whole post was about it being cheaper to do RDNA3 (chiplets) vs RDNA2 (mono)
 
Sorry, Trinibwoy is where my thoughts were going. I'm not sure if it's objectively true that monolithic dies are cheaper (for now), but I wager there's a point in time where it's literally not possible to create a bigger chip any longer. Somewhere around that point, it won't even be a matter of economics, it will be a matter of physical necessity.
 
If it were cheaper, AMD would have a full lineup on 5nm like Nvidia. The fact that N33 is on 6nm makes it clear that 5nm still costs a lot more.
 
If it were cheaper, AMD would have a full lineup on 5nm like Nvidia. The fact that N33 is on 6nm makes it clear that 5nm still costs a lot more.
Of course N5 is more expensive. That's why they use less of it, and N6 for the rest. With N33 being so small, chiplets weren't feasible, so they went N6 only.
 