Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
That said I expect RTX 60 GeForce to use 3GB modules across the board (maybe 6090 has 42GB and a 6090 Ti with 48GB? And hopefully RDNA 5/UDNA Gen 1 has a top end option as well).
This would depend on the bus widths, which themselves would depend on the memory speeds. The RTX 6060 and 6070 would likely be bandwidth-limited if they retain the 128-bit and 192-bit memory buses of Lovelace and Blackwell unless they use faster GDDR7 than Blackwell is using. And if they do bump up the bus widths a tier (going back to Ampere sizes) then I doubt Nvidia would be generous enough to go with 3GB modules.
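To make the bus-width/module-size trade-off concrete, here is a minimal sketch. It assumes one 32-bit GDDR7 device per 32-bit slice of the bus, with optional clamshell (two devices per slice) mounting. `vram_gb` is a hypothetical helper, not anything from NVIDIA.

```python
# Sketch: VRAM capacity implied by a GDDR bus width and per-module density.
# Assumption: one 32-bit memory device per 32-bit slice of the bus;
# clamshell (double-sided) mounting puts two devices on each slice.

def vram_gb(bus_width_bits: int, module_gb: int, clamshell: bool = False) -> int:
    """Return total VRAM in GB for a given bus width and module size."""
    if bus_width_bits % 32 != 0:
        raise ValueError("bus width must be a multiple of 32 bits")
    modules = bus_width_bits // 32      # one device per 32-bit slice
    if clamshell:
        modules *= 2                    # two devices share each slice
    return modules * module_gb

if __name__ == "__main__":
    for bus in (128, 192, 256, 448, 512):
        print(f"{bus}-bit: {vram_gb(bus, 2)} GB with 2GB modules, "
              f"{vram_gb(bus, 3)} GB with 3GB modules")
```

Under these assumptions the 42GB and 48GB figures correspond to 448-bit and 512-bit buses with 3GB modules, while 128-bit and 192-bit parts would land at 12GB and 18GB.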
 
This would depend on the bus widths, which themselves would depend on the memory speeds. The RTX 6060 and 6070 would likely be bandwidth-limited if they retain the 128-bit and 192-bit memory buses of Lovelace and Blackwell unless they use faster GDDR7 than Blackwell is using. And if they do bump up the bus widths a tier (going back to Ampere sizes) then I doubt Nvidia would be generous enough to go with 3GB modules.
I haven't seen anyone benchmark Blackwell with varying memory speeds, but I suspect it has far more bandwidth than it needs. A 5070 is slower on average than a 4070 Ti but has 33% more memory bandwidth. Even with no increase in memory bandwidth and no architecture improvements (just a node shrink and more SMs), I'm sure a 6070 could easily be ~40% faster than a 5070 on the same 192-bit 28 Gbps memory layout.

And it'll probably be more like 34-36 Gbps GDDR7, so 60-70% faster on the same bus width would still be the same perf:bandwidth ratio as Ada.
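The perf:bandwidth argument is simple arithmetic. Here is a minimal sketch, assuming the published 192-bit/21 Gbps (4070 Ti, GDDR6X) and 192-bit/28 Gbps (5070, GDDR7) configurations and the speculative 34-36 Gbps figures above. `bandwidth_gbs` is a hypothetical helper.

```python
# Sketch: peak memory bandwidth = (bus width in bytes) x (per-pin data rate).
# 34-36 Gbps are the speculative next-gen GDDR7 speeds from the post.

def bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s for a bus width (bits) and data rate (Gbps/pin)."""
    return bus_width_bits / 8 * gbps_per_pin

rtx_4070_ti = bandwidth_gbs(192, 21)   # 504 GB/s
rtx_5070 = bandwidth_gbs(192, 28)      # 672 GB/s

print(f"5070 vs 4070 Ti: {rtx_5070 / rtx_4070_ti - 1:+.0%} bandwidth")
for speed in (34, 36):
    bw = bandwidth_gbs(192, speed)
    # Extra bandwidth vs the 4070 Ti = how much faster a card could be
    # while keeping the 4070 Ti's performance-per-bandwidth ratio.
    print(f"192-bit @ {speed} Gbps: {bw:.0f} GB/s, "
          f"{bw / rtx_4070_ti - 1:+.0%} vs 4070 Ti")
```

With the 4070 Ti as the Ada reference point, 34-36 Gbps on a 192-bit bus gives roughly 62-71% more bandwidth. That lines up with the 60-70% range above.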
 
That said I expect RTX 60 GeForce to use 3GB modules across the board (maybe 6090 has 42GB and a 6090 Ti with 48GB? And hopefully RDNA 5/UDNA Gen 1 has a top end option as well).

I wouldn't necessarily assume that. They were selective with Ampere between 1GB and 2GB modules. From a product segmentation perspective, they could view 16GB as "enough" and leave those cards on 2GB modules, akin to the 256-bit Ampere configurations.
 
The hits keep coming.

Alleged fire hazard recall on 5090s.

Your Game Specialist has retracted the claim that Nvidia has issued an RTX 5090 recall. The customer who was told that they could not receive their graphics card due to the recall is now getting their unit next week and the retail outlet is conducting an internal investigation. The new story can be found here.
 
If NVIDIA lived off PC GPUs alone, it would be on the verge of dying of its own success.

More than 78 million PC GPUs were shipped during the last quarter of 2024. 😮😮

NVIDIA leads this market with a 65% share, while Intel and AMD settle for 16% and 18%.


Those are PC GPU shipments including IGPs. Discrete GPU shipments are more like 70m per year, not per quarter.

Intel leads with 65%, followed by AMD, with Nvidia last at 16%.
 
[image]

And another one:

[image]

A couple of interesting points vs the 40 series:
  • (130/)150/180W vs 115/160(/165) (4060/4060Ti8(/Ti16)). 5060Ti16 will probably have more than 180W TDP as well.
  • All cards get some minor bumps in SPs:
    • 5050 2560 vs something (3050 has 2560 as well but likely way lower clocks)
    • 5060 3840 vs 3072 in 4060
    • 5060Ti 4608 vs 4352 in 4060Ti
  • If we use the 2.7GHz as a sustained boost clock this gives us:
    • 5050 ~14TF (vs 16.5 on 4060)
    • 5060 ~21TF (vs 23.5 on 4060Ti)
    • 5060Ti ~25TF (vs 32 on 4070)
  • So these will likely be similar to the 5070 vs 4070 Super, where faster memory may let them match or even beat the 40 series parts positioned one tier higher.
  • The exception is the 5050, though, which will use G6 and will thus likely end up slower than the 4060.
  • I also have doubts about the 5060Ti being able to outrun the 4070; they will likely end up even at best.
  • It's interesting that GB207 has kept a 128-bit bus. This chip will seemingly end up being very small: AD107 is 150 mm² with 20% more SPs.
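The TFLOPS figures above follow from shader count × 2 FLOPs per FMA × clock. Here is a minimal sketch, assuming the post's 2.7 GHz sustained boost for every part (an assumption, not a spec) and one FP32 FMA per shader per clock. `fp32_tflops` is a hypothetical helper.

```python
# Sketch: peak FP32 throughput = shaders x 2 FLOPs (one FMA) x clock.
# 2.7 GHz is the sustained boost clock assumed in the post, not a spec.

BOOST_GHZ = 2.7

def fp32_tflops(shaders: int, clock_ghz: float = BOOST_GHZ) -> float:
    """Peak FP32 TFLOPS assuming one FMA (2 FLOPs) per shader per clock."""
    return shaders * 2 * clock_ghz / 1000

# (new 50-series SP count, 40-series comparison SP count)
pairs = {
    "5050 vs 4060": (2560, 3072),
    "5060 vs 4060 Ti": (3840, 4352),
    "5060 Ti vs 4070": (4608, 5888),
}

for name, (new, old) in pairs.items():
    print(f"{name}: {fp32_tflops(new):.1f} vs {fp32_tflops(old):.1f} TFLOPS")
```

This reproduces the ~14/21/25 TF estimates. It gives 16.6 rather than 16.5 TF for the 4060, which is just rounding. Peak FLOPS ignores memory bandwidth, which is exactly where the post expects the 50 series to make up ground.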
 
  • If we use the 2.7GHz as a sustained boost clock this gives us:
    • 5050 ~14TF (vs 16.5 on 4060)

The 4060 already boosts to around 2.7 GHz. With fewer SMs, what would be the reason for increased TDP on the 5050?

These will all likely offer the same performance as the Ada parts, at a small discount, just like GB205.
 
The 4060 already boosts to around 2.7 GHz. With fewer SMs, what would be the reason for increased TDP on the 5050?
Dunno. But I doubt it will be able to outdo the 4060, as it's using the same G6 while having fewer SPs. At best it could be on par.
So my expectation would be 5050 at best on par with 4060; 5060 probably reaching and exceeding 4060Ti a bit; and 5060Ti being a bit below 4070 most likely.
 
If it's under 350€ it's going to be good. If it's 300€ it's exceptional.

It's going to be 400€.
5070 is 550 and there's 8 and 16GB 5060 Ti models below that.
My guess would be 400/450-500 for 5060 Tis which means 300 for 5060 and 350 for the possible 12GB model.
But this is USD and MSRPs, who knows what that will be in EUR.
 
5070 is 550 and there's 8 and 16GB 5060 Ti models below that.
My guess would be 400/450-500 for 5060 Tis which means 300 for 5060 and 350 for the possible 12GB model.
But this is USD and MSRPs, who knows what that will be in EUR.
NVIDIA follows the USD>EUR exchange rate at launch + VAT pretty closely. Retail prices are another matter entirely, of course.
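The rule of thumb above can be written as one formula: EUR MSRP ≈ USD MSRP × exchange rate × (1 + VAT). US MSRPs exclude sales tax, so VAT is added on top. Here is a minimal sketch. The 0.92 EUR/USD rate and 20% VAT below are illustrative placeholders, not official figures, and VAT differs between EU countries. `eur_msrp` is a hypothetical helper.

```python
# Sketch of the rule of thumb: EUR MSRP ~= USD MSRP x EUR/USD rate x (1 + VAT).
# The rate and VAT used below are hypothetical placeholders.

def eur_msrp(usd_msrp: float, eur_per_usd: float, vat_rate: float) -> float:
    """Estimate a EUR MSRP from a pre-tax USD MSRP."""
    return usd_msrp * eur_per_usd * (1 + vat_rate)

if __name__ == "__main__":
    # Hypothetical inputs: 0.92 EUR per USD and 20% VAT.
    for usd in (300, 350, 400, 550):
        print(f"${usd} -> ~{eur_msrp(usd, 0.92, 0.20):.0f} EUR")
```

Swap in the launch-day rate and the local VAT rate to get the official EUR price. As noted above, this says nothing about actual retail pricing.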
 
5070 is 550 and there's 8 and 16GB 5060 Ti models below that.
My guess would be 400/450-500 for 5060 Tis which means 300 for 5060 and 350 for the possible 12GB model.
But this is USD and MSRPs, who knows what that will be in EUR.
On Amazon here the cheapest 5070 is 789€. I almost forgot for a moment there that official prices don't mean anything.
 
Did anyone notice how Nvidia explicitly spelled out "Dense FP4" on their keynote slide about GB300 NVL72 being 50% faster than GB200? It turns out this is the only data format where GB300 is faster than its predecessor.
Everything I had found about it being faster didn't distinguish between data formats; the gain was attributed to, maybe, more power and higher clock speeds.

And could the double-rate attention instructions have something to do with it? Obviously, they don't affect throughput with sparsity. [image]
Sources:
Keynote (timestamped):
https://www.nvidia.com/en-us/data-center/gb300-nvl72/?nvid=nv-int-solr-857371-vt34
https://www.nvidia.com/en-us/data-center/gb200-nvl72/?nvid=nv-int-solr-857371-vt34
also here: https://www.nvidia.com/en-us/data-center/hgx/ (some typos here? INT8, FP64?)
 
Turns out, this is the only data format where GB300 is faster than its predecessor.
It's the same chip but with more and faster RAM, so all the improvements will be limited (if you could say that) to cases where more RAM is beneficial.
It was the same upgrade with Hopper btw.
 
It's the same chip but with more and faster RAM, so all the improvements will be limited (if you could say that) to cases where more RAM is beneficial.
It was the same upgrade with Hopper btw.
Hopper was not touted as having 50% more raw FLOPS, IIRC; the gains were only in certain LLM applications that did not fit in the smaller local memory of the first-generation Hopper parts.
If you remember otherwise, please provide a source.
 
That's really quite interesting. I guess they figured that, as Tensor Memory already needs to support 50% more read bandwidth for structured sparsity (the B tile is twice the size of the A tile), and FP4 multipliers are really cheap, they might as well cram in 50% more of them. I do hope it's implemented in a way that doesn't require a multiple of 3 for matrix tile dimensions.
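The "B tile is twice the size of the A tile" point can be checked with simple arithmetic. Here is a minimal sketch under illustrative assumptions, not NVIDIA documentation. A dense MMA reads an M×K tile of A and a K×N tile of B. With 2:4 structured sparsity on A, the compressed A tile takes the same storage but covers 2K of the reduction dimension, so B becomes 2K×N. Output tiles are square (M = N) and sparsity metadata is ignored. The tile sizes and `operand_elements` are hypothetical.

```python
# Sketch: operand elements read per MMA, dense vs 2:4 sparse on A.
# Assumptions (illustrative): square output tile (M = N), A is the sparse
# operand stored compressed, metadata reads ignored.

def operand_elements(m: int, n: int, k: int, sparse_a: bool) -> int:
    """Elements of A and B read per MMA instruction."""
    a_elems = m * k                      # stored A tile (compressed if sparse)
    k_eff = 2 * k if sparse_a else k     # reduction depth actually covered
    b_elems = k_eff * n                  # B must span the full effective K
    return a_elems + b_elems

if __name__ == "__main__":
    m = n = 128
    k = 64
    dense = operand_elements(m, n, k, sparse_a=False)
    sparse = operand_elements(m, n, k, sparse_a=True)
    print(f"sparse/dense operand reads: {sparse / dense:.2f}x")
```

With M = N the ratio comes out to exactly 1.5x. That is the 50% extra read bandwidth which, per the reasoning above, dense FP4 could reuse to feed 50% more multipliers.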
 