Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
And it doesn't make sense to use Fermi. Fermi was on 40nm. GT200 is a better comparison to Blackwell, since it stayed on the same 65nm node as G92.
GT200 was pretty good where it ended up being priced, although probably not great for NVIDIA. I think it was the largest GPU ever at the time, and I bought my GTX260 for less than $300. NV's margins were a bit different back then :)
 
CC 12.x (RTX Blackwell) has been added to the instruction throughput table.

Here are the instructions where RTX Blackwell has twice the throughput of Ada:
- 16-bit floating-point add, multiply, multiply-add
- 32-bit integer multiply, multiply-add, extended-precision multiply-add
- compare, minimum, maximum
- warp vote

Additionally, RTX Blackwell supports DPX instructions and 64-bit IADD and IMUL, which were not natively supported in previous gaming architectures.
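For reference, the DPX instructions mentioned above fuse the min/max-with-clamp patterns used by dynamic-programming kernels. As a rough illustration, here is what a CUDA intrinsic like `__vimax3_s32_relu` computes, sketched in plain Python; the intrinsic name comes from the CUDA math API, but everything else here is just an illustrative reference, not the hardware implementation:

```python
# Pure-Python reference for the kind of operation a DPX instruction fuses.
# In CUDA, __vimax3_s32_relu(a, b, c) returns max(a, b, c, 0): a three-way
# signed max followed by a ReLU clamp, done as one instruction on the GPU.

def vimax3_s32_relu(a: int, b: int, c: int) -> int:
    """Three-way max with ReLU clamp, as a DPX instruction would compute it."""
    return max(a, b, c, 0)

# Typical dynamic-programming inner step (e.g. Smith-Waterman scoring):
# pick the best of three candidate scores, clamped at zero.
score = vimax3_s32_relu(-5, 3, 7)
```

Without native support, each of these would be a chain of separate compare/select instructions, which is why a fused version matters for DP-heavy workloads.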
 
- 16-bit floating-point add, multiply, multiply-add
Doesn't seem to be the case in your link. And wouldn't make much sense.
Edit: No, you are right. I wonder what it means though. They've been running FP16 math on the tensor ALUs since Turing, I believe, which would mean that the throughput per SM went down in Lovelace.
 
I believe, which would mean that the throughput per SM went down in Lovelace.
From the RTX 30 series onward, FP16 performs at a 1:1 rate.
Information about this can be found in the NVIDIA Ampere architecture whitepaper.
[screenshot of the relevant table from the whitepaper]


For a real-life example, here are the GPGPU SiSoftware Sandra tests, where the 20 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.



Maybe performance of FP16 on tensor cores was limited and didn't always work?
 
For a real-life example, here are the GPGPU SiSoftware Sandra tests, where the 20 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.
According to the spreadsheet, only Lovelace has half-rate FP16, so Ampere should have the same rate as Turing and Blackwell. I wonder if it's just a mistake in the docs.
 
I wonder if it's just a mistake in the docs.
It should be remembered that Ampere increased the number of FP32 blocks per SM. And since the execution rate became 1:1, FP16 throughput per SM still increased compared to Turing:
- Turing at a 2:1 rate: 128 per SM
- Turing at a 1:1 rate: 64 per SM (?)
- Ampere/Ada at a 1:1 rate: 128 per SM, because the number of blocks increased compared to Turing. At a 2:1 rate it would be 256
- Blackwell at a 2:1 rate: 256 per SM
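The per-SM arithmetic above can be sketched numerically. The lane counts and rates below are taken straight from the list in this post (the Turing 1:1/64 line is the poster's own question mark), so treat them as the hypothesis being discussed, not confirmed figures:

```python
# FP16 throughput per SM per clock under the rates discussed above:
# ops/clk = FP32-capable lanes per SM x FP16:FP32 rate.
configs = {
    "Turing (2:1)":     (64, 2),   # 64 FP32 lanes, double-rate FP16
    "Turing (1:1?)":    (64, 1),   # the questioned alternative
    "Ampere/Ada (1:1)": (128, 1),  # doubled FP32 lanes, same-rate FP16
    "Blackwell (2:1)":  (128, 2),  # double-rate FP16 again
}

fp16_per_sm = {name: lanes * rate for name, (lanes, rate) in configs.items()}
for name, ops in fp16_per_sm.items():
    print(f"{name}: {ops} FP16 ops/clk per SM")
```

This reproduces the 128 / 64 / 128 / 256 figures in the list: even at a 1:1 rate, Ampere's doubled lane count keeps per-SM FP16 throughput at Turing's 2:1 level.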
 
It should be remembered that Ampere increased the number of FP32 blocks per SM. And since the execution rate became 1:1, FP16 throughput per SM still increased compared to Turing:
- Turing at a 2:1 rate: 128 per SM
- Turing at a 1:1 rate: 64 per SM (?)
- Ampere/Ada at a 1:1 rate: 128 per SM, because the number of blocks increased compared to Turing. At a 2:1 rate it would be 256
- Blackwell at a 2:1 rate: 256 per SM
That's not what the docs say though.
Turing is CC 7.5, and it has 64 FP32 and 128 FP16. This seems correct; Turing did have a 2x FP16 rate.
Ampere (CC 8.6) doubled the FP32 rate but not the FP16, so it should be 128/128, but the docs show 128/256, as if the FP16 rate also doubled.
Lovelace (CC 8.9) shouldn't have any rate changes compared to Ampere, but for some reason the docs show half-rate FP16: 128/128.
And Blackwell (CC 12.0) shows FP16 doubling again, back to 128/256.

Considering that Nvidia is doing FP16 math on the tensor ALUs, this could be a result of tensor h/w changes between architectures, but the suggested rate for Ampere seems wrong to me. Blackwell doubling FP16 over Lovelace doesn't make much sense either.
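The mismatch being debated can be made explicit by encoding the per-CC numbers quoted above and computing each generation's FP16:FP32 ratio. The figures are the ones read out of the CUDA docs table in this thread, not independently verified:

```python
# (FP32, FP16) results per clock per SM, as quoted from the CUDA docs table.
cc_table = {
    "CC 7.5 (Turing)":     (64, 128),
    "CC 8.6 (Ampere)":     (128, 256),  # the suspect entry: implies 2:1 FP16
    "CC 8.9 (Lovelace)":   (128, 128),
    "CC 12.0 (Blackwell)": (128, 256),
}

fp16_ratio = {cc: fp16 // fp32 for cc, (fp32, fp16) in cc_table.items()}
for cc, r in fp16_ratio.items():
    print(f"{cc}: FP16 at {r}:1 relative to FP32")
```

Laid out this way, the oddity is visible at a glance: the ratio goes 2:1, 2:1, 1:1, 2:1 across generations, with Ampere and Lovelace disagreeing despite supposedly identical FP pipelines.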

Finally, the two FMA ALUs on each scheduler have symmetric FP/INT throughput now.
There's only one scheduler (each SIMD takes at least 2 clocks to run an instruction), and the INT capabilities are likely different, since the rates for "32-bit integer add, extended-precision add, subtract, extended-precision subtract", "32-bit integer shift", and "32-bit integer bit reverse" haven't changed.
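The "2 clocks per instruction" point falls out of warp width versus ALU width: a 32-thread warp issued to a 16-lane SIMD occupies it for two clocks, so one scheduler issuing an instruction per clock can keep two such pipes fed. A sketch of that arithmetic, with the lane widths as illustrative assumptions:

```python
# Clocks a 32-thread warp occupies a SIMD pipe of a given lane width.
WARP_SIZE = 32

def issue_cycles(alu_lanes: int) -> int:
    """Clocks one warp instruction occupies an ALU pipe of this width."""
    return WARP_SIZE // alu_lanes

# A 16-lane pipe takes 2 clocks per warp, so a single scheduler issuing
# one instruction per clock can alternate between two 16-lane pipes.
cycles_16 = issue_cycles(16)
pipes_one_scheduler_can_feed = cycles_16
```

A full-width 32-lane pipe (`issue_cycles(32) == 1`) would instead need the scheduler all to itself every clock.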
 
CC 12.x (RTX Blackwell) has been added to the instruction throughput table.

Here are the instructions where RTX Blackwell has twice the throughput of Ada:
- 16-bit floating-point add, multiply, multiply-add
- 32-bit integer multiply, multiply-add, extended-precision multiply-add
- compare, minimum, maximum
- warp vote

Additionally, RTX Blackwell supports DPX instructions and 64-bit IADD and IMUL, which were not natively supported in previous gaming architectures.

The table has been updated again: Blackwell's 32-bit IMUL/IMAD is now listed as identical to Ada, not double.
So the claim in the architecture whitepaper that INT32 throughput is twice that of Ada can be considered false.😠

Edit:
I may have missed it, or it may have been updated again, but 32-bit IADD and ISUB throughput is double that of Ada.
 
This is presumably GB207. Interesting that they've decided to keep a 128-bit bus for it. 3GB GDDR7 must be really expensive.
It says it's GDDR6. 3GB GDDR7 is probably a lot more expensive than 2GB GDDR6.

Edit: there are a few inaccuracies in the article. They claim the higher-tier models use GDDR6X (they don't) and that the 3050 had a larger memory bus (it didn't).
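For context on the bus-width tradeoff, here is the standard peak-bandwidth arithmetic. The per-pin data rates below are illustrative GDDR6/GDDR7 speeds chosen for the example, not confirmed specs for GB207:

```python
# Peak memory bandwidth in GB/s: (bus width in bits / 8 bits per byte)
# multiplied by the per-pin data rate in Gbps.
def bandwidth_gb_s(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin

# Assumed per-pin rates, for illustration only.
gddr6_128bit = bandwidth_gb_s(128, 18.0)   # 288 GB/s
gddr7_128bit = bandwidth_gb_s(128, 28.0)   # 448 GB/s
```

This is why keeping a 128-bit bus is less painful with GDDR7: the faster signaling recovers much of what a wider GDDR6 bus would have provided, provided the denser 3GB modules are affordable.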
 