Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
  • Start date
And it doesn't make sense to use Fermi. Fermi was on 40 nm. GT200 is a better comparison to Blackwell, since it was on the same 65 nm node as G92b.
GT200 was pretty good where it ended up being priced, although probably not great for NVIDIA. I think it was the largest GPU ever at the time, and I bought my GTX260 for less than $300. NV's margins were a bit different back then :)
 
CC 12.x (RTX Blackwell) has been added to the instruction throughput table.

Here are the instructions that RTX Blackwell has twice the throughput of Ada:
- 16-bit floating-point add, multiply, multiply-add
- 32-bit integer multiply, multiply-add, extended-precision multiply-add
- compare, minimum, maximum
- warp vote

Additionally, RTX Blackwell supports DPX instructions and 64-bit IADD and IMUL, which were not natively supported in previous gaming architectures.
 
- 16-bit floating-point add, multiply, multiply-add
Doesn't seem to be the case in your link. And wouldn't make much sense.
Edit: No, you are right. I wonder what it means, though. They've been running FP16 math on the tensor ALUs since Turing, I believe, which would mean that throughput per SM has gone down in Lovelace.
 
I believe, which would mean that throughput per SM has gone down in Lovelace.
From the RTX 30 series on, FP16 performs at a 1:1 rate. Information about this can be found in the NVIDIA Ampere architecture whitepaper.
For a real-life example, here are the SiSoft Sandra GPGPU tests, where the RTX 2000 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.



Maybe FP16 performance on the tensor cores was limited and didn't always apply?
 
For a real-life example, here are the SiSoft Sandra GPGPU tests, where the RTX 2000 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.
According to the spreadsheet, only Lovelace has half-rate FP16, so Ampere should have the same rate as Turing and Blackwell. I wonder if it's just a mistake in the docs.
 
I wonder if it's just a mistake in docs.
It should be remembered that Ampere increased the number of FP32 units per SM. So even though the rate became 1:1, FP16 throughput per SM still increased compared to Turing:
- Turing at a 2:1 rate: 128 per SM
- Turing at a 1:1 rate: 64 per SM (?)
- Ampere/Ada at a 1:1 rate: 128 per SM, because the number of units increased compared to Turing; at a 2:1 rate it would be 256
- Blackwell at a 2:1 rate: 256 per SM
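The per-SM figures above can be cross-checked with a quick arithmetic sketch. This is just a back-of-the-envelope model assuming the FP32-unit counts and FP16:FP32 rate ratios discussed in this thread, not numbers pulled from NVIDIA documentation:

```python
# Rough model: peak FP16 instructions per SM per clock,
# assuming FP32 units per SM and the FP16:FP32 rate ratio
# discussed above (these inputs are thread speculation).

def fp16_per_sm(fp32_units: int, fp16_rate: float) -> int:
    """Peak FP16 instruction issue per SM per clock = units x rate."""
    return int(fp32_units * fp16_rate)

archs = {
    # name: (assumed FP32 units per SM, assumed FP16:FP32 rate)
    "Turing (2:1)":     (64, 2.0),
    "Ampere/Ada (1:1)": (128, 1.0),
    "Blackwell (2:1)":  (128, 2.0),
}

for name, (units, rate) in archs.items():
    print(f"{name:17s} -> {fp16_per_sm(units, rate)} FP16/clk per SM")
# Turing and Ampere/Ada both land at 128, Blackwell at 256,
# which matches the list: the 1:1 rate on Ampere/Ada was
# masked by the doubled FP32 unit count.
```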
 