Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
  • Start date
And it doesn't make sense to use Fermi. Fermi was on 40 nm. GT200 is a better comparison to Blackwell, since it was on the same 65 nm node as G92b.
GT200 was pretty good where it ended up being priced, although probably not great for NVIDIA. I think it was the largest GPU ever at the time, and I bought my GTX260 for less than $300. NV's margins were a bit different back then :)
 
CC 12.x (RTX Blackwell) has been added to the instruction throughput table.

Here are the instructions that RTX Blackwell has twice the throughput of Ada:
- 16-bit floating-point add, multiply, multiply-add
- 32-bit integer multiply, multiply-add, extended-precision multiply-add
- compare, minimum, maximum
- warp vote

Additionally, RTX Blackwell supports DPX instructions and 64-bit IADD and IMUL, which were not natively supported in previous gaming architectures.
 
- 16-bit floating-point add, multiply, multiply-add
Doesn't seem to be the case in your link. And wouldn't make much sense.
Edit: No, you are right. I wonder what it means, though. They've been running FP16 math on the tensor ALUs since Turing, I believe, which would mean that throughput per SM has gone down in Lovelace.
 
I believe, which would mean that throughput per SM has gone down in Lovelace.
From the RTX 30 series on, FP16 performs at a 1:1 rate. Information about this can be found in the NVIDIA Ampere architecture whitepaper.
For a real-life example, here are the SiSoft Sandra GPGPU tests, where the RTX 2000 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.



Maybe FP16 performance on the tensor cores was limited and didn't always apply?
 
For a real-life example, here are the SiSoft Sandra GPGPU tests, where the RTX 2000 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.
According to the spreadsheet, only Lovelace has half-rate FP16, so Ampere should have the same rate as Turing and Blackwell. I wonder if it's just a mistake in the docs.
 
I wonder if it's just a mistake in docs.
It should be remembered that Ampere increased the number of FP32 units per SM. So even though the rate became 1:1, FP16 throughput per SM still increased compared to Turing:
- Turing at a 2:1 rate: 128 per SM
- Turing at a 1:1 rate: 64 per SM (?)
- Ampere/Ada at a 1:1 rate: 128 per SM, because the number of units increased compared to Turing; at a 2:1 rate it would be 256
- Blackwell at a 2:1 rate: 256 per SM
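The per-SM figures above can be cross-checked with a quick arithmetic sketch. This is just a back-of-the-envelope model assuming the FP32-unit counts and FP16:FP32 rate ratios discussed in this thread, not numbers pulled from NVIDIA documentation:

```python
# Rough model: peak FP16 instructions per SM per clock,
# assuming FP32 units per SM and the FP16:FP32 rate ratio
# discussed above (these inputs are thread speculation).

def fp16_per_sm(fp32_units: int, fp16_rate: float) -> int:
    """Peak FP16 instruction issue per SM per clock = units x rate."""
    return int(fp32_units * fp16_rate)

archs = {
    # name: (assumed FP32 units per SM, assumed FP16:FP32 rate)
    "Turing (2:1)":     (64, 2.0),
    "Ampere/Ada (1:1)": (128, 1.0),
    "Blackwell (2:1)":  (128, 2.0),
}

for name, (units, rate) in archs.items():
    print(f"{name:17s} -> {fp16_per_sm(units, rate)} FP16/clk per SM")
# Turing and Ampere/Ada both land at 128, Blackwell at 256,
# which matches the list: the 1:1 rate on Ampere/Ada was
# masked by the doubled FP32 unit count.
```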
 