Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
And it doesn't make sense to use Fermi. Fermi was on 40nm. GT200 is a better comparison to Blackwell, since it stayed on the same 65nm node as G92.
GT200 was pretty good where it ended up being priced, although probably not great for NVIDIA. I think it was the largest GPU ever at the time, and I bought my GTX260 for less than $300. NV's margins were a bit different back then :)
 
CC 12.x (RTX Blackwell) has been added to the instruction throughput table.

Here are the instructions where RTX Blackwell has twice the throughput of Ada:
- 16-bit floating-point add, multiply, multiply-add
- 32-bit integer multiply, multiply-add, extended-precision multiply-add
- compare, minimum, maximum
- warp vote

Additionally, RTX Blackwell supports DPX instructions and 64-bit IADD and IMUL, which were not natively supported in previous gaming architectures.
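For reference, the DPX instructions mentioned above fuse the min/max-with-clamp patterns used by dynamic-programming kernels. As a rough illustration, here is what a CUDA intrinsic like `__vimax3_s32_relu` computes, sketched in plain Python; the intrinsic name comes from the CUDA math API, but everything else here is just an illustrative reference, not the hardware implementation:

```python
# Pure-Python reference for the kind of operation a DPX instruction fuses.
# In CUDA, __vimax3_s32_relu(a, b, c) returns max(a, b, c, 0): a three-way
# signed max followed by a ReLU clamp, done as one instruction on the GPU.

def vimax3_s32_relu(a: int, b: int, c: int) -> int:
    """Three-way max with ReLU clamp, as a DPX instruction would compute it."""
    return max(a, b, c, 0)

# Typical dynamic-programming inner step (e.g. Smith-Waterman scoring):
# pick the best of three candidate scores, clamped at zero.
score = vimax3_s32_relu(-5, 3, 7)
```

Without native support, each of these would be a chain of separate compare/select instructions, which is why a fused version matters for DP-heavy workloads.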
 
- 16-bit floating-point add, multiply, multiply-add
Doesn't seem to be the case in your link. And wouldn't make much sense.
Edit: No, you are right. I wonder what it means though. They've been running FP16 math on the tensor ALUs since Turing, I believe, which would mean that the throughput per SM went down in Lovelace.
 
I believe, which would mean that the throughput per SM went down in Lovelace.
From the RTX 30 series onward, FP16 performs at a 1:1 rate.
Information about this can be found in the NVIDIA Ampere architecture whitepaper.
[screenshot of the relevant table from the whitepaper]


For a real-life example, here are the GPGPU SiSoftware Sandra tests, where the 20 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.



Maybe performance of FP16 on tensor cores was limited and didn't always work?
 
For a real-life example, here are the GPGPU SiSoftware Sandra tests, where the 20 series demonstrates a 2x uplift with half precision, unlike the 30 and 40 series.
According to the spreadsheet, only Lovelace has half-rate FP16, so Ampere should have the same rate as Turing and Blackwell. I wonder if it's just a mistake in the docs.
 
I wonder if it's just a mistake in the docs.
It should be remembered that Ampere increased the number of FP32 blocks per SM. And since the execution rate became 1:1, FP16 throughput per SM still increased compared to Turing:
- Turing at a 2:1 rate: 128 per SM
- Turing at a 1:1 rate: 64 per SM (?)
- Ampere/Ada at a 1:1 rate: 128 per SM, because the number of blocks increased compared to Turing. At a 2:1 rate it would be 256
- Blackwell at a 2:1 rate: 256 per SM
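The per-SM arithmetic above can be sketched numerically. The lane counts and rates below are taken straight from the list in this post (the Turing 1:1/64 line is the poster's own question mark), so treat them as the hypothesis being discussed, not confirmed figures:

```python
# FP16 throughput per SM per clock under the rates discussed above:
# ops/clk = FP32-capable lanes per SM x FP16:FP32 rate.
configs = {
    "Turing (2:1)":     (64, 2),   # 64 FP32 lanes, double-rate FP16
    "Turing (1:1?)":    (64, 1),   # the questioned alternative
    "Ampere/Ada (1:1)": (128, 1),  # doubled FP32 lanes, same-rate FP16
    "Blackwell (2:1)":  (128, 2),  # double-rate FP16 again
}

fp16_per_sm = {name: lanes * rate for name, (lanes, rate) in configs.items()}
for name, ops in fp16_per_sm.items():
    print(f"{name}: {ops} FP16 ops/clk per SM")
```

This reproduces the 128 / 64 / 128 / 256 figures in the list: even at a 1:1 rate, Ampere's doubled lane count keeps per-SM FP16 throughput at Turing's 2:1 level.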
 
It should be remembered that Ampere increased the number of FP32 blocks per SM. And since the execution rate became 1:1, FP16 throughput per SM still increased compared to Turing:
- Turing at a 2:1 rate: 128 per SM
- Turing at a 1:1 rate: 64 per SM (?)
- Ampere/Ada at a 1:1 rate: 128 per SM, because the number of blocks increased compared to Turing. At a 2:1 rate it would be 256
- Blackwell at a 2:1 rate: 256 per SM
That's not what the docs say though.
Turing is CC 7.5, and it has 64 FP32 and 128 FP16. This seems correct; Turing did have a 2x FP16 rate.
Ampere (CC 8.6) doubled the FP32 rate but not the FP16, so it should be 128/128, but the docs show 128/256, as if the FP16 rate also doubled.
Lovelace (CC 8.9) shouldn't have any rate changes compared to Ampere, but for some reason the docs show half-rate FP16: 128/128.
And Blackwell (CC 12.0) shows FP16 doubling again, back to 128/256.

Considering that Nvidia is doing FP16 math on the tensor ALUs, this could be a result of tensor h/w changes between architectures, but the suggested rate for Ampere seems wrong to me. Blackwell doubling FP16 over Lovelace doesn't make much sense either.
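The mismatch being debated can be made explicit by encoding the per-CC numbers quoted above and computing each generation's FP16:FP32 ratio. The figures are the ones read out of the CUDA docs table in this thread, not independently verified:

```python
# (FP32, FP16) results per clock per SM, as quoted from the CUDA docs table.
cc_table = {
    "CC 7.5 (Turing)":     (64, 128),
    "CC 8.6 (Ampere)":     (128, 256),  # the suspect entry: implies 2:1 FP16
    "CC 8.9 (Lovelace)":   (128, 128),
    "CC 12.0 (Blackwell)": (128, 256),
}

fp16_ratio = {cc: fp16 // fp32 for cc, (fp32, fp16) in cc_table.items()}
for cc, r in fp16_ratio.items():
    print(f"{cc}: FP16 at {r}:1 relative to FP32")
```

Laid out this way, the oddity is visible at a glance: the ratio goes 2:1, 2:1, 1:1, 2:1 across generations, with Ampere and Lovelace disagreeing despite supposedly identical FP pipelines.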

Finally, the two FMA ALUs on each scheduler have symmetric FP/INT throughput now.
There's only one scheduler (each SIMD takes at least 2 clocks to run an instruction), and the INT capabilities are likely different, since the rates for "32-bit integer add, extended-precision add, subtract, extended-precision subtract", "32-bit integer shift", and "32-bit integer bit reverse" haven't changed.
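The "2 clocks per instruction" point falls out of warp width versus ALU width: a 32-thread warp issued to a 16-lane SIMD occupies it for two clocks, so one scheduler issuing an instruction per clock can keep two such pipes fed. A sketch of that arithmetic, with the lane widths as illustrative assumptions:

```python
# Clocks a 32-thread warp occupies a SIMD pipe of a given lane width.
WARP_SIZE = 32

def issue_cycles(alu_lanes: int) -> int:
    """Clocks one warp instruction occupies an ALU pipe of this width."""
    return WARP_SIZE // alu_lanes

# A 16-lane pipe takes 2 clocks per warp, so a single scheduler issuing
# one instruction per clock can alternate between two 16-lane pipes.
cycles_16 = issue_cycles(16)
pipes_one_scheduler_can_feed = cycles_16
```

A full-width 32-lane pipe (`issue_cycles(32) == 1`) would instead need the scheduler all to itself every clock.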
 
CC 12.x (RTX Blackwell) has been added to the instruction throughput table.

Here are the instructions where RTX Blackwell has twice the throughput of Ada:
- 16-bit floating-point add, multiply, multiply-add
- 32-bit integer multiply, multiply-add, extended-precision multiply-add
- compare, minimum, maximum
- warp vote

Additionally, RTX Blackwell supports DPX instructions and 64-bit IADD and IMUL, which were not natively supported in previous gaming architectures.

The table has been updated again: Blackwell's 32-bit IMUL/IMAD is now listed as identical to Ada, not double.
So the claim in the architecture whitepaper that INT32 throughput is twice that of Ada can be considered false.😠

Edit:
I may have missed it, or it may have been updated again, but 32-bit IADD and ISUB throughput is double that of Ada.
 
This is presumably GB207. Interesting that they've decided to keep a 128-bit bus for it. 3GB GDDR7 must be really expensive.
It says it's GDDR6. 3GB GDDR7 is probably a lot more expensive than 2GB GDDR6.

Edit: there are a few inaccuracies in the article. They claim the higher-tier models use GDDR6X (they don't) and that the 3050 had a larger memory bus (it didn't).
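For context on the bus-width tradeoff, here is the standard peak-bandwidth arithmetic. The per-pin data rates below are illustrative GDDR6/GDDR7 speeds chosen for the example, not confirmed specs for GB207:

```python
# Peak memory bandwidth in GB/s: (bus width in bits / 8 bits per byte)
# multiplied by the per-pin data rate in Gbps.
def bandwidth_gb_s(bus_bits: int, gbps_per_pin: float) -> float:
    return bus_bits / 8 * gbps_per_pin

# Assumed per-pin rates, for illustration only.
gddr6_128bit = bandwidth_gb_s(128, 18.0)   # 288 GB/s
gddr7_128bit = bandwidth_gb_s(128, 28.0)   # 448 GB/s
```

This is why keeping a 128-bit bus is less painful with GDDR7: the faster signaling recovers much of what a wider GDDR6 bus would have provided, provided the denser 3GB modules are affordable.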
 