Nvidia Blackwell Architecture Speculation

DegustatoR · Jan 24, 2025

Boss said:
has a different set of artifacts and problems that are very clearly visible.

I certainly wouldn't say that they are "very clearly visible".

Flappy Pannus · Jan 24, 2025

Charlietus said:
Is this a serious outlet? That is so childish

Never, never look at the replies to their articles.

Boss · Jan 24, 2025

DegustatoR said:
I certainly wouldn't say that they are "very clearly visible".

That's fair. I feel they're very easy to spot once you know where to look, but in moment-to-moment gameplay I can see how they would be easy to miss or ignore.

Scott_Arm · Jan 24, 2025

Here's a look at power target testing on 5090. I haven't finished watching it yet.

It looks like there are some efficiency improvements in Blackwell. If you power limit a 5090 to about the same power as a 4090, it's still much faster. That said, I think you can tweak the 4090 in much the same way and lose only a few % of performance with a lower power limit. It seems like these cards ship well beyond the best point on their efficiency curve, as if they're heavily overclocked out of the box.

vola · Jan 24, 2025

Nvidia claims that the Blackwell SM has the same INT32 throughput as FP32, but here they still mention only 64 INT32 cores per SM, which is half of FP32

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-12-0

Ext3h · Jan 25, 2025

vola said:
Nvidia claims that the Blackwell SM has the same INT32 throughput as FP32, but here they still mention only 64 INT32 cores per SM, which is half of FP32

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-12-0

Didn't Nvidia only claim that INT32 / FP32 are now both dual-issue? (And in return mixed INT32/FP32+INT32 dual issue was killed off.)

Not a word about the per-SM throughput limits or throughput per core. So if FP32 has twice the latency of INT32, then it's still perfectly balanced with a 2:1 FP32 to INT32 core ratio.

Don't read too much into the CUDA programming guide, it's just telling you "don't bother refactoring your code for fewer sequential dependencies if your thread group has already hit 128 FP32 / 64 INT32 ops without dependencies".

Edit: Dual-issue per SM, not per each of the 4 warp schedulers. So this has still reduced the latency until all FP32 units even spin up from 4 to 2 cycles. The catch is that a mixed FP32/INT32 workload now has additional dispatch latency.

Dictator · Jan 25, 2025

Boss said:
The new model while improved has a different set of artifacts and problems that are very clearly visible. Also there is an increased performance cost on the 4080 super.

Super Resolution or Ray Reconstruction? RR is considered "final version" atm, while Super Resolution is labeled as "beta."

Charlietus · Jan 25, 2025

The new transformer model is really, really good. Nvidia is not resting on their laurels, that's for sure. The pressure inside AMD, Intel and all the others must be enormous, wouldn't want to be in their position.

DegustatoR · Jan 25, 2025

Ext3h said:
Didn't Nvidia only claim that INT32 / FP32 are now both dual-issue? (And in return mixed INT32/FP32 dual issue was killed off.)

There never was any "dual issue", INTs and FPs are issued once per clock to one of two available SIMDs in a SM partition.
In Volta/Turing this was a static split into 1 FP SIMD16 and 1 INT SIMD16.
In Ampere/Lovelace it was upgraded to a more dynamic split into 1 FP SIMD16 and 1 FP or INT SIMD16.
Now in Blackwell it's symmetrical - 2 FP/INT SIMD16. Both can run either FP or INT instructions now in each clock.
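
To make the three layouts above easier to compare, here is a rough Python sketch that just counts lanes. It is a toy model of the description in this post, not of how the hardware or the CUDA programming guide counts cores. It assumes 4 partitions per SM and SIMD16 pipes; the names `Partition`, `peak_lanes` and `per_sm` are made up for illustration.

```python
from dataclasses import dataclass

FP = "FP32"
INT = "INT32"

LANES_PER_PIPE = 16      # SIMD16, as described in the post
PARTITIONS_PER_SM = 4    # assumption: 4 partitions per SM


@dataclass(frozen=True)
class Partition:
    """One SM partition, modelled as a tuple of pipes.

    Each pipe is the set of instruction types it accepts.
    """
    name: str
    pipes: tuple

    def peak_lanes(self, kind: str) -> int:
        # Peak lanes per clock if every pipe that accepts `kind` is busy with it.
        return LANES_PER_PIPE * sum(1 for pipe in self.pipes if kind in pipe)


# Layouts as described in the post (hypothetical encoding).
VOLTA_TURING = Partition("Volta/Turing", (frozenset({FP}), frozenset({INT})))
AMPERE_LOVELACE = Partition("Ampere/Lovelace", (frozenset({FP}), frozenset({FP, INT})))
BLACKWELL = Partition("Blackwell", (frozenset({FP, INT}), frozenset({FP, INT})))


def per_sm(partition: Partition, kind: str) -> int:
    """Peak lanes per clock per SM for one instruction type."""
    return PARTITIONS_PER_SM * partition.peak_lanes(kind)


if __name__ == "__main__":
    for part in (VOLTA_TURING, AMPERE_LOVELACE, BLACKWELL):
        print(f"{part.name}: FP32 {per_sm(part, FP)}, INT32 {per_sm(part, INT)} per SM")
```

Under these assumptions the symmetrical layout comes out at 128 lanes per SM for either type, which is not the same thing as the 64 INT32 cores per SM listed in the programming guide table linked earlier in the thread.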

Ext3h · Jan 25, 2025

DegustatoR said:
Now in Blackwell it's symmetrical - 2 FP/INT SIMD16. Both can run either FP or INT instructions now in each clock.

Wasn't the limitation added that both must be now of the same type? So 2FP or 2INT, but the mixture supported since Volta is now gone.

DegustatoR · Jan 25, 2025

Ext3h said:
Wasn't the limitation added that both must be now of the same type? So 2FP or 2INT, but the mixture supported since Volta is now gone.

Don't see why such limitation would be needed. Scheduling still happens each clock on one of two SIMDs. If it was possible to run FP+INT previously why wouldn't it be now?

RobertR1 · Jan 25, 2025

CUDA Toolkit 12.1 Downloads

Get the latest feature updates to NVIDIA's proprietary compute stack.

developer.nvidia.com

Just install the drivers only.

Old drivers using TNN. PT + FG.

NEW drivers using TNN. PT + FG.

NEW drivers with CNN. PT + FG.

For the first time, I can't distinguish a clarity difference between Quality and Native. I don't need to type more than that.

In the demo sequence there are two distinct areas where the TNN model really pays off, and both show issues you'll often run into throughout the game.

1. When you come out of the back door and turn the corner, on the ground is the first puddle with a reflection of a power line and post. The fizzle and smearing in that section with the CNN model has always been awful. TNN takes care of it to the point it's better than native.

2. The palm trees at the end, the edges on the small bark elements of the main trunk have notable shimmering with the CNN model. This is also stabilized.

I also found the default sharpness of 0.10 to be too much for TNN and much preferred sharpness at 0 fwiw.

Scott_Arm · Jan 25, 2025

I am a little worried my 30-series won't be able to handle the TNN because 30-series has much much lower tensor performance.

Dampf · Jan 25, 2025

Scott_Arm said:
I am a little worried my 30-series won't be able to handle the TNN because 30-series has much much lower tensor performance.

You will be more than fine. My old RTX 2060 laptop drops from 88 to 70 FPS in CP2077 in performance mode, and that's with the old drivers, on one of the weakest RTX cards.
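
FPS drops are easier to judge as frame time. A quick Python sketch of the arithmetic for the 88 to 70 FPS figures above, assuming the whole drop comes from the new model (the function names are just for illustration):

```python
def frame_time_ms(fps: float) -> float:
    """Milliseconds spent on one frame at a given frame rate."""
    return 1000.0 / fps


def added_cost_ms(fps_before: float, fps_after: float) -> float:
    """Extra time per frame after the frame rate drops."""
    return frame_time_ms(fps_after) - frame_time_ms(fps_before)


def fps_drop_percent(fps_before: float, fps_after: float) -> float:
    """Relative frame rate loss in percent."""
    return 100.0 * (fps_before - fps_after) / fps_before


if __name__ == "__main__":
    before, after = 88.0, 70.0  # figures quoted in the post
    print(f"added cost: {added_cost_ms(before, after):.2f} ms per frame")
    print(f"fps drop:   {fps_drop_percent(before, after):.1f} %")
```

That works out to roughly 2.9 ms of extra frame time, or about a 20% lower frame rate, on that laptop.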

fellix · Jan 25, 2025

Source:

DavidGraham · Jan 26, 2025

Plague Tale DLSS 2.4 Quality vs DLSS 4 Performance. Giant improvement in quality despite lower resolution and ~10 more fps in 1440p. New version still struggles with tiny lines such as fishing line though.

https://www.reddit.com/r/nvidia/comments/1i9y1gk/plague_tale_dlss_24_quality_vs_dlss_4_performance

homerdog · Jan 26, 2025

fellix said:
View attachment 12955

Source:

Is this to scale?

fellix · Jan 26, 2025

homerdog said:
Is this to scale?

Not really. You can look here for proper scaled sizes:

https://flic.kr/p/2nv8YoE

RobertR1 · Jan 26, 2025

From Reddit

DavidGraham · Jan 26, 2025

Native is dead, DLSS TNN Performance mode looks better than native.
