Nvidia Blackwell Architecture Speculation

  • Thread starter Deleted member 2197
Here's a look at power target testing on 5090. I haven't finished watching it yet.


It looks like there are some efficiency improvements in Blackwell. If you power limit a 5090 to about the same power draw as a 4090, it's still much faster. Although I think you can tweak the 4090 in much the same way and only lose a few percent of performance with a lower power limit. Seems like these cards ship well beyond the best point on their efficiency curve, like they're heavily overclocked out of the box.
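If you want to find the best point on the efficiency curve for your own card, a minimal sketch like the one below works: log average FPS at a few power limits and compare FPS per watt. The function names and the sample numbers are placeholders made up for illustration, not measurements from the video or from any real card.

```python
def perf_per_watt(samples):
    """Turn (power_limit_watts, avg_fps) pairs into (watts, fps, fps_per_watt) rows."""
    return [(watts, fps, fps / watts) for watts, fps in samples]


def most_efficient(samples):
    """Return the row with the highest FPS per watt."""
    return max(perf_per_watt(samples), key=lambda row: row[2])


# PLACEHOLDER values only. Replace with your own logged results.
placeholder_samples = [(300, 90.0), (400, 100.0), (500, 104.0)]

if __name__ == "__main__":
    for watts, fps, efficiency in perf_per_watt(placeholder_samples):
        print(f"{watts} W: {fps:.1f} FPS, {efficiency:.3f} FPS/W")
    print("most efficient power limit:", most_efficient(placeholder_samples)[0], "W")
```

The shape you'd expect from the post above is that FPS per watt keeps falling as the power limit goes up, which is what "shipping beyond the best point" means in practice.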

 
Nvidia claims that the Blackwell SM has the same INT32 throughput as FP32, but here they still mention only 64 INT32 cores per SM, which is half the FP32 count.

Didn't Nvidia only claim that INT32 / FP32 are now both dual-issue? (And in return mixed INT32/FP32+INT32 dual issue was killed off.)

Not a word about the per-SM throughput limits or throughput per core. So if FP32 has twice the latency of INT32, then it's still perfectly balanced with a 2:1 FP32 to INT32 core ratio.

Don't read too much into the CUDA programming guide, it's just telling you "don't bother refactoring your code for fewer sequential dependencies if your thread group has already hit 128 FP32 / 64 INT32 ops without dependencies".

Edit: Dual-issue is per SM, not per each of the 4 warp schedulers. So this has still reduced the latency until all FP32 units are even spun up from 4 to 2 cycles, with the catch that a mixed FP32/INT32 workload now has additional dispatch latency.
 
The new transformer model is really, really good. Nvidia is not resting on their laurels, that's for sure. The pressure inside AMD, Intel and all the others must be enormous, wouldn't want to be in their position.
 
Didn't Nvidia only claim that INT32 / FP32 are now both dual-issue? (And in return mixed INT32/FP32 dual issue was killed off.)
There never was any "dual issue". INT and FP instructions are issued once per clock to one of two available SIMDs in an SM partition.
In Volta/Turing this was a static split into 1 FP SIMD16 and 1 INT SIMD16.
In Ampere/Lovelace it was upgraded to a more dynamic split into 1 FP SIMD16 and 1 FP or INT SIMD16.
Now in Blackwell it's symmetrical: 2 FP/INT SIMD16. Both can run either FP or INT instructions in each clock.
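To make that concrete, here's a toy model of one SM partition as described in this post: one instruction issued per clock, two SIMD16 pipes, and different rules for what each pipe accepts. It is only a sketch of the description above, not of confirmed hardware behavior. Two things are assumptions on my part: a 32-thread warp occupies a 16-wide pipe for 2 clocks, and mixed FP+INT issue is still allowed on Blackwell (which is exactly the point being argued in this thread). All names are made up for the example.

```python
WARP_SIZE = 32
SIMD_WIDTH = 16
# Assumption: a 32-thread warp instruction keeps a 16-lane pipe busy for 2 clocks.
CLOCKS_PER_WARP_OP = WARP_SIZE // SIMD_WIDTH

# Which instruction types each of the two pipes accepts, per the post above.
PIPES = {
    "volta_turing": [{"FP"}, {"INT"}],
    "ampere_lovelace": [{"FP"}, {"FP", "INT"}],
    "blackwell": [{"FP", "INT"}, {"FP", "INT"}],
}


def clocks_to_finish(arch, stream):
    """In-order issue, at most one instruction per clock, to a free matching pipe."""
    pipes = PIPES[arch]
    free_at = [0] * len(pipes)  # clock at which each pipe becomes free again
    clock = 0
    for op in stream:
        while True:
            ready = [i for i, accepted in enumerate(pipes)
                     if op in accepted and free_at[i] <= clock]
            if ready:
                # Prefer the less flexible pipe so the flexible one stays available.
                pipe = min(ready, key=lambda i: len(pipes[i]))
                free_at[pipe] = clock + CLOCKS_PER_WARP_OP
                clock += 1  # the issue slot for this clock is used
                break
            clock += 1  # stall: no matching pipe is free this clock
    return max(free_at)


if __name__ == "__main__":
    streams = {
        "FP only": ["FP"] * 8,
        "INT only": ["INT"] * 8,
        "alternating": ["FP", "INT"] * 4,
    }
    for name, stream in streams.items():
        for arch in PIPES:
            print(f"{name:12s} {arch:16s} {clocks_to_finish(arch, stream)} clocks")
```

In this model, 8 FP-only warp instructions take 16 clocks on the Volta/Turing layout and 9 on the other two, 8 INT-only instructions take 9 clocks only on the Blackwell layout, and an alternating FP/INT stream takes 9 clocks everywhere. If Blackwell really did drop mixed issue, the last case is the one that would change.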
 
Wasn't a limitation added that both must now be of the same type? So 2 FP or 2 INT, but the mixture supported since Volta is now gone.
Don't see why such a limitation would be needed. Scheduling still happens each clock on one of two SIMDs. If it was possible to run FP+INT previously, why wouldn't it be now?
 

Just install the drivers only.

Old drivers using TNN. PT + FG.
image.png


NEW drivers using TNN. PT + FG.
image.png


NEW drivers with CNN. PT + FG.
image.png



For the first time, I can't distinguish a clarity difference between Quality and Native. I don't need to type more than that.

In the demo sequence there are two distinct areas where the TNN model really pays off, and both are things you'll run into often throughout the game.

1. When you come out of the back door and turn the corner, on the ground is the first puddle with a reflection of a power line and post. The fizzle and smearing in that section with the CNN model has always been awful. TNN takes care of it to the point it's better than native.

2. The palm trees at the end: the edges of the small bark elements on the main trunk shimmer noticeably with the CNN model. This is also stabilized with TNN.

I also found the default sharpness of 0.10 to be too much for TNN and much preferred sharpness at 0 fwiw.
 
I am a little worried my 30-series won't be able to handle the TNN because 30-series has much much lower tensor performance.
You will be more than fine. My old RTX 2060 laptop drops from 88 to 70 FPS in CP2077 on performance mode, and that's with the old drivers. And it's one of the weakest RTX cards.
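FPS drops are easier to compare across cards as frame time. A rough sketch, assuming the 88 to 70 FPS drop is entirely the extra cost of the model and that this cost is a fixed number of milliseconds per frame (a simplification, since it also depends on resolution and on how well the work overlaps with the rest of the frame). The function names are made up for the example.

```python
def frame_time_ms(fps):
    """Frame time in milliseconds for a given frame rate."""
    return 1000.0 / fps


def added_cost_ms(fps_before, fps_after):
    """Extra milliseconds per frame implied by a drop in frame rate."""
    return frame_time_ms(fps_after) - frame_time_ms(fps_before)


def fps_with_cost(fps_before, cost_ms):
    """Frame rate after adding a fixed per-frame cost (simplifying assumption)."""
    return 1000.0 / (frame_time_ms(fps_before) + cost_ms)


if __name__ == "__main__":
    cost = added_cost_ms(88, 70)
    print(f"implied cost: {cost:.2f} ms per frame")
    # Same fixed cost applied to a lower starting frame rate.
    print(f"60 FPS would become about {fps_with_cost(60, cost):.1f} FPS")
```

On the 2060 numbers above that works out to roughly 2.9 ms per frame. The same fixed cost takes a smaller percentage off a lower starting frame rate and a bigger one off a higher starting frame rate.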
 