NVidia Ada Speculation, Rumours and Discussion

Am on a 3900X/2080 Ti too. And in all honesty, there's zero reason to upgrade other than even higher fidelity and performance. The most logical upgrade would indeed be a 4090, but then a new CPU would be needed too so as not to bottleneck the thing too much. And yeah, while Ryzen 7000 is impressive performance-wise, I think Raptor Lake will stomp all over it.
The 5800X3D is the perfect gaming CPU indeed; it's like the thing is made for gaming.
I'm on a 2080 Ti/9900K gaming at 3440x1440/120Hz. While it's enough for 99% of the games out there, it can fall short in games with heavy RT such as Cyberpunk, and even in some demanding games without RT, it can still dip below the 60fps mark at times. Can't say I'm not tempted by something that is ostensibly 2.5x the power of my card. Seeing my fps jump from 60 to 120+ would be something. The 4090 would murder everything at 3440x1440 and I don't think any game without massive RT would make it come close to dropping below 60fps even at max settings.

Though like you, I might need to upgrade my CPU as well.
 
Look at the whitepaper for Ada: nvidia-ada-gpu-architecture.pdf

Why is the 4090 worse than the 4080 at some of the tensor operations?

                                              4090            4080 16GB
Peak FP8 Tensor TFLOPS with FP16 Accumulate   660.6/1321.2    389.9/779.8
Peak FP8 Tensor TFLOPS with FP32 Accumulate   660.6/1321.2    389.9/779.8
Peak FP16 Tensor TFLOPS with FP16 Accumulate  330.3/660.6     194.9/389.8
Peak FP16 Tensor TFLOPS with FP32 Accumulate  165.2/330.4     194.9/389.8
Peak BF16 Tensor TFLOPS with FP32 Accumulate  165.2/330.4     194.9/389.8
Peak TF32 Tensor TFLOPS                       82.6/165.2      97.5/195
Peak INT8 Tensor TOPS                         660.6/1321.2    389.9/779.8
Peak INT4 Tensor TOPS                         1321.2/2642.4   779.8/1559.6
(figures are dense / with sparsity)
 
I'm on a 2080 Ti/9900K gaming at 3440x1440/120Hz. While it's enough for 99% of the games out there, it can fall short in games with heavy RT such as Cyberpunk, and even in some demanding games without RT, it can still dip below the 60fps mark at times. Can't say I'm not tempted by something that is ostensibly 2.5x the power of my card. Seeing my fps jump from 60 to 120+ would be something.

Imagine my temptation coming from a 1070! The 2080 Ti is already easily twice as fast as my GPU, so I'd be looking at more like 5x the raw performance before DLSS! And since DLSS 3 can comfortably give 3x additional performance vs native, I'd be looking at about 15x the real-world performance in supported games for a seemingly minor visual degradation. I'm gaming at 3840x1600 on a 144Hz monitor, so I expect the only thing stopping me hitting that limit in most scenarios will be my 3700X. TBH though, as long as I can lock in 60fps as a minimum in pretty much all games (which should be mostly doable), I'm fine with that. I'd actually be quite happy using the 4090's extra performance on DLDSR or DLAA where available, as even 3840x1600 seems a bit low to me in some games.

I wonder if I could run DLDSR at 2.25x at the same time as running DLSS Quality mode? On a 1440p screen, for example, that would be like telling the game to run at 4K internally and scale that down to 1440p, but with DLSS Quality that 4K internal res would actually be 1440p internal scaled up to 4K by DLSS, and then back down to 1440p by DLDSR. I can imagine that would give some pretty great image quality results with a cost similar to just running at the screen's native resolution.
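A quick sketch of that resolution chain, just arithmetic with the commonly cited scale factors (assumed here: DLDSR 2.25x multiplies total pixels, i.e. 1.5x per axis, and DLSS Quality renders at roughly 0.667x per axis):

```python
# Hypothetical resolution chain for DLDSR 2.25x + DLSS Quality on a 1440p panel.
# Scale factors are assumptions based on the commonly cited values, not measured.

native = (2560, 1440)

dldsr_axis_scale = 2.25 ** 0.5      # 2.25x total pixels -> 1.5x per axis
dlss_quality_axis_scale = 1 / 1.5   # DLSS Quality: ~0.667x per axis

# DLDSR presents the game with a higher "native" resolution...
dldsr_target = tuple(round(d * dldsr_axis_scale) for d in native)            # (3840, 2160)

# ...and DLSS Quality then renders internally below that target.
internal = tuple(round(d * dlss_quality_axis_scale) for d in dldsr_target)   # (2560, 1440)

print("internal render:", internal)      # back at ~native pixel cost
print("DLSS output:    ", dldsr_target)  # AI-upscaled to 4K
print("displayed:      ", native)        # DLDSR downsamples to the panel
```

If those factors hold, the shading cost lands right around native 1440p, with the 4K-and-back round trip being where the image-quality win (and the DLSS/DLDSR overhead) would come from.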
 
Likely an error in the 4080's FP16-with-FP32-accumulate, BF16-with-FP32-accumulate, and TF32 numbers, which should be halved. It does say:

The GeForce RTX 4090 offers double the throughput for existing FP16, BF16, TF32, and INT8 formats

However, they don't specify changes in ops/cycle like in previous whitepapers, and the 4090's numbers are all roughly double the 3090 Ti's, so I'm leaning towards that.
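As a rough sanity check, the peak tensor figures should just be SM count × per-SM tensor rate × boost clock. A minimal sketch, assuming the publicly listed 128 SMs / 2.52 GHz (4090) and 76 SMs / 2.505 GHz (4080 16GB), and 512 FP16 tensor FMAs per SM per clock with FP16 accumulate (halved with FP32 accumulate on GeForce parts):

```python
# Back-of-the-envelope peak tensor throughput. SM counts, clocks, and per-SM
# rates are assumptions taken from public spec listings -- the point is just
# to see which table entries are self-consistent.

def tensor_tflops(sms, boost_ghz, fmas_per_sm_clk=512, fp32_accumulate=False):
    rate = fmas_per_sm_clk // (2 if fp32_accumulate else 1)
    return sms * rate * 2 * boost_ghz / 1e3   # FMA = 2 FLOPs; dense TFLOPS

print(tensor_tflops(128, 2.52))                        # ~330.3 (4090, FP16 acc)
print(tensor_tflops(128, 2.52, fp32_accumulate=True))  # ~165.2 (4090, FP32 acc)
print(tensor_tflops(76, 2.505))                        # ~194.9 (4080 16GB, FP16 acc)
print(tensor_tflops(76, 2.505, fp32_accumulate=True))  # ~97.5  (half the v1.0 table entry)
```

That last line is what makes the 4080 16GB's "194.9 with FP32 accumulate" look like a copy-paste of the FP16-accumulate figure.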
 
[attached slide: screenshot2022-10-022nifhf.png]


One thing I didn't notice before in this slide is that the three games are basically 3 different scenarios:
  1. CPU bound
  2. GPU bound
  3. GPU bound + RT
The uplift you get from DLSS 3.0 seems to be constant between the 3 different SKUs (matching the whitepaper numbers), which means the 4080 12/16 GB (and the 3080 Ti) were entirely CPU bound running Flight Sim, so they would perform exactly the same with DLSS 3.0. Not sure what to make of Darktide; could it be the 4080 12GB runs slower than the 3080 Ti without DLSS 3.0? I guess that's what reviews are for :p

Cyberpunk is likely a combination of the improved RT cores + SER + DLSS 3.0, so the ideal case for Ada.
 


One thing I didn't notice in this slide is that the three games are basically 3 different scenarios:
  1. CPU bound
  2. GPU bound
  3. GPU bound + RT
The uplift you get from DLSS 3.0 seems to be constant between the 3 different SKUs (matching the whitepaper numbers), which means the 4080 12/16 GB (and the 3080 Ti) were entirely CPU bound running Flight Sim, so they would perform exactly the same with DLSS 3.0. Not sure what to make of Darktide; could it be the 4080 12GB runs slower than the 3080 Ti without DLSS 3.0? I guess that's what reviews are for :p

Cyberpunk is likely a combination of the improved RT cores + SER + DLSS 3.0, so the ideal case for Ada.
Nvidia is hiding raster performance 🤔 ... the gains must not be impressive, or?
 
Nvidia is hiding raster performance 🤔 ... the gains must not be impressive, or?

I don't know about hiding. The performance slides during the launch announcement had the RTX 4080 12G slower than the 3090 Ti in all 3 "raster" games.

Realistically there is basically a 2x memory bandwidth difference, with performance comparisons being done at 4K "max settings", which is going to be challenging for the cache. Based on current information there is nothing to suggest per-TFLOP gains in raster gaming at the SM level for Ada without specific developer intervention. Unless there is some "secret sauce" they are hiding, it's likely that in practice the 4080 12G will have a per-TFLOP perf deficit against GA102 configs.

In terms of the overall architecture this is a bit more tricky. Setting aside pricing (since that involves a lot more market-driven factors), the reality is that AD102 and AD104 sit much further apart (~100% gap in hardware resources) than GA102 and GA104 do (~50% gap).
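To put rough numbers on the bandwidth point, here's a quick FLOP-per-byte comparison using public spec-sheet figures (assumed here: both parts around ~40 TFLOPS FP32, 3090 Ti at 21 Gbps on a 384-bit bus, 4080 12GB at 21 Gbps on a 192-bit bus):

```python
# Arithmetic-intensity sketch: same rough FP32 throughput, half the DRAM bandwidth.
# The spec numbers are assumptions from public listings, not measurements.

cards = {
    "RTX 3090 Ti":   {"tflops": 40.0, "bandwidth_gbs": 1008},  # 384-bit @ 21 Gbps
    "RTX 4080 12GB": {"tflops": 40.1, "bandwidth_gbs": 504},   # 192-bit @ 21 Gbps
}

for name, c in cards.items():
    flops_per_byte = c["tflops"] * 1e12 / (c["bandwidth_gbs"] * 1e9)
    print(f"{name}: ~{flops_per_byte:.0f} FLOPs per byte of DRAM bandwidth")
```

The 4080 12GB has to cover roughly twice the arithmetic intensity out of its L2, which is exactly what 4K "max settings" comparisons stress.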
 
I don't know about hiding. The performance slides during the launch announcement had the RTX 4080 12G slower than the 3090 Ti in all 3 "raster" games.

Realistically there is basically a 2x memory bandwidth difference, with performance comparisons being done at 4K "max settings", which is going to be challenging for the cache. Based on current information there is nothing to suggest per-TFLOP gains in raster gaming at the SM level for Ada without specific developer intervention. Unless there is some "secret sauce" they are hiding, it's likely that in practice the 4080 12G will have a per-TFLOP perf deficit against GA102 configs.

In terms of the overall architecture this is a bit more tricky. Setting aside pricing (since that involves a lot more market-driven factors), the reality is that AD102 and AD104 sit much further apart (~100% gap in hardware resources) than GA102 and GA104 do (~50% gap).
Maybe the L2 cache hit rate at 4K is low, like in RDNA 2?
 
The RTX 4080 12GB is basically a 3090 Ti with fewer units and higher clocks. Only the bandwidth is 1/2 of the 3090 Ti's:
3090 Ti:
FP16/32: 40 TFLOPS
Pixel fillrate: 208.3 Gpixel/s
Texel fillrate: 625 Gtexel/s
Rasterizing: 1,300 Mtriangles/s

4080 12GB:
FP16/32: 40 TFLOPS
Pixel fillrate: 208.8 Gpixel/s
Texel fillrate: 626 Gtexel/s
Rasterizing: 1,300 Mtriangles/s

RTX 4090 (vs. 4080 12GB):
FP16/32: 82.6 TFLOPS (2.06x)
Pixel fillrate: 443.5 Gpixel/s (2.12x)
Texel fillrate: 1290 Gtexel/s (2.06x)
Rasterizing: 2,755 Mtriangles/s (2.12x)

Isn't it more interesting that the 4090 doesn't scale up better? The 4080 12GB looks fine from a raw-performance standpoint, but the 4090 should be twice as fast because it has nearly twice of everything.
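For reference, those headline figures are just units × boost clock; a minimal sketch reproducing them with the commonly listed (assumed) unit counts and clocks:

```python
# Peak-rate sketch: units x boost clock, using publicly listed (assumed) unit
# counts and clocks, just to show where the figures above come from.

def peak_rates(fp32_lanes, rops, tmus, boost_ghz):
    return {
        "FP32 TFLOPS": fp32_lanes * 2 * boost_ghz / 1e3,  # FMA = 2 FLOPs
        "Gpixel/s":    rops * boost_ghz,
        "Gtexel/s":    tmus * boost_ghz,
    }

print(peak_rates(10752, 112, 336, 1.86))  # 3090 Ti   -> ~40.0 / 208.3 / 625
print(peak_rates(7680,  80,  240, 2.61))  # 4080 12GB -> ~40.1 / 208.8 / 626
print(peak_rates(16384, 176, 512, 2.52))  # 4090      -> ~82.6 / 443.5 / 1290
```

So on paper the 4090 really does bring ~2.1x of everything; whether games can actually feed that is the open question.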
 
But the 4090 should be twice as fast because it has nearly twice of everything.
Except the power limit, so maybe it boosts to much lower clocks comparatively. But clocks don't scale linearly with power/voltage, so maybe it's not that. Maybe the 4090 is just too big to be fed properly in most games. There are probably some workloads where the raw compute power of the 4090 will show its face; hopefully reviewers won't cover only games.
 
Yes, the power limit is only 60% higher on the 4090, so the performance improvement should be between 60% and 70%, because the 4080 12GB will boost higher.

Nvidia has shown that the 4090 FE can boost up to 2820 MHz in Cyberpunk with RT at 1440p: https://wccftech.com/nvidia-geforce...k-2077-dlss-3-cuts-gpu-wattage-by-25-percent/

That is 12.7% higher than the boost clock, so efficiency will go down. I can only assume that the 4080 12GB will clock to 3 GHz within 285W...
 
Look at the whitepaper for Ada: nvidia-ada-gpu-architecture.pdf

Why is the 4090 worse than the 4080 at some of the tensor operations?

                                              4090            4080 16GB
Peak FP8 Tensor TFLOPS with FP16 Accumulate   660.6/1321.2    389.9/779.8
Peak FP8 Tensor TFLOPS with FP32 Accumulate   660.6/1321.2    389.9/779.8
Peak FP16 Tensor TFLOPS with FP16 Accumulate  330.3/660.6     194.9/389.8
Peak FP16 Tensor TFLOPS with FP32 Accumulate  165.2/330.4     194.9/389.8
Peak BF16 Tensor TFLOPS with FP32 Accumulate  165.2/330.4     194.9/389.8
Peak TF32 Tensor TFLOPS                       82.6/165.2      97.5/195
Peak INT8 Tensor TOPS                         660.6/1321.2    389.9/779.8
Peak INT4 Tensor TOPS                         1321.2/2642.4   779.8/1559.6
(figures are dense / with sparsity)

Whitepaper updated to v1.01.

The only changes are that the RTX 4080 16G and 12G numbers in question are now 1/2 of what they were before, and in line with the 4090.
 
Another interesting comparison is the 4080 12GB vs. the 3070 Ti:
3070 Ti:
FP16/32: 22 TFLOPS
Pixel fillrate: 169.92 Gpixel/s
Texel fillrate: 339.84 Gtexel/s
Rasterizing: 1,062 Mtriangles/s

The 4080 12GB has twice the compute and texel rate but only 22% higher pixel and rasterizing throughput. In AC: Valhalla the 3090 Ti is only up to 40% faster in 4K:

nVidia published "numbers" for Valhalla: https://www.nvidia.com/en-us/geforce/news/rtx-40-series-community-qa/

So, there is something limiting nVidia GPUs in rasterizing games and it is not really tied to bandwidth.
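A quick ratio check on the figures above (4080 12GB vs. 3070 Ti, using the numbers as quoted in this thread):

```python
# Ratio check: the 4080 12GB roughly doubles compute and texturing over the
# 3070 Ti, while pixel and triangle throughput only grow by ~22%.
# Figures are the ones quoted above, not independently verified.

gpu_3070ti  = {"tflops": 22.0, "gpixel_s": 169.92, "gtexel_s": 339.84, "mtri_s": 1062}
gpu_4080_12 = {"tflops": 40.1, "gpixel_s": 208.8,  "gtexel_s": 626.4,  "mtri_s": 1300}

for key in gpu_3070ti:
    print(f"{key}: {gpu_4080_12[key] / gpu_3070ti[key]:.2f}x")
# tflops ~1.82x, gtexel_s ~1.84x, but gpixel_s and mtri_s only ~1.22x
```

If raster-heavy games track the front end rather than FLOPs, that ~1.22x is the number to watch.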
 
I really want to know where all of the transistor budget went. A100 has 54 billion transistors and 19.5 TF of FP32 compute; fast forward to H100 and it has 80 billion transistors and more than triple the amount of FP32 compute, 67 TF. Yet Ada barely doubled FP32 despite spending close to 3x the transistor count.


A100: 54 billion transistors, 19.5 TF of FP32 compute
H100: 80 billion transistors, 67TF of FP32 compute

Ampere: 28 billion transistors, 40TF FP32
Ada: 76 billion transistors, ~90TF FP32

Something doesn't add up.
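Putting the same figures side by side as FP32 throughput per transistor (just the numbers quoted above):

```python
# TFLOPS per billion transistors, using only the figures quoted in this post.
chips = {
    "A100 (GA100)":   (19.5, 54),
    "H100 (GH100)":   (67.0, 80),
    "GA102 (Ampere)": (40.0, 28),
    "AD102 (Ada)":    (90.0, 76),
}
for name, (tflops, billions) in chips.items():
    print(f"{name}: {tflops / billions:.2f} TFLOPS per billion transistors")
# ~0.36, ~0.84, ~1.43, ~1.18 -- Ada's FP32-per-transistor actually drops vs GA102.
```

So the budget presumably went somewhere other than FP32 lanes; the much larger L2 and the beefed-up RT/tensor blocks are the obvious suspects, but the public breakdown doesn't say.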
 
I really want to know where all of the transistor budget went. A100 has 54 billion transistors and 19.5 TF of FP32 compute; fast forward to H100 and it has 80 billion transistors and more than triple the amount of FP32 compute, 67 TF. Yet Ada barely doubled FP32 despite spending close to 3x the transistor count.


A100: 54 billion transistors, 19.5 TF of FP32 compute
H100: 80 billion transistors, 67 TF of FP32 compute

Ampere: 28 billion transistors, 40TF FP32
Ada: 76 billion transistors, ~90TF FP32

Something doesn't add up.
Cache is still rather opaque in the spec lists so far.
 