Nvidia Post-Volta (Ampere?) Rumor and Speculation Thread

For the same reason they included the INT unit in the first place, I suppose (it's more efficient?). This move (if it's real at all and not fake) would simply fix the FP:INT ratio, because right now in Turing there's nothing to switch to, nothing to schedule onto the INT pipe in 64% of cases, since there are supposedly only 36 INT instructions per 100 FP instructions.
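To spell out that 64% figure (a quick sketch, assuming the roughly 36 INT per 100 FP instruction mix Nvidia cites for Turing):

```latex
% INT-pipe idle fraction, given ~36 INT ops per 100 FP ops
\frac{36}{100} = 0.36 \quad\Rightarrow\quad 1 - 0.36 = 0.64
```

That is, in about 64% of the issue slots where the FP pipe has work, there is no INT instruction available for the INT pipe.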

Per Nvidia's docs, one of the benefits of the INT pipe is to initiate data loads for the next iteration of a loop while the FP pipe is processing the current iteration. Initiating data loads as early as possible likely leads to more efficient use of bandwidth across the memory hierarchy, especially for tight loops with a high INT:FP ratio (a sketch of the pattern is below).
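A minimal CUDA sketch of that kind of loop (illustrative only, not taken from any Nvidia doc): the integer index and stride arithmetic for the next iteration is independent of the current iteration's FP math, so the scheduler is free to issue it to the INT pipe while the FP pipe is busy with the FMA.

```cuda
// Hypothetical grid-stride AXPY kernel; names are illustrative.
__global__ void axpy(float a, const float* __restrict__ x,
                     float* __restrict__ y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT pipe: index math
    int stride = gridDim.x * blockDim.x;            // INT pipe: stride math
    for (; i < n; i += stride) {                    // INT pipe: compare + add for next iteration
        y[i] = a * x[i] + y[i];                     // FP pipe: FMA on current iteration
    }
}
```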

As you pointed out, that's less helpful in the other cases where the FP:INT ratio is high: half the available register bandwidth goes to waste, since there is no execution unit available to issue to on every other clock.

The pipeline latency on Volta/Turing also dropped to 4 cycles from 6 on Pascal. Not sure if that was also due to splitting out the INT pipe.
 
I simply cannot believe that Nvidia would go back to a Kepler-style design while at the same time keeping the register file at 64 KiB per processing block (quarter-SM). That would only increase contention, forcing more data movement in and out of registers, increasing power, and blocking execution of more complex workloads. I therefore call it fake.
 
Ah, now I get it. Yeah, that would be interesting and a relatively cheap way to increase FP throughput. It does raise the question, though, of why go through all that trouble instead of just using Pascal-style 32-wide combined INT+FP units.
What if it's actually 16 INT + 16 INT/FP32 + 16 FP32?
 
Take that Navi die, put it at 16nm and come back and ask the same question.

The 2070 also packs a whole lot more functional units (Tensor cores, ray tracing cores, INT32 units, etc.), which means it does more at the same transistor budget as Navi, and on an older process.

TU106 gets beaten by Navi 10 in games, by 20% in some instances. Navi 10 also beats the 2070 SUPER (cut-down TU104) in many instances...

Turing can't compete with RDNA; how will it compete with RDNA2?
 
Navi 10 doesn't beat TU106/2070 either: https://www.techpowerup.com/review/amd-radeon-rx-5700-xt/28.html

What it does do, however, is lack all of Turing's new features and draw 30% more power, despite enjoying a whole node advantage over TU106.

It is in fact RDNA1 that can't compete with Turing, and this is precisely why AMD is selling these cards at a discount against the corresponding Turing parts.

But whatever. I feel that this is a pointless discussion.
 
Two unknown new NVIDIA GPUs:

7552 CUDA cores (118 CUs), 1.11 GHz core clock, GB5 Compute score: 184096 (OpenCL)
6912 CUDA cores (108 CUs), 1.01 GHz core clock, GB5 Compute score: 141654 (OpenCL)

The results are from Oct 2019, so these are probably engineering samples, which could explain the low clocks.
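As a back-of-the-envelope check (assuming the standard 2 FLOPs per CUDA core per clock for FMA), those figures would put peak FP32 throughput at roughly:

```latex
% Peak FP32 = cores \times clock \times 2 (FMA)
7552 \times 1.11~\text{GHz} \times 2 \approx 16.8~\text{TFLOPS} \\
6912 \times 1.01~\text{GHz} \times 2 \approx 14.0~\text{TFLOPS}
```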

For some context:
GV100: 142837 (OpenCL)
Tesla V100: 154606 (OpenCL)
Titan RTX: 132804 (OpenCL)

 
A new datacenter part, as no consumer product would have that amount of memory.
 
Yes, though that's obvious from the focus on compute loads. But anyhow, NVIDIA's datacenter GPUs are usually strongly related to the consumer ones.
 
Titan RTX scores 128509
118 CU part scores 184096

https://browser.geekbench.com/opencl-benchmarks

The 118 CU part is 43% faster than the Titan RTX while running at much lower clocks (1100 MHz). Assuming perfect clock scaling to 1800 MHz, the 118 CU part could beat the Titan RTX by 2.34x in this workload!

If we take imperfect clock scaling into account, we can still safely expect it to be at least 70% faster than the Titan RTX in gaming workloads, or maybe more?
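Spelling out that scaling arithmetic (taking the two scores above, and assuming clock-proportional scaling):

```latex
\frac{184096}{128509} \approx 1.43, \qquad
1.43 \times \frac{1800~\text{MHz}}{1100~\text{MHz}} \approx 2.34
```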
 
This big chip is for HPC only. My guess is it's not getting RT cores. Also, the gaming chip will not be 850 mm². And I don't think this chip is going to 1800 MHz at this size; it will stay at 1.4-1.6 GHz maximum.
 
Gaming chips shed FP64 units, which reduces die size; they can also get leaner by cutting the Tensor core count or internal caches/registers. They could also swap HBM for GDDR6, repeating the situation of GP100 (~610 mm²) vs GP102 (471 mm²).
 
It's a kiss-the-reticle chip, so it had better be 70% faster or else.
Don't forget NVLink and all that other bulky analog jazz that eats area.
 
You mean the image which spread like wildfire despite having impossible specs and caused SK Hynix to actually make a press release about it being fake?
No, I mean the other rumor that was spread after that.

Apart from that: what amount of memory do you think the next Xbox and PlayStation will have? Most likely they will not stay at 8 GB, and for the high-end desktop you need something more than "just what the consoles have" in order to cater to the target audience, aka the PC Gaming Master Race. Otherwise, they'd feel diminished.
 