Speculation: GPU Performance Comparisons of 2020 *Spawn*

And how is that any different from Ampere, which requires disproportionate amounts of a specific type of load to flex its muscle?
Ampere doesn't require you to move the geometry frontend onto compute to reach its peak compute throughput in modern games. You can spend this compute doing RT or something else; Ampere's frontend is fine.

That's not games; it's not "acting like a 30 TF GPU" in games, which was the whole point of the debacle.
I'll repeat again: it does act as a 30TF GPU in games. Just not all of them since not all modern games are limited by FP32 math - even on Turing.
And what is your baseline for comparisons here? Turing? It has a different math execution pipeline. Do you count Turing's math capabilities into its "TFLOPS" number? Because they sure aren't there by default.
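To make the separate-INT-pipeline point concrete, here is a rough back-of-the-envelope sketch. The instruction mix of 36 INT32 ops per 100 FP32 ops and the assumption of perfect scheduling are illustrative only, not measured figures for any particular game.

```python
# Back-of-the-envelope sketch: how Turing's dedicated INT32 pipe changes the
# comparison baseline. The 36:100 INT:FP mix and ideal scheduling are assumptions.

FP_OPS = 100.0   # FP32 ops in a hypothetical shader workload
INT_OPS = 36.0   # INT32 ops in the same workload (assumed mix)

# Turing SM: 64 FP32 lanes + 64 dedicated INT32 lanes running concurrently,
# so the slower of the two streams sets the pace.
turing_cycles = max(FP_OPS / 64, INT_OPS / 64)

# Ampere SM: 64 FP32-only lanes + 64 lanes that run either FP32 or INT32.
# As long as the INT work fits on the shared half (INT_OPS <= FP_OPS here),
# the whole mix is spread across 128 lanes.
ampere_cycles = max((FP_OPS + INT_OPS) / 128, INT_OPS / 64)

print(f"Per-SM, per-clock speedup: {turing_cycles / ampere_cycles:.2f}x")
# ~1.47x for this mix, even though the peak FP32 ("TFLOPS") figure doubled.
```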
 
I think the opinion that NV TFLOPS are a lie will end when AMD comes out with their 25+ TF GPUs. I'm sure they will.
I think people tend to underestimate the knowledge which NV and AMD put into designing their respective architectures.
There's a reason why Ampere got double FP32 throughput and it's not solely "it was cheap to add".
A lot of next generation workloads will hit FP32 math specifically hard - both RT related shading and mesh shaders for example will run on FP32 and we haven't even scratched the surface of these workloads this gen.

Also, I'm patiently waiting for Navi 21 here, which will apparently scale linearly in all resolutions and all games according to its TFLOPS change compared to Navi 10.
 
If you backed up your arguments about Ampere's perfection with some facts it would be nice.
Who's saying anything about "perfection"? Ampere obviously isn't perfect, but its issues aren't in the FLOPS utilization domain.

Check Mafia Definitive Edition benchmarks for example: https://www.pcgameshardware.de/Mafi...751/Specials/Benchmark-Test-Review-1358331/2/
The 3080 there is 94% faster than the 2070S in 4K.
But we can just ignore such results of course and continue asking for facts.
 
And how is that any different from Ampere, which requires disproportionate amounts of a specific type of load to flex its muscle? That's not games; it's not "acting like a 30 TF GPU" in games, which was the whole point of the debacle.

Did GCN ultimately scale better in more shader bound games? I thought the problem with GCN was that it was unable to fully utilize its shader hardware even in shader heavy games.

Look at the 5700 XT vs Vega 64, for example. Vega 10's specs trump Navi 10's on almost all fronts except clock speed, yet it loses to Navi 10 by 15-20%. GCN clearly had a utilization problem somewhere.
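As a rough illustration of that utilization gap (boost-spec clocks are assumed here for both cards; real average clocks differ, as discussed further down the thread, and the 15-20% figure is taken from the post above):

```python
# Rough perf-per-TFLOP comparison using boost-spec clocks (illustrative only).

vega64_tf = 4096 * 2 * 1.546 / 1000   # Vega 64: 4096 ALUs @ ~1.55 GHz -> ~12.7 TF
navi10_tf = 2560 * 2 * 1.905 / 1000   # 5700 XT: 2560 ALUs @ ~1.91 GHz -> ~9.8 TF

navi_rel_perf = 1 / 0.85              # Navi 10 ends up ~18% faster ("loses by 15-20%")

print(f"Vega 64 paper-TFLOPS advantage:    {vega64_tf / navi10_tf:.2f}x")
print(f"Navi 10 perf-per-TFLOP advantage:  {navi_rel_perf * vega64_tf / navi10_tf:.2f}x")
```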
 
Ampere doesn't require you to move the geometry frontend onto compute to reach its peak compute throughput in modern games. You can spend this compute doing RT or something else; Ampere's frontend is fine.

I'll repeat again: it does act as a 30TF GPU in games. Just not all of them since not all modern games are limited by FP32 math - even on Turing.
And what is your baseline for comparisons here? Turing? It has a different math execution pipeline. Do you count Turing's math capabilities into its "TFLOPS" number? Because they sure aren't there by default.
"Just not all of them" - seriously? Even NVIDIA can't find a game where it would even double the performance of RTX 2080S, let alone get anywhere near the 1.7x increase there's theoretically available.

Who's saying anything about "perfection"? Ampere obviously isn't perfect, but its issues aren't in the FLOPS utilization domain.

Check Mafia Definitive Edition benchmarks for example: https://www.pcgameshardware.de/Mafi...751/Specials/Benchmark-Test-Review-1358331/2/
The 3080 there is 94% faster than the 2070S in 4K.
But we can just ignore such results of course and continue asking for facts.
Ignore what? That the "30 TF GPU" is almost double the performance of a 9 TF GPU? At this rate of lowering the comparison point we'll be comparing it to the original GeForce by the year's end.
 
Ignore what? That the "30 TF GPU" is almost double the performance of a 9 TF GPU? At this rate of lowering the comparison point we'll be comparing it to the original GeForce by the year's end.
Let me pose a question here: is there an actual performance problem that has been identified, or is it a marketing problem?

A performance problem would indicate an obvious bottleneck.
A marketing problem would indicate that perhaps things don't scale as marketed, i.e. that they shouldn't have called it 30 TF.
 
"Just not all of them" - seriously? Even NVIDIA can't find a game where it would even double the performance of RTX 2080S, let alone get anywhere near the 1.7x increase there's theoretically available.
This is starting to sound like a broken record really.
Should I repeat for the nth time that modern games are rarely FP32 math limited?
And that you have to account for the separate INT pipeline in Turing in your comparisons of the 2080S to Ampere?

Ignore what? That the "30 TF GPU" is almost double the performance of a 9 TF GPU? At this rate of lowering the comparison point we'll be comparing it to the original GeForce by the year's end.
Yeah, that. Ignore precisely what I've just said for the nth time above, again.

i.e. that they shouldn't have called it 30 TF
Why not? It's 30 tflops. How exactly is it not?
 
Let me pose a question here: is there an actual performance problem that has been identified, or is it a marketing problem?

A performance problem would indicate an obvious bottleneck.
A marketing problem would indicate that perhaps things don't scale as marketed, i.e. that they shouldn't have called it 30 TF.

I think it could be both. Does it perform how we expect it to with the known specifications? As compared to how one expects it to perform based on what the marketing materials say it should do?
 
Why not? It's 30 tflops. How exactly is it not?
I agree that it is. In this case the 2000 series was under-marketed, because you don't count INT32 in the FLOPS. But it's clear it had an effect: had those been FP32 units doing the same work instead of INT32, the FLOP count would have been higher but performance would have been the same for identical workloads.
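For reference, a minimal sketch of the arithmetic behind the two marketed numbers (boost-spec clocks; the combined FP32+INT32 figure is illustrative only, since marketed TFLOPS never counted the INT32 pipe):

```python
# Peak-throughput arithmetic: 2 ops per lane per clock (fused multiply-add).

def tera_ops(lanes, clock_ghz):
    return lanes * 2 * clock_ghz / 1000

rtx3080_fp32  = tera_ops(8704, 1.71)           # ~29.8 -> the "30 TF" figure
rtx2080s_fp32 = tera_ops(3072, 1.815)          # ~11.2 TF as marketed
rtx2080s_all  = tera_ops(3072 + 3072, 1.815)   # ~22.3 "T-ops" if INT32 were counted

print(f"RTX 3080 FP32:          {rtx3080_fp32:.1f} TFLOPS")
print(f"RTX 2080S FP32:         {rtx2080s_fp32:.1f} TFLOPS")
print(f"RTX 2080S FP32 + INT32: {rtx2080s_all:.1f} T-ops (not FLOPS)")
```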
 
NVIDIA did market it. It was called RTX-Ops.
Right, forgot about this one: the combination of RT, INT32, Tensor, and FP32.

Did they provide RTX Ops for Ampere? Because that would seem to be a more apt comparison instead of comparing straight teraflops.

edit: no, I searched through the whitepaper and could not ctrl+f RTX-Ops. Seems to be a Turing special.

From the NVIDIA whitepaper here:
https://www.nvidia.com/content/dam/...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf

ROP Optimizations
In previous NVIDIA GPUs, the ROPs were tied to the memory controller and L2 cache. Beginning with GA10x GPUs, the ROPs are now part of the GPC, which boosts performance of raster operations by increasing the total number of ROPs, and eliminating throughput mismatches between the scan conversion frontend and raster operations backend. With seven GPCs and 16 ROP units per GPC, the full GA102 GPU consists of 112 ROPs instead of the 96 ROPS that were previously available in a 384-bit memory interface GPU like the prior generation TU102. This improves multisample anti-aliasing, pixel fillrate, and blending performance.

I mean, it does sort of showcase that the 2x FP32 was not matched on the front end, which only moved up 16%. The bandwidth was increased as well, but once again not nearly enough to match the 2x FP32. So if you use teraflops as your yardstick for performance (2x TF should result in 2x benchmark results), there just isn't enough bandwidth or ROP throughput to pair properly with that number.

I would think the numbers that really need to be looked at are not benchmarks of older titles, but benchmarks of newer titles going forward.
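To put rough numbers on that mismatch (the ROP counts come from the quoted whitepaper passage; the 2x FP32 is per SM; the bandwidth figures are the Titan RTX vs RTX 3090 retail configurations and are added here purely for illustration):

```python
# Rough per-resource scaling factors, GA102 vs TU102 (illustrative).

scaling = {
    "peak FP32 (2x per SM, per whitepaper)": 2.0,
    "ROPs (96 -> 112)":                      112 / 96,   # ~1.17x
    "memory bandwidth (672 -> 936 GB/s)":    936 / 672,  # ~1.39x, Titan RTX -> RTX 3090
}

for resource, factor in scaling.items():
    print(f"{resource}: {factor:.2f}x")
# Peak FP32 doubles while the raster back end and bandwidth grow far less,
# which is the imbalance pointed out above.
```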
 
Does it perform how we expect it to with the known specifications?

No. It actually performs better than we could expect from known specifications. One does need to be able to read more than one line in such specifications tho.

As compared to how one expects it to perform based on what the marketing materials say it should do?

Marketing materials include benchmarks for that purpose, and I think they were sufficiently accurate?
 
For the hell of it I'm just going to bulk copy and paste; I have no clue where that RTX IO discussion is but here is from the white paper:
https://www.nvidia.com/content/dam/...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf

Introducing NVIDIA RTX IO
NVIDIA RTX IO is a suite of technologies that enable rapid GPU-based loading and decompression of game assets, accelerating I/O performance by up to 100x compared to hard drives and traditional storage APIs. When used with Microsoft’s new DirectStorage for Windows API, RTX IO offloads dozens of CPU cores’ worth of work to your RTX GPU, improving frame rates, enabling near-instantaneous game loading, and opening the door to a new era of large, incredibly detailed open world games. Object pop-in and stutter can be reduced, and high-quality textures can be streamed at incredible rates, so even if you’re speeding through a world, everything runs and looks great. In addition, with lossless compression, game download and install sizes can be reduced, allowing gamers to store more games on their SSD while also improving their performance.

How NVIDIA RTX IO Works
NVIDIA RTX IO plugs into Microsoft’s upcoming DirectStorage API which is a next-generation storage architecture designed specifically for state-of-the-art NVMe SSD-equipped gaming PCs and the complex workloads that modern games require. Together, streamlined and parallelized APIs specifically tailored for games allow dramatically reduced IO overhead, and maximize performance / bandwidth from NVMe SSDs to your RTX IO-enabled GPU. Specifically, NVIDIA RTX IO brings GPU-based lossless decompression, allowing reads through DirectStorage to remain compressed and delivered to the GPU for decompression. This removes the load from the CPU, moving the data from storage to the GPU in a more efficient, compressed form, and improving I/O performance by a factor of two.

GeForce RTX GPUs will deliver decompression performance beyond the limits of even Gen4 SSDs, offloading potentially dozens of CPU cores’ worth of work to ensure maximum overall system performance for next-generation games. Lossless decompression is implemented with high performance compute kernels, asynchronously scheduled. This functionality leverages the DMA and copy engines of Turing and Ampere, as well as the advanced instruction set, and architecture of these GPU’s SM’s. The advantage of this is that the enormous compute power of the GPU can be leveraged for burst or bulk loading (at level load for example) when GPU resources can be leveraged as a high performance I/O processor, delivering decompression performance well beyond the limits of Gen4 NVMe. During streaming scenarios, bandwidths are a tiny fraction of the GPU capability, further leveraging the advanced asynchronous compute capabilities of Turing and Ampere.

Microsoft is targeting a developer preview of DirectStorage for Windows for game developers next year, and NVIDIA Turing & Ampere gamers will be able to take advantage of RTX IO-enhanced games as soon as they become available.

Unfortunately, there's no mention of what types of compression are supported, except that it's lossless. But it does sound like it's all handled on the driver side of things, so there's no need to write your own compute shader to decompress it on the GPU.
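As a rough sanity check of the "up to 100x" and "factor of two" claims in the quoted passage (the device speeds and the 2:1 compression ratio below are assumptions for illustration, not figures from the whitepaper):

```python
# Rough sanity check of the quoted throughput claims (all inputs are assumptions).

hdd_gbps       = 0.1    # ~100 MB/s for a typical hard drive
gen4_nvme_gbps = 7.0    # ~7 GB/s raw sequential read for a fast Gen4 NVMe SSD
compression    = 2.0    # data stays compressed until the GPU decompresses it

effective_gbps = gen4_nvme_gbps * compression
print(f"Effective delivery rate: {effective_gbps:.0f} GB/s")          # ~14 GB/s
print(f"vs. hard drive:          ~{effective_gbps / hdd_gbps:.0f}x")  # ~140x, same ballpark as "up to 100x"
```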
 
The RTX 3080 is a true 30 TFLOPS card. Just don't expect games to fully leverage it 100% of the time; all GPUs rarely hit peak TFLOPS, and all games utilize GPUs differently. Also, Turing had an equal number of INT32 and FP32 units, so it's not exactly a doubling, depending on a game's ratio of FP32 to INT32 in shaders. Basically, people who think NVIDIA lied about the number of TFLOPS just don't understand how GPUs work.
 
The RTX 3080 is a true 30 TFLOPS card. Just don't expect games to fully leverage it 100% of the time; all GPUs rarely hit peak TFLOPS, and all games utilize GPUs differently. Also, Turing had an equal number of INT32 and FP32 units, so it's not exactly a doubling, depending on a game's ratio of FP32 to INT32 in shaders. Basically, people who think NVIDIA lied about the number of TFLOPS just don't understand how GPUs work.
Where has anyone claimed NVIDIA lied about their TFLOPS?
 
Sure, let's compare the RX 580 at maximum boost vs non-maximum Vega... Seems fair.
No, you compare both cards at their average clock speeds, because those are the clocks being used in your TPU performance comparisons.
The Vega cards were reviewed using AMD's reference blower coolers, which caused thermal throttling at 85ºC. The RX 580 was launched as a partner-only GPU with no reference cooler, meaning most OEMs paired it with open-air coolers that prevented thermal throttling. That said, most RX 580 cards actually averaged above 1400MHz, i.e. well above the 1340MHz "boost frequency" you used to get the 6.2 TFLOPS number in your post.
Using the reference coolers without undervolting, the average clocks for Vega 64 are close to 1400MHz after a while, and for Vega 56 it's around 1300MHz.
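To show how much the assumed clock moves the paper numbers, a quick sketch using the clocks mentioned above (shader counts are the standard configurations; the average clocks are the ones cited in this post):

```python
# 2 FLOPs per ALU per clock (FMA): ALUs * 2 * GHz gives GFLOPS, /1000 converts to TFLOPS.

def tflops(alus, clock_ghz):
    return alus * 2 * clock_ghz / 1000

configs = [
    ("RX 580  @ 1.34 GHz (boost spec)",        2304, 1.34),   # the 6.2 TF figure
    ("RX 580  @ 1.40 GHz (typical average)",   2304, 1.40),
    ("Vega 56 @ 1.30 GHz (reference average)", 3584, 1.30),
    ("Vega 64 @ 1.40 GHz (reference average)", 4096, 1.40),
]
for name, alus, clk in configs:
    print(f"{name}: {tflops(alus, clk):.1f} TFLOPS")
```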

Vega 10's less-than-expected performance came from the fact that its power/frequency curve is pretty terrible compared to Pascal, and not because the architecture is broken. If the Vega architecture was so bad, AMD wouldn't have kept using it to deliver the fastest mobile iGPUs in the market.


And it doesn't even make a difference, because clocks affect all metrics equally, so it doesn't change a thing compared to Ampere.
Except clocks affect compute and texture/pixel fillrate throughput, which you used in your post to try to prove that Vega is more broken than Ampere and worse than Polaris.
Your own comparison is what made it relevant.


Basically people who think nvidia lied about the number of tflops just don't understand how gpus work.
No one made this claim.
The claim is Ampere's very high FP32 throughput may never result in significantly higher performance in future games because the architecture isn't designed to use all that throughput in games anyway (e.g. it can't really do 30 TFLOPs unless the TMUs, ROPs, geometry processors, etc. are unused).
And just like Vega 10 was a chip developed to compete in too many market segments (gaming + productivity + compute), Ampere / GA102 might also have been developed to increase NVIDIA's competitiveness in non-CUDA compute workloads, which was apparently nothing to write home about compared to Vega.

The reality is that claiming "Vega / GCN5 / GFX9 architecture is broken" makes just as much sense as claiming "Games aren't ready for Ampere".
 