Nvidia's 3000 Series RTX GPU [3090s with different memory capacity]

in some gaming scenarios

I think what most meant was this comment.

but if you're actually interested in the truth, then the RTX 3080 is not a 30 TFLOP GPU

The 3080, for example, is a 30 TF GPU. If one is interested in the truth, one could dig a little deeper than just saying 'it's not the truth, it's a lie, we are being misled', etc.
The post I linked to in my previous comment seems to sum things up quite well. No, TFLOPS don't mean everything, there's obviously more to it, but that doesn't mean they don't mean anything either.

https://www.gamespot.com/forums/sys...zealous-33515692/?page=1#js-message-356853198
From user 04dcarraher

''... FLOPS mean nothing when you're comparing totally unrelated GPU architectures... Now if you were to compare, say, GCN 1.0 vs GCN 1.4 based GPUs, then I would say "maybe", since they are still based on the same core design.

The RX 6800 XT (20 TFLOPS) has a pixel rate of 288 GPixel/s and a texture rate of 648.0 GTexel/s.

The RTX 2080 Ti (13.45 TFLOPS) has a pixel rate of 136.0 GPixel/s and a texture rate of 420.2 GTexel/s. Yet it still beats the 6800 XT in RT but loses to it in normal rasterization performance.

While the RTX 3080 (29.77 TFLOPS) has a pixel rate of 164.2 GPixel/s and a texture rate of 465.1 GTexel/s, and beats the 6800 XT overall with much, much better RT performance. The TFLOP increase comes from the 128-core-per-SM design vs Turing's 64-core design, hence the 2x potential for math crunching. However, 64 of the 128 cores can be allocated to INT and/or more types of FP (8/16/32-bit, etc.) based on the type of job, making the GPU more flexible.

The RX 6800 series is a great-performing GPU when it comes to normal rasterization rendering, but it falls flat on its face when it has to do it all, with RT and/or high resolutions. The way RDNA2 does a lot of its RT work is by using its free TMUs (texture mapping units), which are what give the GPU its texture rate. So the higher the texture/pixel resolution and the more RT is used, the more it eats into RDNA's TMU resources, hurting performance.

While the design might be more flexible, in that you "could" allocate more TMUs to RT and adjust the amount, whereas Nvidia's RTX design has a fixed number of dedicated RT processors, the fact that AMD uses the GPU's TMUs was a shortcut for them to check the "does it have RT" box.''
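For what it's worth, those headline figures fall straight out of unit counts and clocks. A quick back-of-the-envelope sketch, assuming the usual reference specs (FP32 lane counts, TMU/ROP counts and boost clocks are my assumptions, not from the post; actual clocks vary by board and load):

```python
# Theoretical peaks from assumed reference specs.
#                 (FP32 lanes, TMUs, ROPs, boost clock in GHz)
gpus = {
    "RX 6800 XT":  (4608, 288, 128, 2.250),
    "RTX 2080 Ti": (4352, 272,  88, 1.545),
    "RTX 3080":    (8704, 272,  96, 1.710),
}
for name, (lanes, tmus, rops, clk) in gpus.items():
    tflops = lanes * 2 * clk / 1000   # 2 FLOPs per lane per clock (FMA)
    gtexel = tmus * clk               # texture rate
    gpixel = rops * clk               # pixel fill rate
    print(f"{name}: {tflops:.2f} TFLOPS, {gtexel:.1f} GTexel/s, {gpixel:.1f} GPixel/s")
# -> ~20.7 / 648 / 288, ~13.4 / 420 / 136, ~29.8 / 465 / 164
```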
 
And what's happening on RDNA2 when there's the same INT execution happening anywhere on any WGP?
Other than that, per NVIDIA's Turing whitepaper, there are on average 36 INT instructions per 100 FP instructions. This leaves RDNA2 with 73.53% peak theoretical FP efficiency at such an instruction mix, due to the 1:1 FP/INT unit split.
[Struck through in the original post:] With Ampere, the unit split is 2 FP to 1 INT SIMD unit, so better efficiency is possible on the 36 INT per 100 FP instruction mix: 86.79% peak theoretical FP efficiency, since INT instructions will be executed in parallel with FP (which leaves an additional 13.26% of FP performance on the table).
Let's check the numbers on such a split: 6800 XT = 21 TFLOPS*0.74 + 21 TOPS*0.26 = 21 trillion mixed operations per second; 3080 = 30*0.87 + 15*0.26 = 30 trillion mixed operations per second.
My bad, the numbers above are partially wrong for the 3080; everything is simpler - 30 TFLOPS is the peak, so 30*0.74 = 22.2 FP TFLOPS and 7.8 INT TOPS are required for the 100 FP / 36 INT split. Still 30 TOPS in total, but it doesn't break the instruction percentages this time.
If there were shaders with just integer instructions, then RDNA2 would win, but shaders contain a mix of instructions. And Ampere's 2:1 FP/INT unit split is better suited to the 100 FP / 36 INT instruction mix.
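To put numbers on that corrected reasoning, a minimal sketch, assuming the Turing whitepaper's average mix of 36 INT per 100 FP instructions and approximate reference peaks (~20.7 TFLOPS for the 6800 XT, ~29.8 for the 3080 are my assumptions):

```python
# Fraction of issue slots left for FP when every slot can take either FP or INT.
fp_fraction = 100 / (100 + 36)                 # ~0.735

rx6800xt_peak = 20.7                           # assumed theoretical TFLOPS
rtx3080_peak = 29.8
print(rx6800xt_peak * fp_fraction)             # ~15.2 FP TFLOPS (+ ~5.5 INT TOPS)
print(rtx3080_peak * fp_fraction)              # ~21.9 FP TFLOPS (+ ~7.9 INT TOPS),
                                               # i.e. roughly the 22.2 + 7.8 quoted above
```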
 
INT32 throughput on AMD hardware is half speed.
[Attached screenshot of the instruction rate table from the RDNA whitepaper]

https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
Page 13
 
I think sniffy was just trying to explain to ToTTenTranz why in some gaming scenarios the 20 TFLOP RDNA2 is equal to the 30 TFLOP Ampere, and simply asked us to look past the basic marketing numbers that most of us here don't adhere to anyway. We all know TFLOPS isn't everything.

I don't think there was a need to bring in the NDS.

Actually he made a clearly false assertion which was corrected as it should be on a technical forum. No need to defend it.

If you want to claim Ampere isn’t really a 30 TFLOP GPU while running INT instructions you can’t claim that RDNA is a 20 TFLOP GPU in the same breath.
 
Hm, maybe it's limited artificially in consumer products for some reason?

Professional cards have the same pattern, e.g. https://www.servethehome.com/amd-radeon-pro-w5700-gpu-review/3/

The white paper you took the excerpt from explicitly lists the various FP instructions as being full rate, then follows with "as well as 24/32-bit integer". The implication there, I believe, is simply that the execution units handle them, but not necessarily at full rate.
 
What we really need is a mixed-operation test (preferably configurable), as I wonder what the actual throughput of the FPU/ALU is in mixed workloads with various FP16/FP32/INT32/etc. operations, for all uarchs.
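Until someone writes one, here's a purely hypothetical analytical stand-in (an idealized issue model, not a real benchmark; the function and unit counts are invented for illustration) that such a configurable test could be sanity-checked against:

```python
def fp_efficiency(fp_only_units: int, flexible_units: int, fp_instr: int, int_instr: int) -> float:
    """Idealized model: fp_only_units handle FP only, flexible_units handle FP or INT."""
    total_units = fp_only_units + flexible_units
    # INT work can only go to the flexible lanes; everything else is perfectly packed.
    cycles = max((fp_instr + int_instr) / total_units, int_instr / flexible_units)
    peak_fp_slots = cycles * total_units       # FP issues possible in that time
    return fp_instr / peak_fp_slots

# All-flexible (RDNA2-like) vs half FP-only / half flexible (Ampere-like),
# at the 100 FP : 36 INT mix and at a much heavier INT mix:
print(fp_efficiency(0, 64, 100, 36))     # ~0.735
print(fp_efficiency(32, 32, 100, 36))    # ~0.735  (the INT share is small enough to hide)
print(fp_efficiency(0, 64, 100, 120))    # ~0.455
print(fp_efficiency(32, 32, 100, 120))   # ~0.417  (FP-only lanes sit partly idle)
```

In this toy model the two splits only diverge once the INT share exceeds what the flexible lanes can absorb, which is exactly the kind of thing a real configurable test would show per uarch.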
 
Actually he made a clearly false assertion which was corrected as it should be on a technical forum. No need to defend it.

If you want to claim Ampere isn’t really a 30 TFLOP GPU while running INT instructions you can’t claim that RDNA is a 20 TFLOP GPU in the same breath.

In my recollection NVIDIA has always done better in-game with fewer FLOPS compared to AMD... so indeed FLOPS do not tell the full picture.
As far as I remember, NVIDIA has been better at keeping its "pipeline full", hence why "async compute" initially didn't do as much for them as for AMD.
 
There's no single magic number that's the be-all-end-all of performance metrics. I'd love to be proven wrong though.

Nope there isn't, even within the same arch. And it's hopeless when comparing different architectures. Take the 3080 vs 3060 Ti for example.

The 3080 is 40-60% faster depending on the game, averaging around 50% faster. But the 3080 has...

84% more FLOPS and texture throughput
70% more bandwidth
24% more fillrate

The advantage in games never drops as low as 24% and never rises as high as 84%. None of the top line theoretical numbers predicted the actual 50% gain.
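Quick sanity check of those deltas, using reference-spec theoretical peaks (the values below are assumed from public specs, not stated in the post):

```python
# 3080 vs 3060 Ti theoretical peaks (assumed reference values).
pairs = {
    "FLOPS / texture rate": (29.8, 16.2),    # TFLOPS (texture rate scales the same way)
    "bandwidth":            (760, 448),      # GB/s
    "fillrate":             (164.2, 133.2),  # GPixel/s
}
for name, (big, small) in pairs.items():
    print(f"{name}: +{(big / small - 1) * 100:.0f}%")   # ~+84%, +70%, +23%
```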
 
Actually he made a clearly false assertion which was corrected as it should be on a technical forum. No need to defend it.

If you want to claim Ampere isn’t really a 30 TFLOP GPU while running INT instructions you can’t claim that RDNA is a 20 TFLOP GPU in the same breath.


Exactly. Ampere has no "free" INT32 ALUs like Turing did and that's exactly like RDNA2, so I don't get sniffy's point. Both metrics are 100% comparable as they both relate to theoretical maximum TFLOP output, regardless of them being used by marketing departments or not.
No GPU ever really reaches its theoretical maximum TFLOP output, much less when rendering games, but that's a fact that everyone here should indeed take into account.


Regardless, the tables have turned and RDNA2 is now an architecture with better rasterization performance-per-theoretical-max-TFLOP than Ampere. Which is as much of a completely worthless metric now as it was when Pascal had a much higher ratio than Vega, IMO.
The only thing that really matters is performance/price (where we deal with die size, process node, yields, PCB and memory cost, margins, etc.) and to a lower degree of importance performance/power.




There's no single magic number that's the be-all-end-all of performance metrics. I'd love to be proven wrong though.
According to some, the be-all-end-all performance metric is Cyberpunk 2077 with raytracing options maxed out at 4K with DLSS.
Nothing else matters anymore. :runaway::runaway:
 
The white paper you took the excerpt from explicitly lists the various FP instructions as being full rate, then follows with "as well as 24/32-bit integer". The implication there, I believe, is simply that the execution units handle them, but not necessarily at full rate.
This is certainly a possibility but would be a pretty skittish way of describing the spec in a whitepaper on AMD's part. INT24 seems to run at full rate at least.

The only thing that really matters is performance/price (where we deal with die size, process node, yields, PCB and memory cost, margins, etc.) and to a lower degree of importance performance/power.
Actually, the only two things which matter are perf/watt and perf/transistor. The latter is also where you should account for features absent from competition. All the rest are a result of these two.
 
Actually, the only two things which matter are perf/watt and perf/transistor.

They're not.
If your 5nm SoC is getting the same perf/watt and perf/transistor as the competition's SoC built on 28nm, then your product is obviously weaker and less able to compete.
Besides, the perf/transistor metric is a bit worthless considering it's variable within the same process, as chip designers can select between performance-optimized transistors and density-optimized ones.
 
Exactly. Ampere has no "free" INT32 ALUs like Turing did and that's exactly like RDNA2, so I don't get sniffy's point. Both metrics are 100% comparable as they both relate to theoretical maximum TFLOP output, regardless of them being used by marketing departments or not.
No GPU ever really reaches its theoretical maximum TFLOP output, much less when rendering games, but that's a fact that everyone here should indeed take into account.


Regardless, the tables have turned and RDNA2 is now an architecture with better rasterization performance-per-theoretical-max-TFLOP than Ampere. Which is as much of a completely worthless metric now as it was when Pascal had a much higher ratio than Vega, IMO.
The only thing that really matters is performance/price (where we deal with die size, process node, yields, PCB and memory cost, margins, etc.) and to a lower degree of importance performance/power.





According to some, the be-all-end-all performance metric is Cyberpunk 2077 with raytracing options maxed out at 4K with DLSS.
Nothing else matters anymore. :runaway::runaway:

Still waiting for your reply...but I guess no reply is all the reply I need ;)
 
RDNA2 performs int32 additions at full rate, or an int24*int24+int32 multiply-add (with the multiplication result being int32). An int32 mul requires four smaller mul/mads.

EDIT: And a substantial share of array indexing calculations, I imagine, can be proven by the compiler to fall into the int24*int24 range (often one side being a constant and the other a bounded loop index).
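As a purely schematic illustration of why a full 32-bit multiply needs several narrower operations, here's a sketch that reproduces the low 32 bits of a 32x32 product using only multiplies whose operands fit in 24 bits (it splits into 16-bit halves for clarity; it is not the actual instruction sequence the compiler or hardware uses):

```python
MASK32 = 0xFFFFFFFF

def mul32_via_narrow_muls(a: int, b: int) -> int:
    # Split each 32-bit operand into 16-bit halves; every multiply below has
    # operands < 2^16, so a 24-bit multiplier could execute it.
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    lo = a_lo * b_lo                      # bits 0..31
    mid = a_lo * b_hi + a_hi * b_lo       # contributes to bits 16 and up
    # a_hi * b_hi only affects bits 32+, which a plain int32 mul discards anyway.
    return (lo + (mid << 16)) & MASK32

assert mul32_via_narrow_muls(0xDEADBEEF, 0x12345679) == (0xDEADBEEF * 0x12345679) & MASK32
```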
 
A GPU which has higher perf/transistor will win in perf/price and features.

What if one GPU is using high-performance libraries and another high-density ones? What if the HP-library GPUs are made on a highly contested (and therefore expensive) node, while the high-density design is on a less performant (and thus cheaper per transistor) node? I think "performance per transistor" goes out the door as a relevant metric when one GPU designer is laying out 40-50 MT/mm^2 while the other is getting 65 MT/mm^2 on the same node.

Performance per die cost is the obvious metric, with the minor problem that the few people who know the cost for sure aren't in a position to tell anybody.
 