NVidia Ada Speculation, Rumours and Discussion

I'm curious what the current bottlenecks are. Bandwidth is commonly mentioned, but Nvidia GPUs have consistently scaled better with core overclocks than with memory overclocks since at least Kepler. A 3090 has 55% more compute (roughly 35.6 vs 23.0 TFLOPS) and 83% more bandwidth (936 vs 512 GB/s) than a 6900 XT, while the 6900 XT has 30% more texel rate and 50% more pixel fill. Infinity Cache eats into the bandwidth advantage, but not completely.

If you're referring to VRAM bandwidth, that is almost never a significant bottleneck across a frame. There are a few workloads where high VRAM bandwidth helps, like writing out the depth or G-buffers, but it's hardly the main bottleneck during shading. People say "bandwidth" but what they really mean is "memory access", which includes the L1/L2 caches, not just VRAM.

Post-processing tends to be compute bound, though, as it's usually crunching a few highly coherent buffers.
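
As a rough illustration of that split, here's a hypothetical CUDA sketch (my own illustration, not any engine's actual code): the G-buffer pass stores many bytes per thread with almost no math, while the post-process pass does a pile of ALU work around one coherent load/store per pixel.

```
#include <cuda_runtime.h>

// Bandwidth-bound: writing out a fat G-buffer target. ~32 bytes stored
// per thread and almost zero math, so VRAM write throughput is the limiter.
__global__ void write_gbuffer(float4* albedo, float4* normal, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    albedo[i] = make_float4(0.5f, 0.5f, 0.5f, 1.0f);
    normal[i] = make_float4(0.0f, 0.0f, 1.0f, 0.0f);
}

// Compute-bound: a tonemap-style post-process. One coherent load and one
// store per pixel, but dozens of ALU ops in between, so the math pipes
// are the limiter rather than memory.
__global__ void tonemap(const float4* hdr, float4* ldr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 c = hdr[i];
    float l = 0.2126f * c.x + 0.7152f * c.y + 0.0722f * c.z; // luminance
    float s = 1.0f / (1.0f + l);                             // Reinhard-style curve
    ldr[i] = make_float4(powf(c.x * s, 1.0f / 2.2f),         // gamma encode
                         powf(c.y * s, 1.0f / 2.2f),
                         powf(c.z * s, 1.0f / 2.2f), c.w);
}
```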
 
Maybe Nvidia doesn’t mind being memory latency bound in games as long as compute bound workloads benefit from the excessive flops. GA102 is a solid 30% faster than Navi 21 in those.
Curious to see the data point(s) you're using for this conclusion.
 
Oh, I thought you had data points that are known-good, like-for-like compute comparisons. Maybe there's something in the links you supplied, but when I see a 2080 Ti close to or faster than a 6900 XT, then I see that you're not actually referencing compute performance, but something else.
 
What's an example of a like-for-like comparison? There's a lot of info at those links, including synthetic Sandra and Aida stuff.
 
The results certainly don't seem compute limited in the first and third links. The second link only tests two GPUs, so it's harder to get a clear picture of performance scaling.
 
I'm doubtful NVidia is designing SMs to win Sandra and Aida to compete against AMD. Real world apps like Resolve do the job nicely already.

I thought you had a particular comparison in mind.

The fastest AMD GPU in the IndigoBench leaderboard:

IndigoBench Results | Indigo Renderer

is a 6800XT. The fastest 3090 scores 60 on the Supercar test, around 50% more than the result you linked.

"30% faster" looks like a random number, to be honest, based on the results you linked.

GA102 is more than 60% faster than Navi 21 in theoretical FLOPS (40 vs 24.93 TFLOPS, 3090 Ti versus 6900 XTX (XTXH chip)), though I'm not sure why TechPowerup says the XTX was never released.
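
(For reference, and assuming the published boost clocks, those numbers are just 2 FLOPs per FMA × lane count × clock: 2 × 10752 × 1.86 GHz ≈ 40.0 TFLOPS for the full GA102, versus 2 × 5120 × 2.435 GHz ≈ 24.9 TFLOPS for the XTXH Navi 21, i.e. about a 1.6x ratio.)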

But as I hinted earlier, co-issue FLOPS is not very useful, since utilisation rate is generally poor.

Utilisation per mm² or per watt is the most useful metric and we could try to talk about that in terms of Ada's SM layout.
 
I shared links to the real world apps that we have data for. If there’s a specific workload that you think is more appropriate I would be happy to see those results too. Or you can choose from the provided data for apps that meet your criteria.

And yes the 30% was a ballpark number based on the many different results. Clearly performance will vary over different apps and workloads.
 
I think it would be more useful to talk about Ampere versus Turing, e.g. 3090Ti versus 2080Ti and how that might be relevant in Ada for games versus "prosumer compute apps".

NVidia is only competing with itself for "prosumer compute apps", so talking about Ada FLOPS for those apps (i.e. ignoring gaming) has some kind of baseline for a discussion about SM architecture changes.

In other words, discussion of FLOPS and SM architecture, referenced against AMD FLOPS, is a dead-end, NVidia doesn't need "excessive FLOPS" to win. 30% (or 300%) faster in Ampere doesn't inform a discussion of possible SM changes...
 
It doesn't really change the conclusion if you compare Ampere to Turing or to RDNA. Ampere's advantage in compute workloads is much higher than its gaming advantage versus both Turing and RDNA; Luxmark is one of the best examples of this. The point is Nvidia doesn't seem to care about the wasted flops in games. Hence we shouldn't be surprised if Ada also "wastes" flops.
 
nVidia didn't waste anything. GA104 has 30% more transistors than TU104 and delivers >30% more performance at the same power level. Perf/mm^2 went up 90% going from 16nm to 8nm.
 
No one mentioned power or die sizes...

Yes, but it is necessary to evaluate results and decisions. Let's compare Navi 22 and GA104 with Navi 10 (5700 XT):
6700 XT: 70% more transistors => 30% more performance
3070: 70% more transistors => 50% more performance

Besides the transistors spent on DX12 Ultimate features, doubling FP32 throughput within an SM does more for gaming than improving clocks and adding an L3 cache. The performance improvement from Turing -> Ampere is in the same ballpark as Navi 1 -> Navi 2, while the transistor budget increase is smaller.
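
(Taking the commonly quoted transistor counts as given, i.e. Navi 10 ≈ 10.3B, Navi 22 ≈ 17.2B, GA104 ≈ 17.4B and TU104 ≈ 13.6B, the ratios work out to 17.2/10.3 ≈ 1.67 and 17.4/10.3 ≈ 1.69, matching the "70% more" above, and 17.4/13.6 ≈ 1.28 for the GA104-versus-TU104 figure earlier.)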
 
We don't know how a theoretical GA104 with fewer flops but more cache would perform. Given the low hardware utilization when gaming, there's a very good chance that config would be better for games. That depends, of course, on whether the extra flops were expensive in terms of transistor budget.
 
Well, one way to think about this is that less FP32 would allow for more SMs, which would allow for more ray tracing hardware.

As far as I can tell, the Int32 pipeline in Ampere (and Turing) is full-speed on one pipe. To be frank, it mystifies me why such high throughput is required for Int32. There are some bitwise operations that occur on this pipe (simple things like AND, or more complex things like finding the first bit set to 1), but how much intensive use is there?

Having implemented this in Turing, NVidia then added FP32 to this pipeline to make Ampere.

Now it could be that this very high Int32 throughput is crucial to NVidia's ray tracing performance, and so the nudge to make it full FP32 in Ampere isn't seen as so difficult. E.g. there may be crucial sorting or bounds-checking type operations in NVidia's ray tracing architecture that depend on high Int32 throughput.

Remember that with a 32-wide hardware thread, an Int32 operation can map each bit to a single work item. So there's a natural fit here, and it may be deeply entwined with ray tracing.

It may actually be specific to BVH-building, used in doing bounding-volume addressing math, comparison-result bit-field masks and erm, so on...

I'm just throwing ideas out there as to why there appears to be an excessive throughput for Int32.

Int24 math, because it effectively corresponds with the mantissa of FP32 math, is pretty trivial to make full-speed in GPUs. So what's interesting is that the slower partial results that Int24 or Int16 math instructions would produce are not enough: NVidia added full-speed Int32. Shrug.
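
To make the "one bit per work item" idea concrete, here's a hypothetical CUDA sketch (my own illustration, not NVidia's actual traversal code) of the warp-wide Int32 pattern I mean: a per-lane comparison gets packed into a 32-bit mask by a ballot, and everything after that is bitwise Int32 work.

```
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical sketch of a warp-wide hit mask, in the spirit of BVH-style
// traversal, assuming a fully active warp. Each of the 32 lanes tests one
// candidate; the per-lane predicate lands in one bit of a single 32-bit
// mask, and from there it's all Int32: AND, find-first-set, clear-lowest-bit.
__device__ void visit_hits(const float* dist, float tmax, uint32_t active) {
    int lane = threadIdx.x & 31;

    // One bit per work item: lane i's comparison result becomes bit i.
    uint32_t hit_mask = __ballot_sync(0xffffffffu, dist[lane] < tmax);

    // Bitwise AND against the mask of still-active lanes.
    uint32_t pending = hit_mask & active;

    while (pending) {
        int next = __ffs(pending) - 1; // lowest set bit (1-based -> 0-based)
        // ... visit candidate 'next' (sorting / bounds checks would go here) ...
        pending &= pending - 1;        // clear the lowest set bit
    }
}
```

Every step after the ballot is pure Int32 bit manipulation, which is exactly the kind of thing a full-rate Int32 pipe would soak up.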
 