The Nvidia future architecture thread (G100/GT300 and such)

CarstenS

In this thread I'd love to see a discussion about what Nvidia needs to (or can, mind you) do to catch up in terms of performance leaps over GPU generations.

In other words: if ATI continues at a pace like this, Nvidia will be in for a very hard time by the next refresh. What can they possibly do about it architecture-wise?


--

Me, I think they've invested a bit too much in granularity, and their texturing power - as much as I love the almost-free, high-quality 16x aniso - seems to be a bit over the top.

Obviously, using more modern tech, AMD's RV770 at sufficient clock speeds is oftentimes not only game, but a fair match for even the GTX 280, despite the latter using about 50% more transistors on more than twice the die size (allegedly).

So, what could Nvidia do?
 
It's really hard to say what NV will do this early. But they have two fundamental options:

1. Continue to build monolithic chips for the high end
2. SLI on a stick for the high end

The path they pick will affect the architectural decisions they make during chip design. But in general I would expect to see vector-based ALUs to increase perf/mm^2, and an increase in the math:tex ratio.
 
The 512-bit memory bus curse distorts the picture really badly with GT200, I reckon.

With 256-bit GDDR5 (128GB/s+) and half the ROPs running at, say, 750MHz+, I imagine GT200b would perform better than GT200 - though the cache per ROP partition would prolly need to be doubled to retain the overall cache amount.

The ROPs + MCs + memory IO consume about 28-29% of the die, according to Arun. If halved, that would be quite a significant gain, though GDDR5 would be more costly in area, I imagine. So, say, 60% of that block goes?... A saving of about 17% of the die?

I still think NVidia could get away with only 64 TMUs at this kind of performance level (saving 5% of the die). So 8 clusters.

Then, increase the ALU:TEX ratio some more, i.e. 4 SIMDs per cluster to retain ALU capability, and Bob's your uncle. 32 SIMDs in 8 clusters versus 30 SIMDs in 10 clusters prolly isn't much of a reduction in die space, though.

Alternatively keep 3 SIMDs per cluster and rely on ALU clocks in the region of 1.6GHz? That would save another few percent of die space?

Overall, this seems to be a die saving of ~25%. Shrunk to 55nm, perhaps somewhere around 20% bigger than RV770 (310mm2)?
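Putting those numbers together as a sanity check - a rough sketch only: the 576mm^2 GT200 die size and the ideal (55/65)^2 optical-shrink scaling are assumptions, and the SIMD saving is pencilled in at 3%:

```python
# Back-of-envelope check of the die savings above (all inputs are assumptions).
gt200_area = 576.0                  # GT200 die size at 65nm, mm^2 (assumed)
rop_mc_io = 0.285                   # ROPs + MCs + memory IO, ~28-29% of die

saving_rop_mc = rop_mc_io * 0.60    # cut ~60% of that block -> ~17% of die
saving_tmu = 0.05                   # 80 -> 64 TMUs
saving_simd = 0.03                  # "another few percent" from fewer SIMDs

remaining = 1 - (saving_rop_mc + saving_tmu + saving_simd)  # ~0.75
area_65nm = gt200_area * remaining                          # ~430 mm^2
area_55nm = area_65nm * (55 / 65) ** 2                      # ideal optical shrink

print(f"Trimmed GT200b at 65nm: {area_65nm:.0f} mm^2")
print(f"Shrunk to 55nm:         {area_55nm:.0f} mm^2")       # ~310 mm^2
print(f"vs RV770 (256 mm^2):    {area_55nm / 256 - 1:.0%} bigger")
```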

Is AMD likely to refresh RV770, e.g. RV780? Would that be ~20% more capable, e.g. 12 SIMDs? +10% clocks?

Overall I think the size of GT200 meant that clocks were considerably lower than what should transpire with GT200b. In other words I think GT200 gives a distorted picture of NVidia's future - it inclines us to think too low.

If GT200b comes in at around 55% of the die size of GT200 allowing NVidia to seriously boost clocks, RV770 just won't be in the same picture.

As a base for discussion of GT300, I think GT200b should be the real deal - GT200 is misleading - though prolly not to the same degree that R600 misled on what RV770 would be :devilish:

Apart from that I think NVidia needs fine-grained redundancy in the ALUs. Turning off clusters just seems clumsy. When you're trailing ATI's peak compute density by ~35%, when that figure includes ATI's 6% ALU redundancy, and when your GPUs will, for other reasons, be bigger - fine-grained redundancy just looks like low-hanging fruit.
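To illustrate why that looks like low-hanging fruit, here's a toy Poisson-yield sketch - the defect density, the ALU area fraction, and the spare-lane behaviour are all assumed figures for illustration, not real foundry or design data:

```python
import math

# Toy Poisson defect model for the ALU array only (all numbers assumed).
die_area_mm2 = 576.0
alu_fraction = 0.20              # assume ~20% of the die is ALUs
defect_density = 0.002           # assumed defects per mm^2

alu_area = die_area_mm2 * alu_fraction
lam = defect_density * alu_area  # expected ALU-array defects per die

# No redundancy: every ALU must be defect-free.
p_clean = math.exp(-lam)

# Fine-grained redundancy (~6% spare-lane overhead, as attributed to ATI):
# approximate it as tolerating any single defect in the slightly larger array.
lam_r = lam * 1.06
p_redundant = math.exp(-lam_r) * (1 + lam_r)

print(f"ALU-array yield, no redundancy: {p_clean:.1%}")      # ~79%
print(f"ALU-array yield, spare lanes:   {p_redundant:.1%}")  # ~97%
```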

Jawed
 
I guess the biggest question is whether GT200's lower clocks and per-mm2 performance compared to G92, along with its delay, were a result of the monolithic design or of some other factors.

Monolithic design is obviously the best choice for scalability. It's just a matter of whether that advantage is worth the extra design effort, tapeout cost, and drain on resources that could be used for larger parts of the market.
 
Jawed, a simpler way of looking at your suggestions is just taking G92b and adding GDDR5 and more math per cluster (24 or 32 SIMDs instead of 16).

I don't see how it would become faster than RV770, and it'll definitely be substantially larger. Sure, these changes help in math heavy applications, but those are the apps where G92b falls furthest behind RV770 anyway.
 
Jawed, a simpler way of looking at your suggestions is just taking G92b and adding GDDR5 and more math per cluster (24 or 32 SIMDs instead of 16).
True enough. If we estimate that the SIMDs in G92b are 20% of the die, then 24 SIMDs, with a 10% die size increase plus, say, 5% for GDDR5, puts GT200b at around 300mm2.
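Spelling that arithmetic out - a quick sketch, where the ~260mm^2 G92b base die size is an assumed figure:

```python
# Rough G92b -> GT200b scaling (base die size is an assumption).
g92b_area = 260.0                  # assumed G92b die size, mm^2
simd_fraction = 0.20               # SIMDs estimated at 20% of the die

simd_growth = (24 / 16 - 1) * simd_fraction  # +50% SIMD area -> +10% of die
gddr5_cost = 0.05                            # assumed extra area for GDDR5 IO

gt200b_area = g92b_area * (1 + simd_growth + gddr5_cost)
print(f"Estimated GT200b: {gt200b_area:.0f} mm^2")   # ~300 mm^2
```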

I don't see how it would become faster than RV770, and it'll definitely be substantially larger.
It'd be faster in the same way that GTX280 is faster - and that's before getting a clock boost.

Sure, these changes help in math heavy applications, but those are the apps where G92b falls furthest behind RV770 anyway.
Don't forget the prodigal MUL - missing from G92b - distorting any performance scaling argument that uses G92b as a base.

ATI's architecture needs to be rated at ~75-80% utilisation in static (no dynamic branching) code, it seems. Although, to be fair, the interpolation of attributes does present a significant overhead on NVidia's architecture - one that hasn't really been quantified. Hmm...

Jawed
 
Jawed, a simpler way of looking at your suggestions is just taking G92b and adding GDDR5 and more math per cluster (24 or 32 SIMDs instead of 16).

I don't see how it would become faster than RV770, and it'll definitely be substantially larger. Sure, these changes help in math heavy applications, but those are the apps where G92b falls furthest behind RV770 anyway.

With Nvidia's stuff it all comes down to clocks. An 8-cluster, 256-shader, GDDR5 G9x variant at G92b clocks should run circles around RV770. It'll be considerably bigger than RV770 though, and there's still the challenge of RV770's higher texturing and AA efficiency to overcome.
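For scale, the peak programmable FLOPS of such a part versus RV770 - a quick sketch, assuming roughly G92b's stock 1836MHz shader clock and counting the co-issued MUL for NVidia:

```python
# Peak shader FLOPS, counting MAD as 2 flops (+ the co-issued MUL for NV).
nv_sps, nv_clock = 256, 1.836e9          # hypothetical 8-cluster, 32-SP/cluster part
nv_flops = nv_sps * nv_clock * 3         # MAD + MUL per clock

rv770_alus, rv770_clock = 800, 750e6     # HD 4870
rv770_flops = rv770_alus * rv770_clock * 2   # MAD per clock

print(f"Hypothetical G9x part: {nv_flops / 1e12:.2f} TFLOPS")    # ~1.41
print(f"RV770 (HD 4870):       {rv770_flops / 1e12:.2f} TFLOPS") # ~1.20
```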
 
With Nvidia's stuff it all comes down to clocks. An 8-cluster 256 shader, GDDR5 G9x variant at G92b clocks should run circles around RV770.
I'm not so sure. In shaders that aren't so simple that they're fillrate limited, even the 625MHz RV770 is about even with the GTX 280 (except for two shaders where only the PS4.0 version runs badly on ATI's GPUs for some reason):
http://www.ixbt.com/video3/rv770-part2.shtml

Besides, as we learned from R580, math alone can only do so much. Heck, even doubling math and texturing doesn't do too much, as we see with G94 vs. G92.

I'd expect 10-15% more speed overall in games when doubling G92's math speed. Along with GDDR5 it would probably approach the 4870 in speed, but would need at least 25% more die space. That's worth it, IMO, but doesn't improve NVidia's perf/$.
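An Amdahl-style sketch of why doubling math buys so little in games - the ALU-bound fraction here is purely an illustrative assumption:

```python
# Amdahl-style estimate: only the ALU-bound fraction of frame time speeds up.
def speedup(alu_bound_fraction, alu_speedup):
    return 1 / ((1 - alu_bound_fraction) + alu_bound_fraction / alu_speedup)

f = 0.25  # assume ~25% of frame time is ALU-limited
print(f"Doubling math: {speedup(f, 2.0) - 1:.0%} faster")   # ~14%
```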
 
Is AMD likely to refresh RV770, e.g. RV780? Would that be ~20% more capable, e.g. 12 SIMDs? +10% clocks?

No, it's due to have a 6-month life cycle on the high end, before R800 Q1 next year. Maybe a 40nm refresh next year for the mid-end.
 
What NVIDIA needs now is the new equivalent of G80->G92: a die shrink with half the bus width, GDDR5, 55nm, and much higher clock speeds.

However I still think that RV770 would beat this at high AA.

On a related note, do we know yet how RV770 performs so well at 4xAA and 8xAA? Do you think it might be some kind of new compression algorithm?
 
GT200 is still faster than RV770. All NV needs to do is make it small enough to sell cheaper and put two on a board.

A die shrink, GDDR5 with a 256-bit bus, removing the DP unit, and maybe halving the ROPs (at a higher clock speed).

That may be enough to bring the cost close enough in line with RV770 to match or exceed its price/performance.

Then as long as they can manage to squeeze 2 of them on a board they should be able to keep the overall performance advantage as well.

It's a tall order, of course, and I don't want to take anything away from the amazing achievement that R7xx is, but it shouldn't be forgotten that GT200 is still the faster chip. ATI only has a performance advantage by using 2 chips vs 1. Credit where it's due, though, because that's not possible for NV atm.
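To put the cost side of that argument in numbers, a rough dies-per-wafer comparison using the thread's own die-size figures - the usual gross-die approximation, ignoring yield:

```python
import math

# Gross dies per 300mm wafer: wafer area / die area, minus an edge-loss term.
def dies_per_wafer(die_mm2, wafer_mm=300.0):
    r = wafer_mm / 2
    return int(math.pi * r**2 / die_mm2
               - math.pi * wafer_mm / math.sqrt(2 * die_mm2))

for name, area in [("GT200 (65nm)", 576),
                   ("trimmed GT200b (55nm, est.)", 300),
                   ("RV770", 256)]:
    print(f"{name:28s} ~{dies_per_wafer(area)} dies/wafer")
```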
 
Besides, as we learned from R580, math alone can only do so much. Heck, even doubling math and texturing doesn't do too much, as we see with G94 vs. G92.
That's usually a bandwidth/fillrate issue though, something that's identical on those two GPUs. There are times when G92 is 50%+ faster per clock and framerate minima are often where the real value lies.

I'd expect 10-15% more speed overall in games when doubling G92's math speed. Along with GDDR5 it would probably approach the 4870 in speed, but would need at least 25% more die space. That's worth it, IMO, but doesn't improve NVidia's perf/$.
GTX280 comfortably leads RV770 in games with AF/MSAA off - I don't think math is a useful basis for comparisons here.

RV770's per-unit and per-mm2 AF/MSAA performance seems to be the key.

I think NVidia's route to performance/$ or performance/mm2 is by cutting back on the excessive unit counts of TMUs and ROPs. GT200's TMUs are more efficient than G92's. The ROPs might get a new lease of life with a redesign for GDDR5 - regardless, they've been choked by GDDR3 bandwidth, so they should bounce back somewhat.

Jawed
 
No, it's due to have a 6-month life cycle on the high end, before R800 Q1 next year. Maybe a 40nm refresh next year for the mid-end.
Hmm, you appear to be saying that R800 will be a 55nm GPU (presumably in this case two on one board). Presumably the die can't get any smaller than RV770 on 55nm for a 256-bit bus - unless they radically reduce power consumption?

Jawed
 
That's usually a bandwidth/fillrate issue though, something that's identical on those two GPUs. There are times when G92 is 50%+ faster per clock and framerate minima are often where the real value lies.
That's precisely my point. When setup, fillrate, and BW are equal, game performance isn't affected as much as you'd expect by having double the ALU and TEX. Adding only ALUs will have an even smaller effect.

GTX280 comfortably leads RV770 in games with AF/MSAA off - I don't think math is a useful basis for comparisons here.

RV770's per-unit and per-mm2 AF/MSAA performance seems to be the key.
Again, exactly my point. I don't think adding math to G92b (or, almost equivalently, chopping everything else from GT200) will do much.

I think NVidia's route to performance/$ or performance/mm2 is by cutting back on the excessive unit counts of TMUs and ROPs.
You do realize that a few sentences before this you attributed G94's speed to its ROPs/BW, right? ;)

If 16 ROPs are useful to a 4 cluster GPU, you can't say 32 ROPs are excessive for a GPU with 10 even faster clusters. You can be sure that GT200 would take a hit with half the ROPs, and likewise RV770 would be notably faster with double the ROPs.

There's no easy fix here. These adjustments that you're suggesting will change perf/$ by a few percentage points at best. NVidia didn't really make any mistakes in the balance between the execution units. RV770 simply raised the bar on how good each part of a GPU -- TMU, ROP, MC, ALU, TEX, scheduling, etc -- can be for a given silicon budget.
 
So Mint, what's your theory for where GT200 needs to be improved upon relative to RV770? You've already eliminated math and texturing.....
 
The main things I see are: better memory management (so 512MB cards don't die as quickly), increased shader clocks (I really do feel they have plenty of pixel/texture fillrate), and perhaps improved AA cycles through the ROPs at 8x MSAA (not all that important to me).
 
So Mint, what's your theory for where GT200 needs to be improved upon relative to RV770? You've already eliminated math and texturing.....
I think you misinterpreted my comments. I'm saying that the problem doesn't lie in the balance of ROPs, math, texturing, etc. The balance is fine, and no single aspect is disproportionately sized for the improvement it brings, IMO.

The problem is, again, that RV770 is just too good in all those areas. Throw in GT200's worse memory management compared to ATI and its low clocks, and you have a much worse product from a perf/$ standpoint.

Again, there's no easy fix. NVidia will have to improve everything if they want to get their margins up to the level they used to be at. It's not only areal efficiency, but aside from the ALUs they need better per-cycle efficiency too. Doing both is tough, and the only reason ATI was able to do it was the mediocrity of their previous design.
 
The key, imho, would be not to overdo things like PhysX and GPGPU ideas.

It is all nice and well but in the end game performance is what sells cards to gamers.
 