ELSA hints GT206 and GT212

GT212 isn't coming anytime soon and they need something to counter RV770X2 with.
On the other hand, 384 SPs and 96 TMUs (presuming they're going 32/8 TPCs for GT21x) on 40nm is nothing to shout about. Such a chip should end up in the same league as G92 in terms of die size.

Purely theoretically, and only in terms of ALU throughput with a decent ALU frequency, it could end up close to or slightly over 2 TFLOPs. You mean G92@65nm? If yes, nothing to disagree with (and all of the above is of course 100% speculative).
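For what it's worth, a back-of-the-envelope sketch of that ~2 TFLOPs figure; the 3 FLOPs/SP/clock (dual-issue MADD + MUL) and the shader clock below are assumptions for illustration, not confirmed specs:

```python
# Back-of-the-envelope ALU throughput for a hypothetical 384-SP GT21x part.
# The 3 FLOPs/SP/clock and the shader clock are assumptions, not specs.
sps = 384                    # speculated shader processor count
flops_per_sp_per_clock = 3   # MADD (2 FLOPs) + MUL (1 FLOP)
shader_clock_ghz = 1.75      # a hypothetical "decent" ALU frequency

tflops = sps * flops_per_sp_per_clock * shader_clock_ghz / 1000
print(f"~{tflops:.2f} TFLOPs")   # ~2.02 TFLOPs
```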

I wouldn't count on any changes to the basic building blocks until DX11 GT300 or whatever they'll call it.
But since they'll still need to change the ROPs to support GDDR5 there is a possibility that they'll 'fix' MSAA 8x performance in their GDDR5 cards...

If the problem lies in the triangle setup (it's a rumour that's circulating; as a layman I've no idea if it even makes sense), then the answer is probably no. Pardon me, but what's the big deal about 8xMSAA anyway? Personally, if I had an SLi or Crossfire system I'd still go for the highest possible resolution, and I have severe doubts that you'd end up with playable performance with more than 4xMSAA samples in the majority of cases. And no, before anyone else says it: there's really no good excuse for NV not to have 8x sparse MSAA that performs as fast as AMD's (since it's my understanding it should require only two cycles for 8xMSAA), but what hardware engineers always say (and they're essentially right) is that you can never get everyone satisfied.

Personally, give me a combination of coverage sampling with a fast-performing edge detect custom filter AA in the future (always on top of at least 4xMSAA) and I'll be a much happier user than with ordinary 8xMSAA. Box filters won't cut it for very long, and that's of course always IMHLO.
 
I don't know what ATI's and nVidia's ROPs are capable of, but I think it's not comparable. If 16 were enough for a chip like the GT200, why would they put 32 in there? (well maybe I know why, but since it's a bit contradictory to one of your theories, I'd like to hear your answer :) )

I thought that one was bleedingly obvious. When an IHV intends to reach X amount of bandwidth, refuses to use GDDR4 for its own reasons and is trying to play it safe with RAM availability as yet another GDDR5 adopter, then the only other alternative is GDDR3. Now, in order to achieve bandwidth X with GDDR3, you use as much bus width as necessary to reach that magical "X" goal.

On recent NV architectures the ROP partitions are bound to the memory channels, meaning that for each 64-bit channel you get 4 ROPs:

G80 = 6*64bits = 6*4 ROPs = 24 ROPs
G92 = 4*64bits = 4*4 ROPs = 16 ROPs
GT200 = 8*64bits = 8*4 ROPs = 32 ROPs
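A minimal sketch of that mapping and of the "bandwidth X via bus width" reasoning above; the 4-ROPs-per-64-bit-partition rule is taken from the list, while the GDDR3 data rate is a single illustrative number applied to all three parts rather than their actual memory clocks:

```python
# ROPs follow the memory partitions on G80/G92/GT200-class parts:
# each 64-bit channel brings a partition of 4 ROPs with it.
def rops_for_bus(bus_width_bits, rops_per_partition=4, channel_bits=64):
    return (bus_width_bits // channel_bits) * rops_per_partition

def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    # data_rate_gbps = effective per-pin rate; 2.2 Gbps is roughly
    # GT200-class GDDR3 and is used for all three parts purely to
    # illustrate how bus width buys bandwidth.
    return bus_width_bits / 8 * data_rate_gbps

for name, bus in [("G80", 384), ("G92", 256), ("GT200", 512)]:
    print(f"{name:6s} {rops_for_bus(bus):2d} ROPs, "
          f"{bandwidth_gb_s(bus, 2.2):6.1f} GB/s at 2.2 Gbps GDDR3")
```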

As for each side's capabilities there's plenty of public data available and especially here on B3D starting from the first G80 analysis.
 
You mean G92@65nm?
Yes.

If the problem lies in the triangle setup (it's a rumour that's circulating; as a layman I've no idea if it even makes sense), then the answer is probably no.
I have doubts about 8x MSAA being 'fixed' before DX11 chips appear myself. Whatever the problem is they probably won't mess with it in what essentially is a refresh and not a new architecture.

Pardon me, but what's the big deal about 8xMSAA anyway?
Well, the drop in itself isn't normal. One can see this issue as a problem of the architecture. And a problem is to be solved regardless of whether it's a major or a minor one.
Another point to consider is that MSAA 8x is the de-facto highest common denominator between NV and AMD, and people will always use it in benchmarks and other comparisons. I've already said many times that using MSAA 8x on NV's hardware makes no sense -- 16xQ CSAA provides better quality with the same performance, and if you're getting low performance then you should use 16x CSAA, which has 4x MSAA performance but substantially better quality (although not as good as 8x MSAA, of course). But since AMD's hardware has no comparable modes, everyone will use MSAA 8x in their tests -- and NV will lose again and again until they 'fix' this problem with the 8x mode.
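For reference, a small summary of how those modes break down into colour/Z samples (which is what drives the performance hit) versus coverage-only samples, as I read the public material -- treat the numbers as a reader's summary rather than an official spec:

```python
# NV G80+ AA modes as I understand them: colour/Z samples drive the
# storage/bandwidth cost, coverage-only samples are nearly free.
modes = {
    # mode       (colour/Z samples, coverage samples)
    "4x MSAA":   (4, 0),
    "8x CSAA":   (4, 4),    # roughly 4x MSAA cost
    "16x CSAA":  (4, 12),   # still roughly 4x MSAA cost
    "8xQ":       (8, 0),    # 'true' 8x MSAA
    "16xQ CSAA": (8, 8),    # roughly 8x MSAA cost, extra coverage info
}
for mode, (colour, coverage) in modes.items():
    print(f"{mode:10s} {colour} colour/Z + {coverage:2d} coverage samples")
```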

Personally, if I had an SLi or Crossfire system I'd still go for the highest possible resolution, and I have severe doubts that you'd end up with playable performance with more than 4xMSAA samples in the majority of cases.
I'm using a GTX280 with a 1920x1200 display and up until now I've found two applications in which 8x MSAA performance isn't enough -- Crysis and Clear Sky =)
On the 4870 (X2, since I have the X2 and not a 4870) 8x MSAA is essentially 'free' since the drop in performance from 4x is almost non-existent. Why wouldn't you use it if it's almost 'free'?
And SLI and CF setups are pretty pointless right now if you don't have a 30" display with a 2560x1600 resolution.

Personally, give me a combination of coverage sampling with a fast-performing edge detect custom filter AA in the future (always on top of at least 4xMSAA) and I'll be a much happier user than with ordinary 8xMSAA. Box filters won't cut it for very long, and that's of course always IMHLO.
Well the problem is that ED CFAA is very slow even on RV770. As far as I can tell 4x MSAA + ED is slower than 8x MSAA while the quality of edge AA is comparable. So for now AMD is mostly stuck with MSAA 8x as the 'best' mode from a quality/performance point of view, while NV is stuck between the slow and nice 16xQ CSAA and the fast and not-so-nice 16x CSAA.
 
Well, the drop in itself isn't normal. One can see this issue as a problem of the architecture. And a problem is to be solved regardless of whether it's a major or a minor one.
Another point to consider is that MSAA 8x is the de-facto highest common denominator between NV and AMD, and people will always use it in benchmarks and other comparisons. I've already said many times that using MSAA 8x on NV's hardware makes no sense -- 16xQ CSAA provides better quality with the same performance, and if you're getting low performance then you should use 16x CSAA, which has 4x MSAA performance but substantially better quality (although not as good as 8x MSAA, of course). But since AMD's hardware has no comparable modes, everyone will use MSAA 8x in their tests -- and NV will lose again and again until they 'fix' this problem with the 8x mode.

Agreed; yet more to it later on.

I'm using a GTX280 with a 1920x1200 display and up until now I've found two applications in which 8x MSAA performance isn't enough -- Crysis and Clear Sky =)
On the 4870 (X2, since I have the X2 and not a 4870) 8x MSAA is essentially 'free' since the drop in performance from 4x is almost non-existent. Why wouldn't you use it if it's almost 'free'?
And SLI and CF setups are pretty pointless right now if you don't have a 30" display with a 2560x1600 resolution.
I still have a 21" high-end CRT that reaches 2048*1536*32@75Hz, but that's beside the point. And no, 8xMSAA isn't free at resolutions where an SLi/CF setup would actually make sense.

Well the problem is that ED CFAA is very slow even on RV770. As far as I can tell 4x MSAA + ED is slower than 8x MSAA while the quality of edge AA is comparable. So for now AMD is mostly stuck with MSAA 8x as the 'best' mode from a quality/performance point of view, while NV is stuck between the slow and nice 16xQ CSAA and the fast and not-so-nice 16x CSAA.
Bingo and that's what I actually had in mind for the first paragraph. For both IHVs their best available AA modes are way too slow to use due to (different) hardware limitations. AMD could have fixed their edge detect CFAA performance for RV770 at least, too. If that had been the case, the latter would have had a far bigger ace up its sleeve than it already has. But as I said above, you can never have it all and you can never make them all happy.

***edit: almost forgot...

Well, the drop in itself isn't normal. One can see this issue as a problem of the architecture. And a problem is to be solved regardless of whether it's a major or a minor one.

Let's assume that the problem is the limited triangle setup. The necessary changes sound to me like they wouldn't be achievable in a refresh of the same generation. Any IHV would weigh up the R&D resources necessary for such changes and simply conclude that it ain't worth bothering with for now. It probably isn't much different with AMD's edge detect modes either.
 
Bingo and that's what I actually had in mind for the first paragraph. For both IHVs their best available AA modes are way too slow to use due to (different) hardware limitations.
8xQ/16xQ modes are quite usable on GTX280. It's not like they're slowing everything to an unplayable state. Most games today are console 'ports' and GTX280 is more than enough to maintain playable framerate with 8xQ/16xQ in 1920x1200 in these.
'24x' ED CFAA on the other hand is slow to the point where even HL2 becomes unplayable at 1920x1200 but it's arguably better at edge AA than 16xQ. So it's not really fair to say that highest AA modes from both vendors are too slow. 8xQ/16xQ from my point of view are quite usable most of the time on GTX280 in 1920x1200.

AMD could have fixed their edge detect CFAA performance for RV770 at least, too. If that had been the case, the latter would have had a far bigger ace up its sleeve than it already has. But as I said above, you can never have it all and you can never make them all happy.
I'm not really sure that ED CFAA performance is something that can be 'fixed', since it's a shader which will always lower performance unless it's done in some dedicated HW, which would be a strange way to go forward.
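Purely as a conceptual sketch of why a shader-based resolve costs what it does (this is not ATI's actual CFAA filter, just an illustration of the idea): the resolve has to read every sub-sample, decide whether the pixel sits on an edge, and only on edges fall back to a wider, more expensive filter that also touches the neighbours' samples.

```python
# Conceptual sketch of an edge-detect resolve -- not ATI's actual CFAA
# shader, just an illustration of why a programmable resolve costs extra
# bandwidth/ALU work compared to a fixed-function box resolve.
def resolve_pixel(samples, neighbour_samples, threshold=0.05):
    """samples: this pixel's MSAA sub-samples (RGB tuples);
    neighbour_samples: lists of sub-samples of surrounding pixels
    (every extra fetch is extra cost)."""
    box = [sum(chan) / len(samples) for chan in zip(*samples)]
    # crude 'edge' test: do this pixel's sub-samples disagree?
    spread = max(max(chan) - min(chan) for chan in zip(*samples))
    if spread < threshold:
        return box  # interior pixel: the cheap box resolve is enough
    # edge pixel: widen the filter over the neighbours' samples as well
    widened = list(samples) + [s for n in neighbour_samples for s in n]
    return [sum(chan) / len(widened) for chan in zip(*widened)]
```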

Let's assume that the problem is the limited triangle setup. The necessary changes sound to me like they wouldn't be achievable in a refresh of the same generation. Any IHV would weigh up the R&D resources necessary for such changes and simply conclude that it ain't worth bothering with for now. It probably isn't much different with AMD's edge detect modes either.
That's why I'm not expecting any changes in MSAA on future GT2xx chips. GT3xx (or G100 or whatever) -- maybe, I hope.
 
Someone will knock us over the head anytime soon, since the OT has gotten too long *cough*

8xQ/16xQ from my point of view are quite usable most of the time on GTX280 in 1920x1200.

I think you're missing my point; if you had the choice between 1920*1200 + 8xMSAA and 2560*1600 + 4xMSAA you'd most likely prefer the latter. Frankly, in any application where a 280 can still handle 8xQ or higher I'd rather use something like 16xS instead. The performance differences are small, but thanks to the -1.0 LOD bias you at least gain sharper textures.

To me there aren't any absolutes when it comes to AA with the variety of modes both IHVs offer.

Finally, it goes without saying that both IHVs will probably come up with "new ideas" for their true next generation. And by the way, IMHLO it would be a pretty bad idea for any future D3D11 architecture not to be capable of 1 Tri/clock.
 
Well the problem is that ED CFAA is very slow even on RV770.

I have yet to test with monsters like Crysis or Clear Sky - for which I doubt it's advisable to enable anything more than the simplest MSAA at all - but from my testing with older games I've found that ED performs quite well - it's actually oftentimes faster than tent-filter AA in Need for Speed Carbon, for example.
 
I thought that one was bleedingly obvious. When an IHV intends to reach X amount of bandwidth, refuses to use GDDR4 for its own reasons and is trying to play it safe with RAM availability as yet another GDDR5 adopter, then the only other alternative is GDDR3. Now, in order to achieve bandwidth X with GDDR3, you use as much bus width as necessary to reach that magical "X" goal.

On recent NV architectures the ROP partitions are bound to the memory channels, meaning that for each 64-bit channel you get 4 ROPs:

G80 = 6*64bits = 6*4 ROPs = 24 ROPs
G92 = 4*64bits = 4*4 ROPs = 16 ROPs
GT200 = 8*64bits = 8*4 ROPs = 32 ROPs
You're right, it is bleedingly obvious. Each memory channel is paired with a block of four ROPs. But why? You're saying that it should be easy for nVidia to rebalance the TPCs to 32 SPs with 8 TUs. At the same time, you're saying that they'll probably introduce GDDR5 support, which means reworking the mem. controller/ROP part, but suddenly it's impossible to have four channels with 8 ROPs each? ATI has done something similar with RV670, although I'm not sure whether they changed the number of channels (R600 had 8× 64bit) or just narrowed the channels to 32 bits, but that doesn't matter much.
'24x' ED CFAA on the other hand is slow to the point where even HL2 becomes unplayable at 1920x1200 but it's arguably better at edge AA than 16xQ. So it's not really fair to say that highest AA modes from both vendors are too slow. 8xQ/16xQ from my point of view are quite usable most of the time on GTX280 in 1920x1200.
You can't compare 16×Q CSAA to 24× ED CFAA. 16×Q looks just a bit better than plain 8× with a box filter, the coverage samples don't really do much. The edge-detect filter is a performance hog, but provides the best quality.
That's why I'm not expecting any changes in MSAA on future GT2xx chips. GT3xx (or G100 or whatever) -- maybe, I hope.
I don't expect any new AA modes either, but maybe they will be fixing the performance of 8-sample AA in new GT2xx chips.
 
Before you come to any weird assumptions that GT200 is bandwidth limited, you'd have to come up with some credible data (like benchmarks from various sites for instance) that indicate any of it. I haven't been able to see any indications so far, but feel free to link me to anything that I might have missed.

Check your PMs please, i have sent you the data of my project. :)
 
You're right, it is bleedingly obvious. Each memory channel is paired with a block of four ROPs. But why? You're saying that it should be easy for nVidia to rebalance the TPCs to 32 SPs with 8 TUs. At the same time, you're saying that they'll probably introduce GDDR5 support, which means reworking the mem. controller/ROP part, but suddenly it's impossible to have four channels with 8 ROPs each? ATI has done something similar with RV670, although I'm not sure whether they changed the number of channels (R600 had 8× 64bit) or just narrowed the channels to 32 bits, but that doesn't matter much.

If there's a need for 32 ROPs they could do it; I still have severe doubts that it's actually necessary. The high-end G92s are falling behind an 8800GTX in most high-resolution + AA cases because of their 50% smaller framebuffer and lower bandwidth, not because of 8 fewer ROPs (always IMHLO).
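To put a very rough number on that framebuffer argument -- a naive estimate that ignores colour/Z compression, textures and driver overhead, so treat it purely as an illustration:

```python
# Naive colour+Z footprint of a multisampled render target; ignores
# compression, textures and driver overhead, so treat it purely as an
# illustration of why a 512MB G92 board hurts more than a 768MB G80.
def fb_megabytes(width, height, msaa, bytes_colour=4, bytes_z=4):
    return width * height * msaa * (bytes_colour + bytes_z) / (1024 ** 2)

for width, height in [(1920, 1200), (2560, 1600)]:
    for aa in (4, 8):
        print(f"{width}x{height} {aa}xAA: ~{fb_megabytes(width, height, aa):.0f} MB")
# 2560x1600 with 8xAA is already ~250 MB before a single texture is loaded.
```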

As for RV670 and RV770 there are significant differences between each of the two architectures' ROP capabilities.

You can't compare 16×Q CSAA to 24× ED CFAA. 16×Q looks just a bit better than plain 8× with a box filter, the coverage samples don't really do much. The edge-detect filter is a performance hog, but provides the best quality.

Sorry for the nitpick, but it should read better quality, or even more precisely the illusion of better quality, since the real point of antialiasing is to trick the eye into believing that most of the (aliasing) noise is gone. Personally, when it comes to any form of aliasing my top priority lies with polygon-interior data noise, while polygon edge/intersection aliasing comes second. If I have dancing meanders on my screen due to crappy filtering optimisations, or because some insane developer had the funky idea that an absurdly low negative texture LOD looks "better", no multisampling algorithm will save my day either. In the second case a tent filter might even prove more useful than an edge detect filter, since it will at least cover some of the obscene texture aliasing (if no form of supersampling is available).
 
If there's a need for 32 ROPs they could do it; I still have severe doubts that it's actually necessary. The high-end G92s are falling behind an 8800GTX in most high-resolution + AA cases because of their 50% smaller framebuffer and lower bandwidth, not because of 8 fewer ROPs (always IMHLO).
I also think it's in the framebuffer. But GT206 should be significantly faster than G92 or G80. If they really do want to improve performance with AA, halving the number of ROPs (compared to GT200) is not a very good idea.
As for RV670 and RV770 there are significant differences between each of the two architectures' ROP capabilities.
I know, that's why "RV770 also has 16 ROPs like the RV670" is not an argument.
Sorry for the nitpick, but it should read better quality, or even more precisely the illusion of better quality, since the real point of antialiasing is to trick the eye into believing that most of the (aliasing) noise is gone. Personally, when it comes to any form of aliasing my top priority lies with polygon-interior data noise, while polygon edge/intersection aliasing comes second. If I have dancing meanders on my screen due to crappy filtering optimisations, or because some insane developer had the funky idea that an absurdly low negative texture LOD looks "better", no multisampling algorithm will save my day either. In the second case a tent filter might even prove more useful than an edge detect filter, since it will at least cover some of the obscene texture aliasing (if no form of supersampling is available).
Yeah, I also prefer better textures to AA. Tent filters... well their effect is somewhat more full-screen, but they also tend to blur the picture. In some games, it affects the HUD and it's a bit uncomfortable IMO.
 
I also think it's in the framebuffer. But GT206 should be significantly faster than G92 or G80. If they really do want to improve performance with AA, halving the number of ROPs (compared to GT200) is not a very good idea.

If the ROPs are re-architected to compensate for the halving of functional units, the quantity would be irrelevant.
 
In that case, yes. But then it also depends on how you count them - AFAIK the current scheme is not exactly accurate, so 16 "fat" ROPs might be the same as 32 "slim" ones. Anyway, if we're speculating about GDDR5 support which means re-designing the ROPs, nVidia does have some flexibility there.
 
In that case, yes. But then it also depends on how you count them - AFAIK the current scheme is not exactly accurate, so 16 "fat" ROPs might be the same as 32 "slim" ones. Anyway, if we're speculating about GDDR5 support which means re-designing the ROPs, nVidia does have some flexibility there.

The idea here is that if NV really is going to adapt GT2xx to support GDDR5 with the goal of reducing bus-width and thus chip size/complexity, they'll have to re-architect the MC and at the very least re-organize the ROPs into partitions of 8 units. Otherwise they'll end up cutting the ROP count in half, effectively halving fillrates. It would be silly to create a successor to GT200 that has slightly more bandwidth (and BW efficiency) but half the fillrate.

The other option is as you suggest - to maintain ROP organization of 4 per partition, but re-architect the ROPs themselves to create "fat" ROPs with higher fillrates than their current ROPs.

Both options have drawbacks. The first option would create an abundance of fillrate for entry-level GT2xx-derived SKUs because the base unit for ROP partitions would now be 8 instead of 4. This would add complexity and cost to entry-level parts, which is rather antithetical to the nature of this market. The second option would take far more time to design and implement, which itself is antithetical to the idea of reducing chip cost as quickly as possible.

It will be interesting to see what route they take. Maybe they'll choose an altogether different option. I've seen some suggest a 24 ROP/384-bit MC arrangement coupled with higher clocks to hopefully offset the lower fillrates. I somewhat doubt the 55nm process would enable the sort of clockspeeds necessary to achieve ~ parity with GT200's fillrates, though.
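Some rough numbers for the options above; the GDDR5 per-pin rate and the configurations themselves are guesses for illustration, not leaked specs:

```python
# Rough bandwidth/ROP comparison of the options discussed above.
# The GDDR5 per-pin rate (4.5 Gbps) is a guess; the configs are illustrative.
def bandwidth_gb_s(bus_bits, gbps):
    return bus_bits / 8 * gbps

configs = [
    # description                        bus  Gbps  ROPs
    ("GT200: 512-bit GDDR3",             512, 2.2,  32),
    ("256-bit GDDR5, 4 ROPs/partition",  256, 4.5,  16),  # fillrate halved
    ("256-bit GDDR5, 8 ROPs/partition",  256, 4.5,  32),  # re-organised ROPs
    ("384-bit GDDR3, 24 ROPs",           384, 2.2,  24),  # needs higher clocks
]
for name, bus, gbps, rops in configs:
    print(f"{name:34s} {bandwidth_gb_s(bus, gbps):6.1f} GB/s, {rops} ROPs")
```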
 
http://66.196.80.202/babelfish/translate_url_content?.intl=us&lp=fr_en&trurl=http%3A%2F%2Fwww.hardware.fr%2Farticles%2F725-6%2Fdossier-amd-radeon-hd-4870-4850.html

[chart: IMG0023623.gif]


http://www.behardware.com/articles/723-5/product-review-the-nvidia-geforce-gtx-280-260.html

[chart: IMG0023703.gif]
 
The second option would take far more time to design and implement, which itself is antithetical to the idea of reducing chip cost as quickly as possible.
I somehow doubt that NV's implementation of GDDR5 MCs will be a reaction to the need for 'reducing chip cost as quickly as possible'.
It's not like GDDR5 is a surprise to them. They've probably been planning to go GDDR5 for a while now and designed all the needed blocks some time ago.
Do not forget that GT200 is using GDDR3 and a 512-bit bus because it was supposed to be released before GDDR5 became available.
 
I somehow doubt that NV's implementation of GDDR5 MCs will be a reaction to the need for 'reducing chip cost as quickly as possible'.
It's not like GDDR5 is a surprise to them. They've probably been planning to go GDDR5 for a while now and designed all the needed blocks some time ago.
Do not forget that GT200 is using GDDR3 and a 512-bit bus because it was supposed to be released before GDDR5 became available.

512-bit bus + GDDR5 is insane overkill. Why have an overly complex chip when you could get the same performance from a redesigned chip that costs less to build?
 
Where did I say that I'm expecting a 512-bit bus with GDDR5 in the next GT2xx chips?

How else do you expect them to reduce chip size (and thus cost) drastically? A straight shrink to 55nm isn't going to accomplish that without re-working of the chip.
 
How else do you expect them to reduce chip size (and thus cost) drastically? A straight shrink to 55nm isn't going to accomplish that without re-working of the chip.

Who said they were gunning to replace the top-end part with a simple optical shrink to 55nm? The slide mentions 45nm (could be 40nm though, hard to make out that last digit)... :???:
 