NVIDIA GF100 & Friends speculation

Even an overclocked refresh from AMD will have little chance against this card, it appears.
Actually, we don't really know that. If an application really shows a bottleneck somewhere in tri setup rate (either due to just lots of vertices, or tessellation), then yes, it appears GF100 would blow it out of the water.
And certainly, despite Dave claiming that improving tri setup rate would only improve performance by 2%, I think we've seen a fair bit of tri setup rate limitation already. Unless I'm missing something, how else do you explain that 2x HD5770 are sometimes more than 20% faster than a single HD5870 (for instance here, http://www.guru3d.com/article/radeon-hd-5770-review-test/16)? The 5770 is very much a half 5870; with two of them you get in theory pretty much the same throughput in texturing/ALU, ROPs and memory, minus inefficiencies (which depend heavily on the game) due to CrossFire, but you have twice the setup rate.
But certainly, other games don't show this, so I think sometimes there will indeed be not much benefit from the increased tri setup rate, and in other areas GF100 still has at least a theoretical disadvantage (ALU, texturing), though again in some other areas it also has its advantages (ROPs, memory bandwidth, though I think the 1200MHz GDDR5 clock suggested by hardware.fr is too optimistic).
I'd also agree the solution Nvidia has chosen should allow quite nice scaling, but it's of little use right now as they can't scale it up further for now, and on the lower-end parts it'll probably not help much, if at all, and will likely still be a more complex solution than the single global setup unit AMD uses.
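To put rough numbers on the 2x HD5770 vs. single HD5870 comparison above, here's a minimal sketch; the unit counts and clocks are just the commonly quoted specs, so treat them as assumptions:

Code:
# Back-of-the-envelope comparison: 2x HD5770 (CrossFire) vs. 1x HD5870.
# Unit counts/clocks are the commonly quoted specs, treated here as assumptions.

def theoretical(clock_mhz, alus, tmus, rops, mem_gb_s, setup_tris_per_clk=1):
    return {
        "alu_gflops":  clock_mhz * alus * 2 / 1000,   # MAD = 2 flops per lane per clock
        "gtexels_s":   clock_mhz * tmus / 1000,
        "gpixels_s":   clock_mhz * rops / 1000,
        "mem_gb_s":    mem_gb_s,
        "setup_mtris": clock_mhz * setup_tris_per_clk,
    }

hd5870  = theoretical(850, 1600, 80, 32, 153.6)
hd5770  = theoretical(850,  800, 40, 16,  76.8)
cf_5770 = {k: 2 * v for k, v in hd5770.items()}  # naive 2x scaling, ignores CrossFire losses

for k in hd5870:
    print(f"{k:12s} 5870: {hd5870[k]:8.1f}   2x 5770: {cf_5770[k]:8.1f}")
# Every row comes out identical except setup: 850 Mtris/s vs. a combined 1700 Mtris/s.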

Oh, and those 448-shader-unit parts are really strange. With 4 SMs per GPC, there has to be some asymmetry somewhere...
 
Are you sure Hemlock isn't faster in other games/benchmarks? Anyway, this isn't about transistors, it's about manufacturing cost / performance (two 300mm² GPUs aren't more expensive than a single 450mm² one).

No, because I don't know how well GF100 will perform in games. But you can't compare Hemlock and GF100 in the Unigine benchmark and come to the conclusion that Nvidia wasted a lot of transistors on useless things.
This is an example of the microstuttering problem in the Unigine benchmark with a 5970.
 
Hang on, isn't triangle setup outside both DS and TS?
Yes, setup comes after VS->HS->TS->DS->GS.

Higher tess factors will surely increase time spent in both. So I can't see how DS or TS affect tri-setup.
If DS is a bottleneck, then of course that affects setup.

However, in any real DS, dozens of ALU cycles will be spent interpolating the attributes, and I expect the TS to be faster than that, somewhere around 1 tri/clock.
DS can run on multiple ALUs in parallel. The entire GPU could, instantaneously, be running the DS shader.

My point is that when the GPU is running all shaders in a typical mix as seen in a game with tessellation active, which is the bottleneck in geometry processing: TS, DS or Setup?

It's worth noting that NVidia has biased RS towards smaller triangles, seemingly able to put 4 of them into a hardware thread. That might indicate that RS is actually the bottleneck, not any of the geometry stages that precede it.

But in order for NVidia to go with 2x4 rasterisation (small triangles) NVidia had no choice but to implement 4 setup units, to maintain some hope of keeping up when triangles are larger.

So, taken this way, the 4 setup units in GF100 are a total PR exercise, diverting attention from the real issue: they're not there to attack a setup bottleneck in games with tessellation, they support the 4 rasterisers which can be more efficient when they're working on small triangles but also to stop the GPU grinding to a halt on large triangles.

So, anyone with any evidence that tessellation actually creates a setup bottleneck in HD5870, step forward.

Jawed
 
But in order for NVidia to go with 2x4 rasterisation (small triangles) NVidia had no choice but to implement 4 setup units, to maintain some hope of keeping up when triangles are larger.

So, taken this way, the 4 setup units in GF100 are a total PR exercise, diverting attention from the real issue: they're not there to attack a setup bottleneck in games with tessellation, they support the 4 rasterisers which can be more efficient when they're working on small triangles but also to stop the GPU grinding to a halt on large triangles.

Would this be taking into account that the rasterizers would be running at 1/2 shader clock, while the L2/ROPs (according to AnandTech) run in their own domain, possibly a remnant of the non-hot-clock domain of GT200?

If Nvidia had hit the high end of its target clocks, the throughput of the rasterizers would have matched up more closely with the 48 ROPs, assuming those stayed at the rather sedate global clocks of the preceding architectures.
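Just to put hypothetical numbers on that (8 pixels/clock per raster engine is from NVidia's GF100 material; every clock value below is an assumption, not a disclosed spec):

Code:
# Sketch: raster scan rate at half the hot clock vs. 48 ROPs on a separate clock.
# All clock values here are assumed/rumoured, not confirmed.

def raster_gpix(hot_clock_mhz, gpcs=4, pix_per_clock=8):
    # rasterisers assumed to run at half the shader (hot) clock
    return gpcs * pix_per_clock * (hot_clock_mhz / 2) / 1000

def rop_gpix(rop_clock_mhz, rops=48):
    return rops * rop_clock_mhz / 1000

print(raster_gpix(1500), rop_gpix(600))   # 24.0 vs 28.8 Gpix/s at an optimistic hot clock
print(raster_gpix(1400), rop_gpix(600))   # 22.4 vs 28.8 Gpix/s if the hot clock slips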

Perhaps the overhead of keeping 4 rasterizer blocks properly consistent would have been too high if they had used something wider than 2x4.
Of all the hardware costs possible, I wouldn't think the cost of just having additional scan/raster units per block would be as significant as other things Nvidia has splurged on.
 
I'm sorry, but I'm not the one who operates on "faith" here.
What you already know about GF100 leaves no reason to prefer a 5(8/9)70 over GF3(6/8)0 (Eyefinity is nice and all, but if I ever wanted to play on several 1080p monitors I'd certainly want more performance than even the 5970 can provide, so I consider them even in that regard).
What we still need to know to make a final judgement is the price/performance ratios, of course. But I'll be very surprised if NV doesn't set the right prices, knowing almost everything about the competing AMD cards this time around.
Everything beyond that is "faith".
Does Nvidia know about AMD's refresh, which could come out at the same time as Fermi, or do you think ATI made a video card and said, hey, we don't need anything better than this for the next year?


So tessellation in EG is essentially useless? Great! You should tell that to everyone who bought that first-to-market DX11 solution and tell them not to worry -- AMD will kindly ask them to buy another, better card soon. Doesn't all of this sound great?
How is this different from anything else? It's been that way with Nvidia products since the TNT cards, and it will be that way with the Nvidia solution. When the GeForce finally hits the market, ATI will have had cards on the market for roughly 6 months, and for most of that time they will have been selling cards priced between $100 and $600. Recently ATI have released sub-$100 cards, with more to fill in. We are still waiting on Nvidia's DX11 cards, and it will be a long time before Nvidia hits the same price points as ATI. So what performance do you think developers are going to target: the ATI cards, or the non-existent Nvidia cards?

Aside from all that, ATI is rumored to have new tech coming out later this year. So will many fans be pissed if their 6-month-old $400 card doesn't perform tessellation as quickly as a brand-new card that costs $100-$200 more? Will people who bought a $150 ATI DX11 card be upset that a $350+ card performs better?
 
I think we've seen a fair bit of tri setup rate limitation already. Unless I'm missing something, how else do you explain that 2x HD5770 are sometimes more than 20% faster than a single HD5870 (for instance here, http://www.guru3d.com/article/radeon-hd-5770-review-test/16)? The 5770 is very much a half 5870; with two of them you get in theory pretty much the same throughput in texturing/ALU, ROPs and memory, minus inefficiencies (which depend heavily on the game) due to CrossFire, but you have twice the setup rate.
What about shared memory bandwidth, for example, being exactly the same between Juniper and Cypress?

There is a bottleneck which is common to Juniper and Cypress, core-frequency related but apparently not setup related. If you have both at hand, you can verify that Cypress performs worse even without many triangles when you reduce the core clock to 425MHz, and doesn't lose much (if anything) with many triangles.

For the Heaven case, I eliminated memory bandwidth, blending, texturing and ALU throughput on top of setup rate. Perhaps it's a driver issue, or an internal latency/throughput issue, but it's hardly anything else.


DS can run on multiple ALUs in parallel. The entire GPU could, instantaneously, be running the DS shader.

My point is that when the GPU is running all shaders in a typical mix as seen in a game with tessellation active, which is the bottleneck in geometry processing: TS, DS or Setup?
Exactly what I pointed out earlier and even yesterday.

They're basically showing their higher setup as a joker, but DS and HS could require more ALU usage than technically available just for tessellation.
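For a sense of scale, here's a rough budget check; all figures are assumed, and it deliberately ignores the HS, VS and pixel work sharing the same ALUs:

Code:
# Rough budget: how many ALU slot-ops per domain vertex could Cypress spend
# before the DS alone, rather than setup, limits geometry throughput?
# Assumes 1 tri/clock setup and ~1 new domain vertex per tessellated triangle.

core_clock_mhz = 850
alu_lanes      = 1600      # 320 VLIW5 units x 5 lanes
setup_tris_clk = 1
verts_per_tri  = 1.0       # assumption

ops_budget  = alu_lanes / (setup_tris_clk * verts_per_tri)
verts_per_s = core_clock_mhz * 1e6 * setup_tris_clk * verts_per_tri

print(ops_budget)    # ~1600 scalar ops per vertex before DS alone throttles setup
print(verts_per_s)   # ~850M domain vertices/s the DS would have to sustain
# "Dozens" of interpolation ops is well under that budget, so the squeeze only
# shows up once the ALUs are also loaded with HS, VS and pixel shading.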
 
Yes, setup comes after VS->HS->TS->DS->GS.


If DS is a bottleneck, then of course that affects setup.


DS can run on multiple ALUs in parallel. The entire GPU could, instantaneously, be running the DS shader.

My point is that when the GPU is running all shaders in a typical mix as seen in a game with tessellation active, which is the bottleneck in geometry processing: TS, DS or Setup?

It's worth noting that NVidia has biased RS towards smaller triangles, seemingly able to put 4 of them into a hardware thread. That might indicate that RS is actually the bottleneck, not any of the geometry stages that precede it.

But in order for NVidia to go with 2x4 rasterisation (small triangles) NVidia had no choice but to implement 4 setup units, to maintain some hope of keeping up when triangles are larger.

So, taken this way, the 4 setup units in GF100 are a total PR exercise, diverting attention from the real issue: they're not there to attack a setup bottleneck in games with tessellation, they support the 4 rasterisers which can be more efficient when they're working on small triangles but also to stop the GPU grinding to a halt on large triangles.

So, anyone with any evidence that tessellation actually creates a setup bottleneck in HD5870, step forward.

Jawed

Emphasis mine.

I think you are onto something. The "serial" 5870 also has 2 rasterizers, which can operate in parallel for non-overlapping tris (should be rather common when tess is turned on), fed by a single setup unit. Maybe it really is a rasterizer limitation, instead of a setup limitation.

@850MHz, you can do 850 Mtris/sec. If a game pushes 1 Mtris/frame, then at 60fps we are at ~7% of setup peak. Given this, it is hard for me to believe that games are actually setup limited, even in corner cases.
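As a quick sanity check of that arithmetic (the per-frame triangle count is the assumed figure above):

Code:
# ~7% of setup peak: 1 Mtris/frame at 60 fps against 850 Mtris/s.
setup_peak_tris_s = 850e6        # 1 tri/clock at 850 MHz
tris_per_frame    = 1_000_000    # assumed figure
fps               = 60

utilisation = tris_per_frame * fps / setup_peak_tris_s
print(f"{utilisation:.1%}")      # ~7.1%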

And certainly, despite Dave claiming that improving tri setup rate would only improve performance by 2%, I think we've seen a fair bit of tri setup rate limitation already. Unless I'm missing something, how else do you explain that 2x HD5770 are sometimes more than 20% faster than a single HD5870 (for instance here, http://www.guru3d.com/article/radeon...view-test/16)? The 5770 is very much a half 5870; with two of them you get in theory pretty much the same throughput in texturing/ALU, ROPs and memory, minus inefficiencies (which depend heavily on the game) due to CrossFire, but you have twice the setup rate.

This limitation could be because of the rasterizer too.
 
What about shared memory bandwidth, for example, being exactly the same between Juniper and Cypress?

There is a bottleneck which is common to Juniper and Cypress, core-frequency related but apparently not setup related. If you have both at hand, you can verify that Cypress performs worse even without many triangles when you reduce the core clock to 425MHz, and doesn't lose much (if anything) with many triangles.

For the Heaven case, I eliminated memory bandwidth, blending, texturing and ALU throughput on top of setup rate. Perhaps it's a driver issue, or an internal latency/throughput issue, but it's hardly anything else.
Hmm, that's a good point. I'd have thought, though, that the shared memory bandwidth shouldn't really make much of an impact in graphics, although I guess texturing (L2-L1 bandwidth) could suffer, for instance. Or maybe even things like thread dispatch aren't fast enough?
So you're probably right, and it isn't setup limited. That doesn't mean though that with tessellation it couldn't become a limitation (though if it really made a difference, I think game developers would probably include an option to turn down tessellation; should be pretty easy, no?)
In any case, RV8xx shows some signs of not-too-good scaling (at least going from Juniper to Cypress), whatever the reason. That might be just what's needed for GF100 to be faster, if it scales any better...
 
That doesn't mean though that with tessellation it couldn't become a limitation (though if it really made a difference, I think game developers would probably include an option to turn down tessellation; should be pretty easy, no?)
Well, if 850 Mtris/s is not enough then, as I already said, the scene/engine is a pile of shit.

AMD's sample is quite interesting to look at when it comes to tessellation tweaking, as is Heaven with its distance-adaptive method, and it's quite hard to think of any case where performance would suffer more from setup than from shaders.
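For what it's worth, "distance adaptive" presumably means something along these lines; this is a toy sketch of the idea, not Heaven's actual code:

Code:
# Toy sketch of distance-adaptive tessellation: scale the tess factor down with
# distance from the camera so triangle density stays roughly constant on screen.

def tess_factor(distance, near=1.0, far=100.0, max_factor=16.0, min_factor=1.0):
    t = (distance - near) / (far - near)   # 0 at the near plane, 1 at the far plane
    t = min(max(t, 0.0), 1.0)
    return max(max_factor * (1.0 - t), min_factor)

for d in (1, 10, 50, 100):
    print(d, tess_factor(d))   # factor falls from 16 near the camera to 1 far away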

NVidia went the "sub-pixel triangles" path, but is it really relevant?
 
PSU-failure said:
They're basically showing their higher setup as a joker, but DS and HS could require more ALU usage than technically available just for tessellation.

Like others have already said, tessellation definitely puts pressure on the shader core. However, none of the theories offered so far explains Nvidia's numbers in the Cypress comparison.

Saying the increased setup rate is just for PR is beyond silly. Even if we were to assume that the real bottleneck is rasterization of small triangles then distributed rasterization would inherently lead to distributed setup to feed those rasterization units.
 
Does Nvidia know about AMD's refresh, which could come out at the same time as Fermi, or do you think ATI made a video card and said, hey, we don't need anything better than this for the next year?

<snip>

And you are making the same mistake. You are somehow assuming that NVIDIA is also sitting idle. They surely are not. They know they are late and they must be working on their next products. Think of the last time NVIDIA was late and what happened next. Whether Fermi succeeds or fails in this generation (because it's late, most likely not because of its performance), it surely is the basis for their next generations and they are certainly working on them as we speak.
 
But in order for NVidia to go with 2x4 rasterisation (small triangles) NVidia had no choice but to implement 4 setup units, to maintain some hope of keeping up when triangles are larger.

So, taken this way, the 4 setup units in GF100 are a total PR exercise, diverting attention from the real issue: they're not there to attack a setup bottleneck in games with tessellation, they support the 4 rasterisers which can be more efficient when they're working on small triangles but also to stop the GPU grinding to a halt on large triangles.
I'm amazed time and again. :) You sound a bit like Charlie, only more techie and educated in your ability to turn almost everything against Nvidia (I don't mean that in a bad way - I'm absolutely fine with different perspectives).

But what if cause and effect the way you take it are reversed?
 
Buying a GF100 right now blindly is what I'd call faith, as it would hardly be an informed decision.
Right now no one can buy a GF100. So you call "faith" something that doesn't exist? OK then.

There are more question marks than facts at the moment, so please, let me be sceptical, as I rarely have a lot of faith in anything / anyone.
So you're sceptical because of what you don't know? 8)
While I do have questions about the new TMUs, I see nothing else to be sceptical about in what was disclosed today, and lots of cool stuff never done before in a GPU. So why would I be sceptical unless I'm biased?

Please, it makes me nauseous when someone tries to tell me that I said something I didn't. Fermi's tessellation is clearly superb, but does this render Evergreen useless? Remains to be seen. Maybe Fermi's tessellation is overkill?
You have a benchmark result -- Unigine. With Cypress you gain 27 fps, with GF100 -- 43. Is it overkill? Is 27 fps enough for you?

I've been lurking long enough in these forums to know your bias. But please let me remain sceptical, even though you may "find my lack of faith disturbing" ;).
Yeah, and I've been lurking long enough to know that most people saying "you're biased" are quite biased themselves to begin with. You're not a judge here; you're as subjective as everyone else.

There is one obvious reason... they're available.
Well, let's say that I have a GTX285. Why would I buy a 5870 for ~30% more performance if I know that in a couple of months I may get a GF100 with +60-150% performance? I see no reason to do this, so for me as a GTX285 owner, 5800 availability isn't a reason not to wait for GF100 (and everything less is simply slower than my GTX285, which I bought a year ago; no, I'm not a fan of AFR cards for $700, thanks).

And no, what we know wouldn't even change my decision if I were in need; pure bottleneck-free "benchmarks" are not information but technical details which could very well have zero effect on performance. The little information we have all says it's not a given that GF100 will be faster, even with those fancy "4 times or even more faster" charts.
Technical details can't have zero effect on performance.
The little information we have clearly says that GF100 will be faster than Cypress. The only question right now is how much faster. Those fancy "4 times or even more faster" charts mean that under some conditions it may be 4 times or even more faster than Cypress. As simple as that.

Before today, we knew nothing and could only speculate about what GF100's performance would be, whereas we knew Cypress's performance.
Today, we know little more, except that they worked hard to reduce bottlenecks, but they don't show any result proving it was relevant.
Sure, they are idiots there at NV, they always remove irrelevant bottlenecks. I mean, it's clear that parallelization of something serial is a bad way to go -- no one in the industry is doing such things at the moment! [/end sarcasm]
It's a question of you grasping at straws more than of NV not disclosing any information. They've disclosed quite enough to make informed guesses about performance.

And then you write this gem:
As far as I remember you weren't too fond of G80 either, so why should I bother?
 
And you are making the same mistake. You are somehow assuming that NVIDIA is also sitting idle. They surely are not. They know they are late and they must be working on their next products. Think of the last time NVIDIA was late and what happened next. Whether Fermi succeeds or fails in this generation (because it's late, most likely not because of its performance), it surely is the basis for their next generations and they are certainly working on them as we speak.

That is not a mistake I'm making. Obviously Nvidia was 6 months late. So for a while at least we will see Nvidia and ATI competing with different parts against each other. ATI launched a new GPU. Now it's Nvidia's turn for a new GPU, and ATI will be on a refresh again. Then ATI releases a new GPU when Nvidia is on a refresh.

My point is that ATI will be releasing something new around when Fermi actually comes out. By the time there are enough DX11 games with tessellation, those who bought a 5870 will be on another card. For the time being they have been enjoying the performance and features a 5870 brings to the table. Because of ATI's lead, future DX11 games will be based on the 5x00 series' performance, and thus the 5x00 series is not a wasted investment or a product bought on faith.

The poster I was responding to acted like Fermi was the second coming. What's he going to do 6 months after Fermi comes out and it's no longer the second coming? Will he damn it like he is damning the 5x00 series?
 