NVIDIA Fermi: Architecture discussion

I believe dkanter raised the issue at one point and also pointed out that Fermi did not have the option to do the intermediate rounding. Don't think he's a graphics programmer but nobody has stepped up to say that it's NOT an issue either.
Microsoft has increased precision requirements in D3D I believe - but to suggest that graphics rendering will be "broken" or artefacted by having too much shader precision is a reach in my view.

The difference between existing ATI and NVidia cards already entirely undermines this argument. Older versions of DirectX don't mandate rounding behaviour, and shader compilation (where instruction ordering affects precision) further muddies things.
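For anyone unclear on what the intermediate-rounding difference actually amounts to, here's a minimal sketch (contrived values, Python doubles standing in for shader floats) of MAD-style double rounding versus FMA-style single rounding - the kind of last-bit discrepancy being argued about, not broken rendering:

from fractions import Fraction

def mad(a, b, c):
    # MAD-style: the product is rounded to double precision, then the sum is rounded.
    return (a * b) + c

def fused(a, b, c):
    # FMA-style: compute a*b + c exactly (via rationals), round only once at the end.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Contrived inputs where the intermediate rounding throws away the entire result.
a = b = 1.0 + 2**-30
c = -(1.0 + 2**-29)
print(mad(a, b, c))    # 0.0 - the 2**-60 term is lost when a*b is rounded
print(fused(a, b, c))  # ~8.67e-19 (2**-60) - a single rounding preserves it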

How is it the same? Increasing efficiency doesn't guarantee that you won't be bandwidth limited.
OK, back to square one: HD5870 with less bandwidth is outperforming GTX285 and appears to be moderately bandwidth limited. That's the kind of target, per GB/s performance, we should be expecting from GF100.

Z rate and seemingly fp16 post-processing are the heavy bandwidth users. Z rate is 850MHz x 32 x 4 = 108GP/s in HD5870 versus say 600MHz x 48 x 8 = 230GP/s in GF100. fp16 rate is 850MHz x 80 x 0.5 = 34GT/s versus 600MHz x 128 x 0.5 = 38GT/s. So between the two, perhaps 50%+ performance assuming they're reasonably equivalent bottlenecks and presuming that NVidia can improve the efficacy of these units.
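Spelling that arithmetic out, with the caveat that the 600MHz / 48 ROP / 128 TMU GF100 figures above are this post's assumptions rather than confirmed specs:

def z_rate_gpix_s(clock_mhz, rops, z_per_rop_clock):
    # Theoretical Z-only fill rate in Gpixels/s.
    return clock_mhz * rops * z_per_rop_clock / 1000

def fp16_rate_gtex_s(clock_mhz, tmus, rate_factor):
    # Theoretical fp16 filtering rate in Gtexels/s (0.5 = half-rate fp16).
    return clock_mhz * tmus * rate_factor / 1000

print(z_rate_gpix_s(850, 32, 4))        # HD 5870: ~108.8 GP/s
print(z_rate_gpix_s(600, 48, 8))        # assumed GF100: ~230.4 GP/s
print(fp16_rate_gtex_s(850, 80, 0.5))   # HD 5870: ~34 GT/s
print(fp16_rate_gtex_s(600, 128, 0.5))  # assumed GF100: ~38.4 GT/s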

Also, I'm presuming that architectural improvements will only remove bottlenecks from non-bandwidth consuming functions (e.g. increased ALU:TEX), i.e. bandwidth limitations in texturing and fillrate will become more important.

I will happily admit it's now really murky trying to assess how a game is bottlenecked - Crysis is fantastically obscure, for example. So only the IHVs can really see how the mix of units and respective clock rates affect things.

Except 2900XT was slow at nearly everything, not just one setting.
That was completely sarcastic :LOL:

Picking a setting where one architecture has an obvious performance cliff is cherry picking. Using 4xAA isn't cherry picking since performance at that setting typically scales in line with performance at other settings - 0xAA, different resolutions etc. 8xAA is the outlier.
You mean like turning on shadows in games where NVidia's patent (PCF in TMUs) meant that competing graphics cards couldn't implement that algorithm in hardware? Until D3D10. Yes, cherry-picking of that sort has been going on for years.

Also, given the theoretical Z rate of GT200, the fact it has a performance cliff is not the reviewer's problem - eye-candy is eye-candy and enthusiast-level cards have no excuses. Overall I think it's ignorance/laziness/perceived-as-irrelevant.

There's still a fundamental question for reviewers: are we trying to assess which is the most powerful card (how can we make them weep?) or are we trying to assess which is faster at de-facto settings (meaningful to typical readers in typical games).

http://www.xbitlabs.com/articles/video/display/radeon-hd5800-crossfirex_6.html

Jawed
 
Ooh, definitely agreed. Eyefinity's more of the same and it seems to defeat HD5970 quite easily. Though I think that's because it's a kludge dreamed up after the hardware was designed. Otherwise there'd be none of this fucking around with Dell mini-DP adaptors. But that's for another thread.
Hardly. You don't magic out an extra 4 display pipelines as an afterthought.
 
Hardly. You don't magic out an extra 4 display pipelines as an afterthought.
I never said AMD did - AMD documents this capability as being for laptops to cater for all the display configurations in various docking and multi-screen scenarios. This capability has been kludged into Eyefinity. Otherwise the cards would have 1 DVI and 2 HDMI ports or 2 DVI and 1 HDMI and the scrabbling around to find a working solution wouldn't have happened. Active adaptors that cost $100, who's kidding who?

Jawed
 
I never said AMD did - AMD documents this capability as being for laptops to cater for all the display configurations in various docking and multi-screen scenarios. This capability has been kludged into Eyefinity. Otherwise the cards would have 1 DVI and 2 HDMI ports or 2 DVI and 1 HDMI and the scrabbling around to find a working solution wouldn't have happened. Active adaptors that cost $100, who's kidding who?
The requirement for notebooks was the capability of the digital lane outputs, which doesn't equate to a need to drive each display independently; Eyefinity draws off of that but extends it to support as many display pipelines as it has digital display lanes - this isn't a "kludge", this is a conscious design choice that adds new capabilities beyond the requirements of the notebook vendors.

The requirement for driving the extra displays, though, has always been via DisplayPort, as DisplayPort can be timed from a single clock generator (clock generators are costly to integrate) - the Eyefinity6 Edition, when driving 6 DP displays, needs just one clock source, while DVI or HDMI outputs need a clock source per device. The outputs of the "standard" boards were designed to support the same level of output that we had done on previous generations, but allow users to extend that to support 3 panels via DisplayPort.
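A toy way to express that constraint - purely a sketch of the point above (one shared clock source covering all DP outputs, one per DVI/HDMI device), not a description of the actual display block:

def clock_sources_needed(outputs):
    # outputs: e.g. ["DP", "DP", "DVI"]. Per the post above, DP links can share
    # one clock generator, while each DVI/HDMI output needs its own.
    dp_outputs = [o for o in outputs if o == "DP"]
    legacy_outputs = [o for o in outputs if o in ("DVI", "HDMI")]
    return (1 if dp_outputs else 0) + len(legacy_outputs)

print(clock_sources_needed(["DP"] * 6))                    # Eyefinity6 board: 1
print(clock_sources_needed(["DVI", "DVI", "HDMI", "DP"]))  # "standard" output mix: 4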
 
In R800 FMA is only available on the X,Y,Z,W lanes. All five lanes have MAD. Seems like a cost-cutting measure - the availability of FMA might have something to do with the DP implementation too, i.e. there are extra bits there anyway, so they got used for FMA.
So Juniper (and Redwood/Cedar) don't support FMA?
 
Oh, noes!
Anything multiplied by zero results in... zero, i.e. nothing will be rendered to the screen output. :LOL:

Not really. I usually read it as number of samples per pixel used for AA. So the base case is 0x :p

Microsoft has increased precision requirements in D3D I believe - but to suggest that graphics rendering will be "broken" or artefacted by having too much shader precision is a reach in my view.

Hopefully, because if there are issues with this then it will be a big headache for the driver team.

OK, back to square one: HD5870 with less bandwidth is outperforming GTX285 and appears to be moderately bandwidth limited. That's the kind of target, per GB/s performance, we should be expecting from GF100.

My initial thought was much simpler than that - i.e. efficiency improvements aside, could Fermi hope to challenge HD5970's nominal bandwidth with current GDDR5 modules on a 384-bit bus? At first I was skeptical because I assumed much higher memory clocks for the HD 5970. I'm a bit less skeptical now after realizing it only goes up to 256GB/s. But that's probably a useless comparison anyway, given what we've seen of theoretical bandwidth numbers being essentially meaningless as a performance indicator.
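Back-of-the-envelope for that comparison (the 256GB/s figure is from above; the per-GPU HD 5970 memory numbers are my assumption):

def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    # Peak bandwidth = bus width in bytes * per-pin data rate.
    return bus_width_bits / 8 * data_rate_gbps

# HD 5970: two GPUs, each on a 256-bit bus with 4 Gbps effective GDDR5.
print(2 * bandwidth_gb_s(256, 4.0))   # 256 GB/s aggregate

# Per-pin data rate a single 384-bit bus would need to match that.
print(256 / (384 / 8))                # ~5.3 Gbps, at the upper end of GDDR5 of the time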

Comparisons of efficiency are difficult and really come down to how you want to spin it. HD5870 has only about 25% more bandwidth than HD4890 yet manages to outrun it by 60-70% on average. That's great when looking at bandwidth. But what about the fact that 60-70% is lower than the 100% increase in texturing and shading resources? Is it bandwidth-efficient, or simply inefficient in those other areas?

You could do a similar exercise for GT200 and G92 and claim GT200 was really efficient at using its texture or shader units because the performance gain was much higher than the theoretical increase. Yet everybody still pans it for inefficiency, no?
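To put rough numbers on that spin problem (using ~65% as the midpoint of the 60-70% range quoted above; purely illustrative):

def scaling_efficiency(perf_ratio, resource_ratio):
    # Performance gained per unit of the resource added; >1 looks "efficient".
    return perf_ratio / resource_ratio

hd5870_vs_hd4890_perf = 1.65  # ~60-70% faster on average (midpoint)
print(scaling_efficiency(hd5870_vs_hd4890_perf, 1.25))  # vs. bandwidth: ~1.32 - looks great
print(scaling_efficiency(hd5870_vs_hd4890_perf, 2.00))  # vs. ALU/TEX:  ~0.83 - looks poor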

You mean like turning on shadows in games where NVidia's patent (PCF in TMUs) meant that competing graphics cards couldn't implement that algorithm in hardware? Until D3D10. Yes, cherry-picking of that sort has been going on for years.

Yeah but those more egregious examples don't support the argument for using 8xAA as an "average case" :) Of course there are very valid reasons why you think 8xAA should be used in reviews (why shouldn't it?) but it'll be very hard to promote that as a typical scenario. In other words, if you had to choose a single setting that gave you an idea of the general performance standings of the various architectures and products based on them would you choose 8xAA?

There's still a fundamental question for reviewers: are we trying to assess which is the most powerful card (how can we make them weep?) or are we trying to assess which is faster at de-facto settings (meaningful to typical readers in typical games).

The first goal is achievable I think, the latter not so much. There is no such thing as de-facto settings in a world where everybody has different tastes and more importantly, different monitors - this is the main reason why I find Hardocp's approach particularly useless.
 
The first goal is achievable I think, the latter not so much. There is no such thing as de-facto settings in a world where everybody has different tastes and more importantly, different monitors - this is the main reason why I find Hardocp's approach particularly useless.

A reviewer could combine both in a review. You just take from each game the worst-case scenario and a typical scenario that represents more or less average behaviour.

Performance results already scale across resolutions, meaning that most reviews have the different-monitor point covered already. I personally just wouldn't include 1280/1680 resolutions when benchmarking high-end GPUs.

Not really. I usually read it as number of samples per pixel used for AA. So the base case is 0x

It might sound like hair splitting but 1xAA/1xAF means no AA/no AF.
 



Same thing happened in a lot of forums back in the nV30/FX days, probably for the same reason too. :yep2:


Forum lawyer hat on: Does this mean we aren't free to speculate on things from now on unless we are absolutely sure of the facts? If so I see a huge decline in posting in general...

Sorry, it's Friday and I was bored. :p

I wasn't there during that time, but I was during the R600 flop. And I saw nothing like people almost foaming at the mouth in order to bash NVIDIA, hoping that it goes under.

The funny thing is that these same people created many threads during R600, literally begging people to "support the underdog" so that it didn't die :)
 
You mean like turning on shadows in games where NVidia's patent (PCF in TMUs) meant that competing graphics cards couldn't implement that algorithm in hardware? Until D3D10. Yes, cherry-picking of that sort has been going on for years.

Except that, AFAIR, most games did not emulate this 'missing function' via shaders but only had unfiltered shadow map edges then (Battlefield 2 comes to mind).
Plus: I didn't know that Nvidia's patent was preventing others from doing similar things - as they've shown with their DX10 archs.

Also, given the theoretical Z rate of GT200, the fact it has a performance cliff is not the reviewer's problem - eye-candy is eye-candy and enthusiast-level cards have no excuses. Overall I think it's ignorance/laziness/perceived-as-irrelevant.
I don't think theoretical Z is the problem. It's bandwidth-limited in its abundance on GTX 28x just as on HD 5870. In MDolenc's Fillrate Tester both are way short of their theoretical rates. Unfortunately, I've only got numbers with not-the-newest drivers, and some include overclocking. :(

Double Z rate:
HD 2900 XT is at 88.6% with bytes per pixel
HD 3870 is at 87.2% with bytes per pixel

Quad Z rate:
HD 4770 is at 55.8% with bytes per pixel
HD 4870 is at 62.8% with bytes per pixel
HD 5870 is at 67.9% with bytes per pixel

8800 GTX is at 56.6% with bytes per pixel
GTX 280 is at 47.7% with bytes per pixel


Interestingly, both GTX 280 (147.4 GB/s) and HD 5870 (153.6 GB/s) have very similar raw bandwidth and also score very similarly in MDolenc's Fillrate Tester's Z portion: 73.535 vs. 73.969 GZix/sec - only one run, might vary a bit.
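For reference, those percentages fall out of measured over theoretical Z-only rate. A quick check against the two measured numbers above, assuming reference clocks (602MHz, 32 ROPs, 8 Z/clock for GTX 280; 850MHz, 32 ROPs, 4 Z/clock for HD 5870, as in the earlier post) and treating it as approximate since some of the quoted results include OC:

def z_efficiency(measured_gz_s, clock_mhz, rops, z_per_rop_clock):
    # Measured Z rate as a fraction of the theoretical peak.
    theoretical_gz_s = clock_mhz * rops * z_per_rop_clock / 1000
    return measured_gz_s / theoretical_gz_s

print(z_efficiency(73.535, 602, 32, 8))  # GTX 280: ~0.48
print(z_efficiency(73.969, 850, 32, 4))  # HD 5870: ~0.68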
--
Or did you mean Z-rate with 8x MSAA enabled?



There's still a fundamental question for reviewers: are we trying to assess which is the most powerful card (how can we make them weep?) or are we trying to assess which is faster at de-facto settings (meaningful to typical readers in typical games).

http://www.xbitlabs.com/articles/video/display/radeon-hd5800-crossfirex_6.html

Jawed
Agreed, that's one of the basic decisions you'd have to make, if you're not putting the work of two reviews into one.
 
No, the base case would be 1xAA.

What do you mean "No"? :D Is there a rule about how to represent "no AA" or is that just an opinion?

Exactly. Otherwise the increase in samples would be infinite, which is certainly not the case.

Why? The notation simply refers to the number of subsamples. Where does infinity come into play? I actually think 0xAA is more intuitive as 1xAA implies some sort of AA is being applied which isn't the case.
 
What do you mean "No"? :D Is there a rule about how to represent "no AA" or is that just an opinion?
AFAICS, it is more than an opinion, it's the right way to describe things.



Why? The notation simply refers to the number of subsamples. Where does infinity come into play? I actually think 0xAA is more intuitive as 1xAA implies some sort of AA is being applied which isn't the case.
0xAA is more intuitive, yes, but not correct as it implies (to me at least) that 0 samples were chosen. 2xAA is the minimum AA, as you take 2 jittered samples per pixel.
 
NoAA == 1xAA != 0xAA == NoOutput

Does my pseudo-code compile? :LOL:

About the colour/Z rates, I think ArchMark is still the better tool to measure with (it has the lowest overhead). I have a bunch of reference results, but none from GT200(b) hardware. Anyone?
 
remember the "Turbocache" geforces followed by ATI "hypermemory"? where you have a small, slow pool of ram on your graphics cards, while using system memory simultaneously.

IIRC XGI was doing the same and calling it "extreme cache", but there was to be a version featuring no memory whatsoever - i.e., a zero megabyte exxtreme cache! :)
 
At that point it's only human language: inconsistent and convenient. We have weirder stuff: 1MB is 1024KB, 1.44MB is 1440KB, 1Mbit/s is one million bits per second, and so on.
 