NVIDIA Fermi: Architecture discussion

The kickass performance is pretty meaningless until competition arrives. NV30 would have seemed kickass if ATI's 9xxx series hadn't existed.

Well, G80 almost doubled the performance of the last gen's high-end card (from NV) as well. And Hemlock is certainly a huge jump, if not an outright doubling.

The bigger news is that a Fermi X2 seems to be alive. If so, then NV will prolly capture the single-card crown comfortably.
 
Well, G80 almost doubled the performance of the last gen's high-end card (from NV) as well. And Hemlock is certainly a huge jump, if not an outright doubling.

The bigger news is that a Fermi X2 seems to be alive. If so, then NV will prolly capture the single-card crown comfortably.

Compared to today's products, then yes, it's quite possible. However, given the supposed (even further) delay of Fermi-based products, any such Gemini product could be facing either a 28nm 5800 refresh or, as some have proposed (though I doubt it), RV900-based products. Best guess would put a Fermi Gemini at around July 2010.
 
Was just going to post that, looks like Rys snuck into TR and wrote an article when they weren't looking. Lots of interesting stuff in there....looks like he's been holding out on us.

Seems pretty sure about:

- There being one DP and one SP SIMD per core
- Memory clocks of 4200MHz for ~200GB/s bandwidth (quick sanity check after the list)
- Core and shader clocks of 650MHz and 1700MHz respectively
- Higher quality texture filtering
- Minimal dedicated tessellation hardware
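
A quick sanity check on those first two bullets, assuming the 384-bit bus discussed later in the thread: a 4200MT/s effective data rate × 48 bytes per transfer (384 bits / 8) ≈ 201.6GB/s, so the quoted memory clock and bandwidth figures are at least self-consistent.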
 
So, concerning GeForce Fermi, "very soon" has the same meaning to NVIDIA as saying "Easter is very soon" two months before Christmas?!
Now that's great.
Let's tweet more pics, NVIDIA.
 
Maybe he can also explain why there's no DX10.1 on GT200, then, and why they need a new architecture for it?

Bad editing in the article all-around, is this a botched translation?

No, that's just bad English (and maybe a bit of ignorance, too: Firmy? :oops: ) :LOL:.

I don't share that opinion. While the R600 (HD2900XT) release was late, ran hot and underperformed, it did eventually spawn the likes of the RV670, RV635 and RV620 (HD3000), which led to the highly successful RV770/740/730/710 (HD4000) and now the HD5000 series... all from the carcass of the nearly stillborn R600.

Wow. Someone who agrees with me on the R600. This is a historic day. I still to this day don't think the R600 was that bad. Especially when you look at how AMD designs have developed since it.

QFE. R600 was not a good deal by itself, but it surely contained a lot of good ideas that, in the long term, proved themselves better than the ones boasted by G80. ;)

Well, G80 almost doubled the performance of the last gen's high-end card (from NV) as well. And Hemlock is certainly a huge jump, if not an outright doubling.

The bigger news is that a Fermi X2 seems to be alive. If so, then NV will prolly capture the single-card crown comfortably.

Given the 225W TDP that already surfaced for Tesla high-end cards, I don't think that Gemini is that near... The fact that a project exists doesn't mean that the thing will be on the market.
 
So, concerning GeForce Fermi, "very soon" has the same meaning to NVIDIA as saying "Easter is very soon" two months before Christmas?!
Now that's great.
Let's tweet more pics, NVIDIA.

Makes you wonder how many "weekly fun facts" they have left.
 
Hopefully, because if there are issues with this then it will be a big headache for the driver team.
The order of computation can have a dramatic effect on the result, the classic example being something like 200-2*(100+0.000000001) in single-precision. FMA versus MUL then ADD alludes to a similar kind of ordering issue, one choice resulting in the use of FMA, the other path resulting in MUL then ADD. So, in scientific computing the programmer would have to be very careful to manage how the PTX generated by CUDA gets translated into hardware instructions. Gulp. Presumably there are hints or maybe explicit PTX instructions. Even with MAD there might be different rounding modes or other fiddlesome details. (ATI has two different MUL instructions, for example.)



In graphics MAD is effectively MUL then ADD, so while there are three possible sequences of computation:
  • MAD
  • MUL then ADD
  • FMA
there's really only MUL then ADD, or FMA to contemplate. If the programmer writes MUL then ADD, the compiler for an older GPU would usually just make that MAD. If the new GPU's compiler does FMA (and there is no MAD) the result will be different but I don't see how that's going to show.
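
To make that concrete, here's a minimal sketch in plain C (nothing Fermi-specific, just IEEE fp32 with fmaf from math.h) of the single-rounding difference: the separately rounded product drops a low-order bit that the fused version keeps. Compile with -ffp-contract=off (GCC/Clang) and link with -lm, so the "MUL then ADD" path isn't silently fused by the compiler.

#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 1.0f + 0x1.0p-12f;            /* 1 + 2^-12, so a*a = 1 + 2^-11 + 2^-24 exactly */

    float prod         = a * a;             /* product rounded to fp32: the 2^-24 bit is lost */
    float mul_then_add = prod - 1.0f;       /* gives 2^-11 = 0.00048828125 */
    float fused        = fmaf(a, a, -1.0f); /* single rounding: gives 2^-11 + 2^-24 */

    printf("MUL then ADD: %.10g\n", mul_then_add);  /* 0.00048828125    */
    printf("FMA:          %.10g\n", fused);         /* ~0.0004883408546 */
    return 0;
}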

For this to show you'd need a shader that computes this stuff more than once, first time with MUL then ADD, later with FMA. Then do some operation on those two results, which would normally show an error only if they were computed differently but assumed to be the same. But I'd expect the shader compiler to see the same result being computed twice and simply replace the second computation with the result of the first, rather than issue an FMA.

You might argue that dependent texture addressing computations might show a very slight difference, a bias, I suppose. A very large texture would magnify any difference in the precision of MUL then ADD versus FMA, I guess (if a lot of them were performed, to compound errors).

The only other thing I can think of is that a vertex shader does MUL then ADD but a pixel shader uses FMA. If the programmer decides to limit the number of attributes exported by a vertex shader (to save bandwidth, or to avoid exporting too many attributes), e.g. by making one attribute dependent upon others and having the pixel shader re-compute that value, there might be a systematic error in the visual result, I suppose.

Variance shadow mapping and its kin are sensitive to precision issues (despite using fp32) and developers use some kind of biasing to minimise artefacts. I get the impression these things are hand-tuned and perhaps there's some artist discretion involved. FMA in the computation might result in subtly different results, enough that someone would want to tweak the bias. Erm...

My initial thought was much simpler than that - i.e. efficiency improvements aside, could Fermi hope to challenge HD5970's nominal bandwidth with current GDDR5 modules on a 384-bit bus? At first I was skeptical because I assumed much higher mem clocks for HD 5970.
I have to admit I don't consider HD5970 a meaningful competitor (no matter what AMD thinks, AFR and driver yuk) so my perspective is purely about competing with HD5870, GTX285 and GTX295 (GTX295 only because of what happened to GTX280 in comparison with 9800GX2).

I'm a bit less skeptical now after realizing it only goes up to 256GB/s. But that's probably a useless comparison anyway given what we've seen with theoretical bandwidth numbers being essentially meaningless as a performance indicator.
Well, I've long said that HD4870 (and HD4890 by implication) have way more bandwidth than they need. AMD admits as much in the HD5870 slides. So bandwidth comparisons are up-ended by starting with "the wrong baseline".

Separately, give HD5870 something to get its teeth into (Arma 2 is good) and it'll happily show very significant scaling. Stalker Clear Sky shows promise too:

http://www.computerbase.de/artikel/...adeon_hd_5970/15/#abschnitt_stalker_clear_sky

Comparisons on efficiency are difficult and it really comes down to how you want to spin it. HD5870 has only about 25% more bandwidth than HD4890 yet manages to outrun it by 60-70% on average. That's great when looking at bandwidth. But what about the fact that 60-70% is lower than the 100% increase in texturing and shading resources? Is it bandwidth efficient, or simply inefficient in those other areas?
Once you've normalised for a more meaningful, i.e. considerably lower, bandwidth on HD4890, games look more bandwidth-limited on HD5870. Additionally overclocking results seem to show a degree of bandwidth sensitivity in HD5870. To be honest it's a subject I haven't delved into, partly because immature drivers get in the way. e.g. the revised interpolation scheme theoretically has a significant knock-on effect just in terms of the ALUs, let alone TUs and bandwidth.
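
To put rough numbers on the "spin" point above: taking the midpoint, ~65% more performance from ~25% more bandwidth works out to about 1.65/1.25 ≈ 1.3x the performance per GB/s, while the same ~65% from ~100% more texturing/shading resources is about 1.65/2.0 ≈ 0.83x the performance per unit. Both statements describe the same chip, which is why the efficiency argument can be run either way.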

You could do a similar exercise for GT200 and G92 and claim GT200 was really efficient at using its texture or shader units because the performance gain was much higher than the theoretical increase. Yet everybody still pans it for inefficiency, no?
NVidia did make some significant improvements in GT200. R600 flattered G80, to a degree...

I've been wondering how things would have been if HD4870 had been made with 32 RBEs instead of 16. That spare die space that was assigned to 2 extra cores (about 8% of the die) would have been enough for doubling the RBEs (~4% of the die for 16), I reckon. For the die space I think you could argue HD4870 was pretty wasteful. But since the die size "miss" was pretty big I guess we have to accept there's a lot more opportunism in the final die when the chip design undergoes radical change.

Interestingly the GF100 boards that we've seen so far seem not to have an NVIO chip. Could that reflect the fact that the implementation is more measured than G80 or GT200, i.e. NVidia has a much clearer idea of how the chip's going to pan out? Or is that a side effect of 40nm woes, allowing more time? Or, could it be a reflection of CUDA customers wanting video output?...

Yeah but those more egregious examples don't support the argument for using 8xAA as an "average case" :) Of course there are very valid reasons why you think 8xAA should be used in reviews (why shouldn't it?) but it'll be very hard to promote that as a typical scenario. In other words, if you had to choose a single setting that gave you an idea of the general performance standings of the various architectures and products based on them would you choose 8xAA?
For enthusiast cards, undoubtedly.

The first goal is achievable I think, the latter not so much. There is no such thing as de-facto settings in a world where everybody has different tastes and more importantly, different monitors - this is the main reason why I find Hardocp's approach particularly useless.
I like their approach - between the formal reviews and the game-specific card tests I think they do the best job in communicating the real value of cards to gamers. I also think X-bit's reviews, with their bias towards brutal settings, give a good impression of how a card performs. Most English-language sites paint a rosy picture in my view, meaning that they only provide some data that can be used in trying to assess the comparative performance of cards. Usually they fail at that. And then there's the shenanigans with varying driver versions for each IHV within a review, which mucks things up.

To be honest, these days the data is so bad at most sites I get bored rooting around trying to find something to use when trying to decide how things are scaling. But then, evaluations based solely on averages aren't much use either...

Jawed
 
Really weird seeing Rys write for TechReport. I love TechReport, but it's just weird. :p
 
When does the statute of limitations expire on NV30? Or is there no such thing in this industry? :LOL:

No such thing. You were there, you remember the whole thing, the hype, the promises, the driver cheating, the Futuremark assassination that spun off of it, the general chaos and FUD that Nvidia spun to hide their NV30 problems ("24 bits are not natural" while they were secretly changing shaders to PP, etc), the dustbuster that they joked about afterwards because their customers were dumb enough to buy it... It was a historical time in the industry, and we've not seen anything like it before or after.
 
Really weird seeing Rys write for TechReport. I love TechReport, but it's just weird. :p

Yeah I know, I didn't even realize it was him till I read the whole article and saw "Ryszard" replying in the comments section :) I just hope it wasn't a puff piece and that the speculation was based on reliable info.

No such thing. You were there, you remember the whole thing, the hype, the promises, the driver cheating, the Futuremark assassination that spun off of it, the general chaos and FUD that Nvidia spun to hide their NV30 problems ("24 bits are not natural" while they were secretly changing shaders to PP, etc), the dustbuster that they joked about afterwards because their customers were dumb enough to buy it... It was a historical time in the industry, and we've not seen anything like it before or after.

Actually I wasn't there....had a 4200 Ti then 9800 Pro at the time. I learned everything I know about NV30 from WaltC years afterward :LOL: The point isn't that it wasn't a bad product, but having it raised over and over by the same people is far more annoying than the dustbuster ever was.
 
Yeah I know, I didn't even realize it was him till I read the whole article and saw "Ryszard" replying in the comments section :) I just hope it wasn't a puff piece and that the speculation was based on reliable info.



Actually I wasn't there....had a 9800 Pro at the time. I learned everything I know about NV30 from WaltC :LOL:

Right, I had to look at the "writer" after I was done reading it. I knew Rys was working on something. But I was just not expecting it to be at TechReport.

Chris
 
Actually I wasn't there....had a 4200 Ti then 9800 Pro at the time. I learned everything I know about NV30 from WaltC years afterward :LOL: The point isn't that it wasn't a bad product, but having it raised over and over by the same people is far more annoying than the dustbuster ever was.

Ahh, well, maybe you'd think differently if you'd seen the waves it made in the industry at the time. There was a lot of industry stuff going on that thankfully hasn't been repeated since.
 
The order of computation can have a dramatic effect on the result, the classic example being something like 200-2*(100+0.000000001) in single-precision. FMA versus MUL then ADD alludes to a similar kind of ordering issue, one choice resulting in the use of FMA, the other path resulting in MUL then ADD. So, in scientific computing the programmer would have to be very careful to manage how the PTX generated by CUDA gets translated into hardware instructions. Gulp. Presumably there are hints or maybe explicit PTX instructions. Even with MAD there might be different rounding modes or other fiddlesome details. (ATI has two different MUL instructions, for example.)



In graphics MAD is effectively MUL then ADD, so while there are three possible sequences of computation:
  • MAD
  • MUL then ADD
  • FMA
there's really only MUL then ADD, or FMA to contemplate. If the programmer writes MUL then ADD, the compiler for an older GPU would usually just make that MAD. If the new GPU's compiler does FMA (and there is no MAD) the result will be different but I don't see how that's going to show.

For this to show you'd need a shader that computes this stuff more than once, first time with MUL then ADD, later with FMA. Then do some operation on those two results, which would normally show an error only if they were computed differently but assumed to be the same. But I'd expect the shader compiler to see the same result being computed twice and simply replace the second computation with the result of the first, rather than issue an FMA.

You might argue that dependent texture addressing computations might show a very slight difference, a bias, I suppose. A very large texture would magnify any difference in the precision of MUL then ADD versus FMA, I guess (if a lot of them were performed, to compound errors).

The only other thing I can think of is that a vertex shader does MUL then ADD but a pixel shader uses FMA. If the programmer decides to limit the number of attributes exported by a vertex shader (to save bandwidth, or to avoid exporting too many attributes), e.g. by making one attribute dependent upon others and having the pixel shader re-compute that value, there might be a systematic error in the visual result, I suppose.

Variance shadow mapping and its kin are sensitive to precision issues (despite using fp32) and developers use some kind of biasing to minimise artefacts. I get the impression these things are hand-tuned and perhaps there's some artist discretion involved. FMA in the computation might result in subtly different results, enough that someone would want to tweak the bias. Erm...

So you think that, rather than splitting the MAD into 2 different ops to ensure the same result, Fermi would ignore the difference and try to calculate everything as an FMA, unless the programmer forces the split (which is actually also what Rys is saying in the TechReport article)?

Coming back to the article, don't you think that maybe Rys is being too bullish on the performance of GF100? I mean, 2x a GTX285 would mean much quicker than an HD5970. Too much, in my opinion... I would consider being on par with R800 quite an achievement, indeed. I expect something less, actually, something like between RV870 and R800 performance.
 
Minor aside: we've partnered with TR for a long time now, and that kind of content isn't what Scott usually publishes, but he's somewhat keen to, at least as an experiment. So we thought we'd publish the B3D piece there to see how it works in a TR setting, before it gets published here too as normal. So it'll go up on the site in due time, later this week (oh, and I'll even dig the missing X1950 piece out at the same time!).

And yeah, I'm saying that Fermi will run a MADD as an FMA in graphics mode unless the programmer specifies otherwise (which implies programmer control) because they want the old computational accuracy.

Clocks and perf wise I'm bullish, absolutely (and I don't put much stock into the SC'09 data). I could speculate why, but why feed that monster. If I'm wrong then I'll wear an "I love Groo!" t-shirt for a week :LOL:
 
So you think that, rather than splitting the MAD into 2 different ops to ensure the same result, Fermi would ignore the difference and try to calculate everything as an FMA, unless the programmer forces the split (which is actually also what Rys is saying in the TechReport article)?

Coming back to the article, don't you think that maybe Rys is being too bullish on the performance of GF100? I mean, 2x a GTX285 would mean much quicker than an HD5970. Too much, in my opinion... I would consider being on par with R800 quite an achievement, indeed. I expect something less, actually, something like between RV870 and R800 performance.

The HD 5970 is actually a bit faster than a SLI of GTX 285s...

Minor aside: we've partnered with TR for a long time now, and that kind of content isn't what Scott usually publishes, but he's somewhat keen to, at least as an experiment. So we thought we'd publish the B3D piece there to see how it works in a TR setting, before it gets published here too as normal. So it'll go up on the site in due time, later this week (oh, and I'll even dig the missing X1950 piece out at the same time!).

And yeah, I'm saying that Fermi will run a MADD as an FMA in graphics mode unless the programmer specifies otherwise (which implies programmer control) because they want the old computational accuracy.

Clocks and perf wise I'm bullish, absolutely (and I don't put much stock into the SC'09 data). I could speculate why, but why feed that monster. If I'm wrong then I'll wear an "I love Groo!" t-shirt for a week :LOL:

Now this is getting interesting :LOL:
 
So you think that, rather than splitting the MAD into 2 different ops to ensure the same result, Fermi would ignore the difference and try to calculate everything as an FMA, unless the programmer forces the split (which is actually also what Rys is saying in the TechReport article)?

Don't think ignoring it is an option on the table.

Coming back to the article, don't you think that maybe Rys is being too bullish on the performance of GF100? I mean, 2x a GTX285 would mean much quicker than an HD5970. Too much, in my opinion... I would consider being on par with R800 quite an achievement, indeed. I expect something less, actually, something like between RV870 and R800 performance.

I don't think he's overly bullish on performance given those specs. He might be overly bullish on the specs though. One remaining mystery is how much the new memory system alleviates demand on off-chip bandwidth. 4200MHz GDDR5 seems mighty conservative.

For this to show you'd need a shader that computes this stuff more than once, first time with MUL then ADD, later with FMA. Then do some operation on those two results, which would normally show an error only if they were computed differently but assumed to be the same. But I'd expect the shader compiler to see the same result being computed twice and simply replace the second computation with the result of the first, rather than issue an FMA.

But the compiler won't be able to catch that if the (re)calculation is done in a separate pass (or even a separate shader).

Once you've normalised for a more meaningful, i.e. considerably lower, bandwidth on HD4890, games look more bandwidth-limited on HD5870.

Isn't that shifting the goal-posts a bit? :) In terms of reviews I agree, the data is too rudimentary to draw any serious conclusions with but the best you can do is look at the scenarios that are relevant to you. For me that's 2560x1600 with whatever AA setting is actually playable!

The HD 5970 is actually a bit faster than a SLI of GTX 285s...

Link? TR says otherwise.
 