Hopefully, because if there are issues with this then it will be a big headache for the driver team.
The order of computation can have a dramatic effect on the result, the classic example being something like 200-2*(100+0.000000001) in single precision. FMA versus MUL then ADD raises a similar kind of ordering issue: one code path results in the use of FMA, the other in MUL then ADD, and the two round at different points. So, in scientific computing the programmer would have to be very careful about how the PTX generated by CUDA gets translated into hardware instructions. Gulp. Presumably there are hints, or maybe explicit PTX instructions. Even with MAD there might be different rounding modes or other fiddlesome details. (ATI has two different MUL instructions, for example.)
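A minimal sketch of the rounding difference, assuming CUDA's __fmul_rn/__fadd_rn/__fmaf_rn intrinsics (which, as I understand it, also serve as those "hints", since the compiler won't contract the explicitly-rounded MUL and ADD into an FMA). The inputs are mine, picked so the two paths visibly diverge; compile with nvcc:

#include <cstdio>

__global__ void compare(float a, float b, float c, float *out)
{
    // MUL then ADD: the product a*b is rounded to fp32 before the add.
    out[0] = __fadd_rn(__fmul_rn(a, b), c);
    // FMA: the product is kept at full precision, a single rounding at the end.
    out[1] = __fmaf_rn(a, b, c);
}

int main()
{
    // a*b = 1 - 2^-46, which rounds to exactly 1.0f on its own,
    // so MUL then ADD gives 0 while FMA keeps the tiny residual.
    float a = 1.0f + 1.1920929e-7f;   // 1 + 2^-23
    float b = 1.0f - 1.1920929e-7f;   // 1 - 2^-23
    float c = -1.0f;

    float *d_out, h_out[2];
    cudaMalloc(&d_out, 2 * sizeof(float));
    compare<<<1, 1>>>(a, b, c, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("MUL then ADD: %g\n", h_out[0]);  // expect 0
    printf("FMA:          %g\n", h_out[1]);  // expect about -1.42e-14 (-2^-46)
    cudaFree(d_out);
    return 0;
}

Same source, two different instruction sequences, two different answers - the ordering problem in miniature.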
In graphics MAD is effectively MUL then ADD, so while there are three possible sequences of computation (MAD, MUL then ADD, or FMA), there's really only MUL then ADD or FMA to contemplate. If the programmer writes MUL then ADD, the compiler for an older GPU would usually just make that MAD. If the new GPU's compiler does FMA (and there is no MAD) the result will be different, but I don't see how that's going to show.
For this to show you'd need a shader that computes this stuff more than once, first time with MUL then ADD, later with FMA. Then do some operation on those two results, which would normally show an error only if they were computed differently but assumed to be the same. But I'd expect the shader compiler to see the same result being computed twice and simply replace the second computation with the result of the first, rather than issue an FMA.
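Something like this contrived kernel is what I have in mind (hypothetical, just to make the point):

__global__ void twice(float a, float b, float c, float *out)
{
    float first  = a * b + c;   // the compiler is free to contract this into FMA
    float second = a * b + c;   // meant to be MUL then ADD, but...
    out[0] = first - second;    // only non-zero if the two were computed differently
}

...in practice I'd expect the second line to be common-subexpression-eliminated into the first, so out[0] is trivially zero and nothing ever shows.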
You might argue that dependent texture addressing computations might show a very slight difference, a bias, I suppose. A very large texture would magnify any difference in the precision of MUL then ADD versus FMA, I guess (if a lot of those computations were performed, compounding the errors).
The only other thing I can think of is that a vertex shader does MUL then ADD but a pixel shader uses FMA. If the programmer decides to limit the number of attributes exported by a vertex shader (to save bandwidth, or to avoid trying to export too many attributes), e.g. by making one attribute dependent upon others and then getting the pixel shader to re-compute that value, there might be a systematic error in the visual result, I suppose.
Variance shadow mapping and its kin are sensitive to precision issues (despite using fp32) and developers use some kind of biasing to minimise artefacts. I get the impression these things are hand-tuned and perhaps there's some artist discretion involved. FMA in the computation might result in subtly different results, enough that someone would want to tweak the bias. Erm...
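For reference, the biased test I'm thinking of looks roughly like this (the standard variance shadow mapping formulation as I understand it; the names and the min_variance knob are my own labels):

__device__ float vsm_visibility(float2 moments, float receiver_depth,
                                float min_variance /* the hand-tuned bias */)
{
    float m1 = moments.x;              // E[depth] from the shadow map
    float m2 = moments.y;              // E[depth^2]
    float variance = m2 - m1 * m1;     // cancellation-prone even in fp32
    variance = fmaxf(variance, min_variance);

    float d = receiver_depth - m1;
    float p_max = variance / (variance + d * d);   // Chebyshev upper bound
    return (receiver_depth <= m1) ? 1.0f : p_max;
}

The m2 - m1*m1 difference is exactly the sort of expression where FMA versus MUL then ADD lands on a slightly different value, so a bias tuned against one could plausibly need a nudge on the other.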
My initial thought was much simpler than that - i.e. efficiency improvements aside, could Fermi hope to challenge HD5970's nominal bandwidth with current GDDR5 modules on a 384-bit bus? At first I was skeptical because I assumed much higher memory clocks for HD5970.
I have to admit I don't consider HD5970 a meaningful competitor (no matter what AMD thinks, AFR and driver yuk) so my perspective is purely about competing with HD5870, GTX285 and GTX295 (GTX295 only because of what happened to GTX280 in comparison with 9800GX2).
I'm a bit less skeptical now after realizing it only goes up to 256GB/s. But that's probably a useless comparison anyway given what we've seen with theoretical bandwidth numbers being essentially meaningless as a performance indicator.
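Back-of-the-envelope (the 256GB/s figure is from above; the per-GPU split and the candidate data rates are assumptions just for illustration):

#include <cstdio>

// Peak bandwidth in GB/s: bus width in bytes times effective data rate per pin (Gbps).
static double peak_gbs(int bus_bits, double gbps_per_pin)
{
    return (bus_bits / 8.0) * gbps_per_pin;
}

int main()
{
    printf("HD5970, 2 x 256-bit @ 4.0 Gbps: %.0f GB/s\n", 2.0 * peak_gbs(256, 4.0)); // 256
    printf("384-bit @ 4.0 Gbps:             %.0f GB/s\n", peak_gbs(384, 4.0));       // 192
    printf("384-bit @ 5.0 Gbps:             %.0f GB/s\n", peak_gbs(384, 5.0));       // 240
    printf("rate needed for 256 GB/s:       %.2f Gbps\n", 256.0 / (384 / 8.0));      // ~5.33
    return 0;
}

So on a 384-bit bus you'd need roughly 5.3Gbps modules just to pull level on paper.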
Well, I've long said that HD4870 (and HD4890 by implication) have way more bandwidth than they need. AMD admits as such in the HD5870 slides. So bandwidth comparisons are up-ended by starting with "the wrong baseline".
Separately, give HD5870 something to get its teeth into (Arma 2 is good) and it'll happily show very significant scaling. Stalker Clear Sky shows promise too:
http://www.computerbase.de/artikel/...adeon_hd_5970/15/#abschnitt_stalker_clear_sky
Comparisons on efficiency are difficult and really come down to how you want to spin it. HD5870 has only about 25% more bandwidth than HD4890 yet manages to outrun it by 60-70% on average. That's great when looking at bandwidth. But what about the fact that 60-70% is lower than the 100% increase in texturing and shading resources? Is it bandwidth efficient or simply inefficient in those other areas?
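Putting that in crude numbers (only the figures above, with ~65% as the midpoint of 60-70%):

#include <cstdio>

int main()
{
    double perf_gain = 1.65;   // HD4890 -> HD5870, ~60-70% faster on average
    double bw_gain   = 1.25;   // ~25% more bandwidth
    double unit_gain = 2.00;   // 100% more texturing/shading resources

    printf("perf per unit of bandwidth: %.2fx\n", perf_gain / bw_gain);   // ~1.32x - looks great
    printf("perf per ALU/TU:            %.2fx\n", perf_gain / unit_gain); // ~0.83x - looks poor
    return 0;
}

Same data, two opposite headlines, depending on which denominator you pick.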
Once you've normalised for a more meaningful, i.e. considerably lower, bandwidth on HD4890, games look more bandwidth-limited on HD5870. Additionally overclocking results seem to show a degree of bandwidth sensitivity in HD5870. To be honest it's a subject I haven't delved into, partly because immature drivers get in the way. e.g. the revised interpolation scheme theoretically has a significant knock-on effect just in terms of the ALUs, let alone TUs and bandwidth.
You could do a similar exercise for GT200 and G92 and claim GT200 was really efficient at using its texture or shader units because the performance gain was much higher than the theoretical increase. Yet everybody still pans it for inefficiency, no?
NVidia did make some significant improvements in GT200. R600 flattered G80, to a degree...
I've been wondering how things would have been if HD4870 had been made with 32 RBEs instead of 16. That spare die space that was assigned to 2 extra cores (about 8% of the die) would have been enough for doubling the RBEs (~4% of the die for 16), I reckon. For the die space I think you could argue HD4870 was pretty wasteful. But since the die size "miss" was pretty big I guess we have to accept there's a lot more opportunism in the final die when the chip design undergoes radical change.
Interestingly the GF100 boards that we've seen so far seem not to have an NVIO chip. Could that reflect the fact that the implementation is more measured than G80 or GT200, i.e. NVidia has a much clearer idea of how the chip's going to pan out? Or is that a side effect of 40nm woes, allowing more time? Or, could it be a reflection of CUDA customers wanting video output?...
Yeah, but those more egregious examples don't support the argument for using 8xAA as an "average case".
Of course there are very valid reasons why you think 8xAA should be used in reviews (why shouldn't it?), but it'll be very hard to promote that as a typical scenario. In other words, if you had to choose a single setting that gave you an idea of the general performance standings of the various architectures and the products based on them, would you choose 8xAA?
For enthusiast cards, undoubtedly.
The first goal is achievable I think, the latter not so much. There is no such thing as de facto settings in a world where everybody has different tastes and, more importantly, different monitors - this is the main reason why I find HardOCP's approach particularly useless.
I like their approach - between the formal reviews and the game-specific card tests I think they do the best job in communicating the real value of cards to gamers. I also think X-bit's reviews, with their bias towards brutal settings, give a good impression of how a card performs. Most English-language sites paint a rosy picture in my view, meaning that they only provide some data that can be used in trying to assess the comparative performance of cards. Usually they fail at that. And then there's the shenanigans with varying driver versions for each IHV within a review, which mucks things up.
To be honest, these days the data is so bad at most sites that I get bored rooting around trying to find something I can use to work out how things are scaling. But then, evaluations based solely on averages aren't much use either...
Jawed