NVIDIA Fermi: Architecture discussion

Hmmm, but going back to the MADD-to-FMA thing that Fermi will supposedly do...
How many cycles could it take? Is it something that needs to be repeated for each MADD?
e.g.: if you have a shader with 20 MADDs, will Fermi need to do 20 substitutions or just one at the beginning?
What kind of performance impact could it have?
 
Are you only looking at the one benchmark? Everything else is in the 50-170FPS range!

Battleforge, HAWX, DowII, RE5, Batman:AA, Left4Dead are "one" benchmark? They all have 285-SLI within a few frames of HD5970. And you're right - when you're in the 170fps range a few frames difference is meaningless.

Hmmm, but going back to the MADD-to-FMA thing that Fermi will supposedly do...
How many cycles could it take? Is it something that needs to be repeated for each MADD?
e.g.: if you have a shader with 20 MADDs, will Fermi need to do 20 substitutions or just one at the beginning?
What kind of performance impact could it have?

Not sure what you mean. This replacement will be done in the compiler, it won't take any cycles.
 
Not sure what you mean. This replacement will be done in the compiler, it won't take any cycles.

I thought it should work like this....

The thread processor takes the instruction, reads it, interprets it and sends it for execution based on low-level programming parameters. Drivers may take part in this by modifying how the instruction itself is executed. In this case they could "suggest" that the chip substitute an FMA for a MADD. The thread processor would then take the instruction, decode it, and after recognizing that it is a MADD, substitute an FMA for it and send it off for execution.

Is it possible that all of this has not even a small computational cost? :?:
 
No, it doesn't work that way. The hardware doesn't even know what a MADD is anymore, there's only FMA. Any replacement will be done in the compiler.
 
No, it doesn't work that way. The hardware doesn't even know what a MADD is anymore, there's only FMA. Any replacement will be done in the compiler.

OK.
But is it possible that doing this kind of processing has not even a small performance or efficiency impact? I find that impossible to believe... :oops:
I mean, experience tells us that when you rely too much on your compiler, or your drivers, or any other kind of complication to get things done, 9 times out of 10 you're gonna fail miserably. This happened with NV30, it happened with R600, and it will continue to happen to any other GPU that pursues that kind of approach...
For example, I hope the rumours about the missing hardware tessellator aren't true, otherwise I fear we're gonna see another R600.
 
NV has been doing that compiler translation since NV30 (read the interview a couple of pages back), where Jen-Hsun insinuates that the bad NV30 performance was caused by the mediocre translation of DX9 to Cg.
If the translation process for DX10/11 is "cheap", then one can assume great performance.
 
OK.
But is it possible that doing this kind of processing has not even a small performance or efficiency impact? I find that impossible to believe... :oops:
I mean, experience tells us that when you rely too much on your compiler, or your drivers, or any other kind of complication to get things done, 9 times out of 10 you're gonna fail miserably. This happened with NV30, it happened with R600, and it will continue to happen to any other GPU that pursues that kind of approach...
NV30 was quite a different case, with its fast fx12(?) and slow fp16 (and even slower fp32) support. A complicated beast: it was near impossible to figure out in any generic way what precision shaders really needed (since the spec called for at least fp24) and how best to map that to the hardware.
But really, replacing MAD with FMA should be trivial, as long as you don't care that the result could be slightly different. That's not really even an optimization task; those are far more complicated.
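
Just to underline how trivial that rewrite is, here's a toy sketch in C of what such a compile-time pass amounts to. The instruction format is invented purely for illustration - the point is only that the opcode swap happens once in the compiler, so it costs the GPU nothing at run time:

Code:
/* Toy sketch of a compile-time MAD -> FMA lowering pass over a made-up
 * instruction list. The "IR" here is invented for illustration only. */
#include <stdio.h>

typedef enum { OP_MUL, OP_ADD, OP_MAD, OP_FMA } opcode;

typedef struct {
    opcode op;
    int dst, src0, src1, src2; /* register indices; src2 = -1 if unused */
} instr;

static void lower_mad_to_fma(instr *prog, int n)
{
    for (int i = 0; i < n; ++i)
        if (prog[i].op == OP_MAD)
            prog[i].op = OP_FMA; /* same operands, single rounding instead of two */
}

int main(void)
{
    instr prog[] = {
        { OP_MAD, 0, 1, 2, 3 },
        { OP_ADD, 4, 0, 5, -1 },
        { OP_MAD, 6, 4, 7, 8 },
    };
    int n = (int)(sizeof prog / sizeof prog[0]);

    lower_mad_to_fma(prog, n);           /* done once, at compile time */
    for (int i = 0; i < n; ++i)
        printf("instr %d: opcode %d\n", i, prog[i].op);
    return 0;
}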
 
I mean, experience tells us that when you rely too much on your compiler, or your drivers, or any other kind of complication to get things done, 9 times out of 10 you're gonna fail miserably. This happened with NV30, it happened with R600, and it will continue to happen to any other GPU that pursues that kind of approach...

The compiler optimizations for NV30 and R600 aren't the same thing as what we're talking about here. This is just replacing an instruction with a higher precision version. There's no performance implication for that trivial replacement.

It only gets tricky when you have to detect cases like the one Jawed pointed out, where the developer or compiler assumes that MUL+ADD=MAD. That translates to an assumption in Fermi that MUL+ADD=FMA, which could lead to invalid results.
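
To make the difference concrete, here's a minimal host-side C sketch (plain CPU code, nothing Fermi-specific; the inputs are hand-picked for illustration) that evaluates a*b + c once as a rounded MUL followed by an ADD and once with fmaf(), which rounds only once. The two results differ, which is exactly the kind of mismatch that could trip up code assuming MUL+ADD == MAD:

Code:
/* Build on a typical x86-64 toolchain with contraction disabled so the
 * compiler doesn't turn the first expression into an FMA itself, e.g.:
 *   gcc -O2 -ffp-contract=off fma_demo.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const float a = 1.0f + 1.0f / 4096.0f;    /* 1 + 2^-12 */
    const float b = a;
    const float c = -(1.0f + 1.0f / 2048.0f); /* -(1 + 2^-11) */

    float mad = a * b + c;         /* product rounded to float, then added  */
    float fused = fmaf(a, b, c);   /* single rounding of the exact a*b + c  */

    printf("MUL+ADD: %.9g\n", mad);   /* prints 0           */
    printf("FMA    : %.9g\n", fused); /* prints ~5.96e-08 (i.e. 2^-24) */
    return 0;
}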

Yet the performance difference in half of them is 18%-20%.

That's precisely my point Dave. HD5970 is only 20% faster than Nvidia's old stuff and the advantage is much lower than that in many cases. It all depends on GF100 of course but the current numbers indicate that we won't see a repeat of this generation where GTX295 was later than HD4870X2 and not much faster. It looks like GF100x2 will be much later but could be much faster as well.
 
That's precisely my point Dave. HD5970 is only 20% faster than Nvidia's old stuff and the advantage is much lower than that in many cases. It all depends on GF100 of course but the current numbers indicate that we won't see a repeat of this generation where GTX295 was later than HD4870X2 and not much faster. It looks like GF100x2 will be much later but could be much faster as well.

In that case it might not be going up against an HD5970 then, will it? It's pretty pointless to assume the leadership of a card that might very well be a year out.
 
That's precisely my point Dave. HD5970 is only 20% faster than Nvidia's old stuff and the advantage is much lower than that in many cases. It all depends on GF100 of course but the current numbers indicate that we won't see a repeat of this generation where GTX295 was later than HD4870X2 and not much faster. It looks like GF100x2 will be much later but could be much faster as well.
GTX 295 and 5970 are capped by the ecosystem of the power specification.
 
That's precisely my point Dave. HD5970 is only 20% faster than Nvidia's old stuff and the advantage is much lower than that in many cases. It all depends on GF100 of course but the current numbers indicate that we won't see a repeat of this generation where GTX295 was later than HD4870X2 and not much faster. It looks like GF100x2 will be much later but could be much faster as well.
I'm not sold on GF100x2 either yet. From the hints we've got so far, it certainly looks like GF100 will approach the 225W limit even more than the HD5870 does. Hence, any potential GF100x2 would also need some tuning (fewer units, lower voltage, lower clocks), unless nvidia is willing to go with two 8-pin connectors (and if they did that, I guess AMD could put out such a card as well). Plus they'd run into similar scaling issues as the HD5970 does (though, granted, SLI still seems to scale better on average than Crossfire for some odd reason). According to the hardware.fr numbers, the HD5970 also isn't quite as fast as it should be - not sure what's going on there though, and whether drivers will be able to fix it.
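
For reference, the rough budget arithmetic behind those connector limits, assuming the usual PCIe figures (75W from the slot, 75W per 6-pin, 150W per 8-pin - spec ceilings, not actual draw):

Code:
/* Rough power-budget arithmetic for the connector configurations discussed
 * above, using the standard PCIe limits per source. */
#include <stdio.h>

int main(void)
{
    const int slot = 75, six_pin = 75, eight_pin = 150;

    printf("two 6-pin (HD5870-style) : %d W\n", slot + 2 * six_pin);         /* 225 W */
    printf("6-pin + 8-pin            : %d W\n", slot + six_pin + eight_pin); /* 300 W */
    printf("two 8-pin                : %d W\n", slot + 2 * eight_pin);       /* 375 W */
    return 0;
}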
 
WRT the FMA stuff: I'm taking bets now on when we'll see the first game developers specifically calling for this "superior precision" in their games - motivated by whatever... ;)

And then, as a follow-up, who will launch the first attack on the ppt-front against his opponent.

2010 will be fun!

(and if they did that, I guess AMD could put out such a card as well).
The reference board already has provisions for 2x8 pin.
 
But the compiler won't be able to catch that if the (re)calculation is done in a separate pass (or even a separate shader).
Yeah, that's why I included the case of vertex shader and then pixel shader, with one being MUL then ADD and the other being FMA.

I think it needs someone who has code with some uber-dependency on precision to speak up! This is all very theoretical, I'm struggling to think of a scenario where this doesn't just end up looking like noise.

Isn't that shifting the goal-posts a bit? :)
I can't see how.

Anyway, I've thought of a work-around for HD5870 scaling/bandwidth-sensitivity questions that makes HD4890 irrelevant. It's bloody simple, compare it with HD5770!

http://www.computerbase.de/artikel/...ire/19/#abschnitt_performancerating_qualitaet

Here you can see that as the rendering workload increases the performance margin for HD5870 goes from 71% up to 82% and then falls back to 81%, while the theoretical margin is 100% on every single parameter: texture rate, fillrate, GB/s and FLOPS.

So, erm, how much is drivers? How much is CPU limitation?

The results here:

http://www.computerbase.de/artikel/...970/18/#abschnitt_performancerating_qualitaet

are different, some driver differences I guess (more-mature HD5770 drivers?). Slightly less scaling in this case.

Since HD5770 is generally accepted as being bandwidth limited (9% more bandwidth than GTS250, with 12-47% more performance, or 26-54% more performance than HD4850 with 21% more bandwidth), HD5870 is too. But HD5870 is more likely to be CPU-limited which will have a relatively larger effect on throughput (and actually make it look slightly less bandwidth-limited than HD5770).
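
As a quick back-of-the-envelope check, here's how much of the theoretical 100% margin the 71/82/81% figures above actually realise (nothing beyond the percentages already quoted):

Code:
/* Scaling-efficiency check: observed HD5870-over-HD5770 margin versus the
 * 100% margin on paper, using only the figures cited in this post. */
#include <stdio.h>

int main(void)
{
    const double theoretical = 100.0;               /* % margin on paper */
    const double observed[] = { 71.0, 82.0, 81.0 }; /* % margin measured */
    const int n = (int)(sizeof observed / sizeof observed[0]);

    for (int i = 0; i < n; ++i)
        printf("+%g%% observed -> %.0f%% of the theoretical margin realised\n",
               observed[i], 100.0 * observed[i] / theoretical);
    return 0;
}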

In terms of reviews I agree, the data is too rudimentary to draw any serious conclusions with but the best you can do is look at the scenarios that are relevant to you. For me that's 2560x1600 with whatever AA setting is actually playable!
Do you ever use >4xMSAA on your GTX285?

Jawed
 
In that case it might not be going up against an HD5970 then, will it? It's pretty pointless to assume the leadership of a card that might very well be a year out.

It's just as pointless to assume that it will be a year out. Is it so hard to accept that it's not running away from last generation parts as much as it should? The numbers are there.

GTX 295 and 5970 are capped by the ecosystem of the power specification.

Of course, but I'm not sure what bearing that has on the performance of a 300W Fermi-based part.

I'm not sold on GF100x2 either yet.

Neither am I. I'm not sold on it being much faster than HD5870 even. But you can either go on reasonable assumptions or you can do like neliz and assume the worst possible outcome in all aspects - timing, performance, power consumption etc.
 
WRT the FMA stuff: I'm taking bets now on when we'll see the first game developers specifically calling for this "superior precision" in their games - motivated by whatever... ;)
In graphics mode they'll get it for free unless they ask for it to be switched off. It'll be the default. It should be clear from the article that running the old MUL + ADD is two clocks, sorry if it's not.
 
GTX 295 and 5970 are capped by the ecosystem of the power specification.

Not only electrical power - these cards are also so powerful that they need top-notch computers in order to spread their wings. :)

Dave, when the 5970 is in a 16x PCI-e slot, do both GPUs 'fight' for all 16 lanes or do they split them 8 and 8 (if that's possible)? Thanks...
 
http://www.computerbase.de/artikel/...ire/19/#abschnitt_performancerating_qualitaet

Here you can see that as the rendering workload increases the performance margin for HD5870 goes from 71% up to 82% and then falls back to 81%, while the theoretical margin is 100% on every single parameter: texture rate, fillrate, GB/s and FLOPS.

So, erm, how much is drivers? How much is CPU limitation?

Or how much is due to the capacity of other internal blocks that aren't revealed by fillrate, flops or bandwidth numbers?

Do you ever use >4xMSAA on your GTX285?

Sure, I even use SSAA wherever I can. But good luck doing so on anything recent.

-------------------------------------------------------------------------------------------------------------------------

In other news, Fudo is back on the January bandwagon.

http://www.fudzilla.com/content/view/16544/1/

We have also heard that the launch won't take place at CES 2010, which starts on January 7th and ends on January 10th. Instead, Nvidia might launch it at some later point, but nevertheless still sometime during the month.

The April rumours do seem more credible at this point though. If it were launching in January, surely there would be more ES cards floating around by now. It's almost December....
 