NVIDIA Fermi: Architecture discussion

Jawed · Oct 5, 2009

rpg.314 said:
I am thinking that DP in fermi is closely tied to int and spfp operations, which is why dp is unlikely to be deleted in gaming parts. They'll prolly disable the exceptions, rounding modes, denormals etc. (or some of them) in mid range parts, but retain some dp capability.

http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=7

Each core can execute a DP fused multiply-add (FMA) warp in two cycles by using all the FPUs across both pipelines

This implies that all "32 FPU cores" work together to produce 32 DP-FMAs in two cycles. It seems to imply that INT is not used for DP, and that each pipeline produces 16 FMAs in two cycles. But the instruction is despatched to both pipelines, from a single warp.
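A quick back-of-envelope check of that reading, as a sketch only. The 32-thread warp and the two-cycle figure come from the quoted article; the total of 512 FPUs (and hence 16 cores of 32 FPUs each) is an assumption taken from the "512 cores" figure mentioned elsewhere in this thread.

```python
# Back-of-envelope DP throughput from the quoted article's figures.
WARP_SIZE = 32           # threads per warp
CYCLES_PER_DP_WARP = 2   # "a DP FMA warp in two cycles"
FPUS_PER_CORE = 32       # "32 FPU cores" across both pipelines
TOTAL_FPUS = 512         # assumption: the '512 cores' figure from the thread


def dp_fma_per_cycle(warp_size=WARP_SIZE, cycles=CYCLES_PER_DP_WARP):
    """DP FMAs retired per cycle by one core if a warp takes `cycles` cycles."""
    return warp_size / cycles


cores = TOTAL_FPUS // FPUS_PER_CORE
per_core = dp_fma_per_cycle()
chip = per_core * cores

print(f"{per_core:.0f} DP FMA/cycle per core, {cores} cores, {chip:.0f} DP FMA/cycle total")
```

Under those assumptions each pipeline of 16 FPUs contributes half of the per-core rate, which matches the "16 FMAs in two cycles per pipeline" reading.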

WHY? you ask.

Int mul is 32 bits, and sp mul is 23 bits. If you use both of them, you get 55 bits of multiplication. dp needs 52 bits. Just about right for doing dp fma.
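To make the "combine narrower multipliers" idea concrete, here is a minimal sketch of building a wide significand product from narrower partial products. It is an illustration of the arithmetic only, not a claim about how GF100 is wired. It uses the IEEE 754 significand widths including the implicit leading bit (24 bits for SP, 53 for DP); the function names and the 32/21-bit split are hypothetical choices for the example.

```python
def split(x, lo_bits):
    """Split x into (high part, low part), the low part being lo_bits wide."""
    return x >> lo_bits, x & ((1 << lo_bits) - 1)


def mul53_from_partials(a, b, lo_bits=21):
    """Multiply two 53-bit significands using four narrower products.

    With lo_bits=21 the high parts are at most 32 bits wide, so every
    partial product fits a 32x32 multiplier, and the low x low product
    (21x21) would also fit a 24x24 one.
    """
    assert 0 <= a < (1 << 53) and 0 <= b < (1 << 53)
    a_hi, a_lo = split(a, lo_bits)
    b_hi, b_lo = split(b, lo_bits)
    hh = a_hi * b_hi   # up to 32 x 32 bits
    hl = a_hi * b_lo   # up to 32 x 21 bits
    lh = a_lo * b_hi   # up to 21 x 32 bits
    ll = a_lo * b_lo   # up to 21 x 21 bits
    # Shift each partial product to its weight and sum.
    return (hh << (2 * lo_bits)) + ((hl + lh) << lo_bits) + ll


a = (1 << 53) - 1          # largest 53-bit significand
b = (1 << 52) + 12345      # arbitrary normalised significand
print(mul53_from_partials(a, b) == a * b)
```

Note that this needs four partial products rather than two, so a shared 32-bit and 24-bit multiplier pair would need more than one pass (or extra adder hardware) to cover a full 53x53 product.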

There's also the question of the need to implement subnormals, which in GT200 seemingly carries the cost of a 168-bit adder. Does GF100 have an adder like that?

This could be a reason to have so much int power, and how they are managing to put so much dp in gaming parts (albeit high end) without going bankrupt. This could be why you can dual issue spfp and ints, but dp doesn't dual issue with anything else.

Except SP and INT can't be dual-issued within a pipeline.

Jawed

Voxilla · Oct 5, 2009

DegustatoR said:
Isn't Fermi more interesting to you as a developer?

Well, Fermi isn't available now. If you want to do DX11 now, the HD 5870 is the only option. From a DirectX 11 point of view I don't see why Fermi would be more interesting.

Last year I got the HD 4870 first too, but later switched to the GTX 280 on my main development machine, mainly because it had twice the amount of memory. I might be tempted to switch to Fermi this time too, but from what I know about it, I doubt it.

DegustatoR said:
Shouldn't you as a developer have hardware from all key vendors? (Not mentioning that vendors usually provide h/w to key developers for free and some time before this h/w becomes available to consumers.)

Personally I have to make a distinction between my professional work and my 'hobby' work. At work we develop for a non-gaming market and use the professional cards; these lag the consumer cards by quite some time, and we don't get them for free. For development they can be bought at reduced prices, at least from NVIDIA. We also use ATI cards for another product, so we use both brands. My development machine has both an ATI and an NVIDIA card, both usable under XP, but not under Vista. So yes, we develop and test for all kinds of cards.

For my hobby work I stick to one card, as with Vista no two brands can be used in the same machine; Windows 7 will apparently fix that. And yes, key game developers probably get free, pre-release hardware. I remember the days when I got a couple of ATI 9700 Pros and 8500s for free.

chavvdarrr · Oct 5, 2009

DegustatoR said:
Isn't Fermi more interesting to you as a developer? Shouldn't you as a developer have hardware from all key vendors? (Not mentioning that vendors usually provide h/w to key developers for free and some time before this h/w becomes available to consumers.)

Fermi is on sale?! Where?

Also, when writing for RV8xx, there is a big chance that the software will run on every DX11 card.
Writing for Fermi will mean it runs only on future NV hardware.
With NO date set for GF100 availability and NO date set for low/mid-range video cards based on the GF100 architecture... what makes you think developers should skip AMD hardware and wait for NV instead?!

Enforcer · Oct 5, 2009

For anyone wondering about DP ALU power/size/cost:
http://www.notur.no/news/inthenews/files/exascale_final_report_100208.pdf
Page 176:

DP-FMA FPU on aggressive voltage/frequency on 45nm node:
- 0.13 mm^2
- 2 GHz
- 0.09 Watt

One can estimate ~33 mm^2 and 23 W for 256 DP-FMA units in Fermi.
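The estimate is just the per-unit figures from the report scaled linearly to 256 units. A minimal sketch, assuming no sharing and no overhead between units, which is the simplification the post itself makes:

```python
# Per-unit figures quoted from the report (45nm, aggressive voltage/frequency).
AREA_PER_FPU_MM2 = 0.13
POWER_PER_FPU_W = 0.09
DP_FMA_UNITS = 256  # unit count used in the post

# Linear scaling: assumes units are independent and add no shared overhead.
total_area_mm2 = DP_FMA_UNITS * AREA_PER_FPU_MM2
total_power_w = DP_FMA_UNITS * POWER_PER_FPU_W

print(f"~{total_area_mm2:.1f} mm^2, ~{total_power_w:.1f} W")
```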

Creig · Oct 5, 2009

DegustatoR said:
High end videocards as Christmas presents? Wouldn't a PS3 be a better Christmas present from any pov?

I have received video cards from my wife as Christmas presents in the past. If I feel the need to upgrade and it's around that time of year, I expect it will be on my 'wish list' once again.

Ailuros · Oct 5, 2009

Enforcer said:
For anyone wondering about DP ALU power/size/cost:
http://www.notur.no/news/inthenews/files/exascale_final_report_100208.pdf
Page 176:

DP-FMA FPU on aggressive voltage/frequency on 45nm node:
- 0.13 mm^2
- 2 GHz
- 0.09 Watt

One can estimate ~33 mm^2 and 23 W for 256 DP-FMA units in Fermi.

Are those separate units in GF100 for such an estimate to make sense?

rpg.314 · Oct 5, 2009

Jawed said:
This implies that all "32 FPU cores" work together to produce 32 DP-FMAs in two cycles. It seems to imply that INT is not used for DP, and that each pipeline produces 16 FMAs in two cycles. But the instruction is despatched to both pipelines, from a single warp.

Except SP and INT can't be dual-issued within a pipeline.

Jawed

Which begs the question, why bother with so much int mul? It's not like they are gonna have terabytes of memory in Fermi, so address calculations don't need that much precision right now.

Enforcer · Oct 5, 2009

Ailuros said:
Are those separate units in GF100 for such an estimate to make sense?

DP multipliers are 53x53 and take up much more area than two single-precision 24x24 multipliers, so even if some of the logic can be shared, the estimate is quite fair.
There is an additional cost for re-use as well, for example:
http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf

The proposed MAF unit can perform either one double-precision or two parallel single-precision operations using about 18% more hardware than a conventional double-precision MAF unit and with 9% increase in delay.

P.S. My point is that gamers/consumers can easily tolerate a 5-10% increase in cost and power, and get DP and INT32 MUL functionality as a bonus.
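To put a rough number on "much more size", here is a sketch that compares partial-product counts. It assumes multiplier area scales with the product of the operand widths, which is a first-order approximation for an array multiplier and ignores adders, rounding, and alignment logic.

```python
def partial_product_bits(width_a, width_b):
    """First-order area proxy: number of AND-gate partial-product bits."""
    return width_a * width_b


dp_cells = partial_product_bits(53, 53)           # one DP significand multiplier
sp_pair_cells = 2 * partial_product_bits(24, 24)  # two SP significand multipliers
ratio = dp_cells / sp_pair_cells

print(f"DP: {dp_cells}, 2x SP: {sp_pair_cells}, ratio: {ratio:.2f}")
```

By this proxy one 53x53 array is well over twice the size of two 24x24 arrays, so the shared hardware is dominated by the DP requirement.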

rpg.314 · Oct 5, 2009

Maybe they are using the same 32-bit multipliers for both SP and int.

3dilettante · Oct 5, 2009

Aside from the Int units themselves, there is already much of the infrastructure in place.

Whatever the hardware cost of a full 32-bit multiplier, it still sits behind the already established register file, read ports, operand collectors, and two rather complex schedulers. If it shares hardware with DP floating point units that would be inactive anyway, so much the better.

The integer multiply is half-speed and DP FMA is half-speed, coincidence?

Sampsa · Oct 5, 2009

7 years ago NVIDIA tested packed GPUs from Fab like this:

We know Fermi is working silicon:

So I wonder what today's Fermi test board looks like. Does it really have a lot of wires sticking out, plus test modules, and look like a character from Terminator, as Fudo describes, or is it something similar to the 2002 test board?

Tim · Oct 5, 2009

Sampsa said:
7 years ago NVIDIA tested packed GPUs from Fab like this:

So I wonder what today's Fermi test board looks like. Does it really have a lot of wires sticking out, plus test modules, and look like a character from Terminator, as Fudo describes, or is it something similar to the 2002 test board?

If they are launching in November like Fudo claims (or even December), they should have working chips on production-level PCBs by now; it would be pretty unusual/dangerous to start mass production without producing prototype boards.

Edit: No reason to quote all the pictures.

Jawed · Oct 5, 2009

Enforcer said:
DP multipliers are 53x53 and take up much more area than two single-precision 24x24 multipliers, so even if some of the logic can be shared, the estimate is quite fair.
There is an additional cost for re-use as well, for example:
http://www.lirmm.fr/arith18/papers/libo-multipleprecisionmaf.pdf

P.S. My point is that gamers/consumers can easily tolerate a 5-10% increase in cost and power, and get DP and INT32 MUL functionality as a bonus.

Ooh, that's very useful, will have a proper read of that at some point.

It's worth noting that a multi-precision unit is an 83% overhead on 2 single-precision FMA units: 708590 um² versus 386834 um² (2*193417), using a 180nm library.

But this is swamped by the non-math portion of the cores. I'm guessing that the DP/SP ALUs amount to ~15% of each core. Assuming the die is 480mm², with about 46% being cores (judging from the die shot), that's 33mm² (your earlier estimate on 45nm) out of 220mm² of cores.
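The arithmetic in this post can be checked with a short sketch. The area figures are the ones quoted from the paper; the 480mm² die size, the 46% core share, and the ~15% ALU share are the guesses stated in the post, not measured values.

```python
# Areas from the paper (180nm library), in um^2.
MULTI_PRECISION_MAF = 708590
SINGLE_PRECISION_MAF = 193417

two_sp = 2 * SINGLE_PRECISION_MAF
overhead = MULTI_PRECISION_MAF / two_sp - 1   # extra area vs. two SP units

# Guesses from the post.
DIE_MM2 = 480
CORE_FRACTION = 0.46
ALU_FRACTION_OF_CORE = 0.15

core_area = DIE_MM2 * CORE_FRACTION
alu_area = core_area * ALU_FRACTION_OF_CORE

print(f"2x SP = {two_sp} um^2, overhead = {overhead:.0%}")
print(f"cores ~{core_area:.0f} mm^2, DP/SP ALUs ~{alu_area:.0f} mm^2")
```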

Jawed

apoppin · Oct 5, 2009

sethk said:
I'm wondering if the consumer part will have 512 'cores' enabled, especially at launch. I think in order to keep their volumes up, they may have to save the fully enabled cores for Fermi (dedicated GPGPU) parts, and have a smaller core count on even the top-end consumer part. If the performance lead is really as large as they were hinting, that is.

If they do lead with a 'GTX 380' type part with fewer than 512 cores, they may eventually release a consumer part with all cores enabled, say a 385, once yields are up, or when they have more competition, such as a 5890.

They said they designed it to be modular to keep all of their options open - just as in the past.

rjc · Oct 6, 2009

Hotboy at it again:

老黄拿的卡据说可以跑Ｘ１００００　还是频率不高的情况下．..３００的版本每次更新总有点惊喜啊
"Old Huang to take hold of card, it is said can run X10000, still frequency not high in these circumstances. 300 version to upgrade overall a little surprised!"

So, depending (a fair bit) on the system used, it is maybe a little above GTX 295 performance at the moment. It will likely get better if they can get the frequency up and with a more mature driver.

Translation note: 老 is more like "venerable", "experienced" or "well known", a sort of affectionate term, rather than "old", which sounds kind of harsh, doesn't it?

Tchock · Oct 6, 2009

Nah, everyone on chiphell is used to calling JHH by that nick

Florin · Oct 6, 2009

But what does that mean? Specifically, the 'it is said can run X10000, still frequency not high in these circumstances'?

trinibwoy · Oct 6, 2009

Presumably Vantage extreme score. Haven't heard anything about target clocks though so it's hard to say what's low.

dkanter · Oct 6, 2009

IMUL in fermi is 32b now, but the latency is 4 cycles (vs. 2 for IADD).

Also, ALUs and AGUs are different animals.

David

Ailuros · Oct 6, 2009

trinibwoy said:
Presumably Vantage extreme score. Haven't heard anything about target clocks though so it's hard to say what's low.

I should have used an online translator heh....so unless I've picked the wrong chinese translation it could be:

Lao Huang takes the card it is said may run X10000 or in the frequency not high situation. .300 editions each time renew always a little are pleasantly surprised. At that time the frequency was 495/1100/1000.

http://babelfish.yahoo.com/translat...ead.php?tid=56185&lp=zh_en&btnTrUrl=Translate
