Old 30-Sep-2009, 20:58   #1
Rys
Tiled
 
Join Date: Oct 2003
Location: Abbots Langley, UK
Posts: 2,779
Default NVIDIA Fermi: Architecture discussion

I don't have anything serious (no discrete article anyway that I can have ready) by now, so I think I'm going to post some details and just talk about it on the forums until I've got something formal ready. HD 5870 needs finishing first really anyway.

The big highlights (some is guesswork, NV won't talk about graphics transistors or clocks today, so beware I might be wrong in places there):

3.0B transistors @ TSMC, 40nm
2 x 16-way FMA SM, IEEE754-2008, 16 SMs
Each SM has four SFUs
384-bit GDDR5
~650/1700/4200MHz (base/hot/mem)
8 pixels/clock address and filter per SM
48 ROPs, 8Z/C per clock
64KiB L1/smem per SM (48/16 or 16/48 per clock config, not user programmable as far as I know, at least not yet)
Unified 768 KiB L2 (not partitioned now, so a write to L2 from any SM is visible to all others immediately)
Unified memory space (hardware TLB, 1TiB address, 40-bit if my brain's working)

Each SM dual-issues per clock on two half warps, for two clocks. Instructions can be mixed, so FP+INT, or FP+FP, or SFU+FP, etc. If DP instructions are running, nothing else runs. Although I don't think that's quite right, need to run some CUDA on a chip to test.

1.5K threads per SM in flight (1K in GT200), 32K FP32 registers per SM (up from 16K in GT200).
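
A quick sanity check of the numbers above, as a small Python sketch. The warp size of 32 is the usual CUDA value rather than something stated in this post, "1.5K" and "32K" are read as 1536 and 32768, and the variable names are purely illustrative.

```python
# Back-of-envelope check of the figures quoted above.
# Assumption: a warp is 32 threads (standard CUDA value, not stated in the post).
WARP_SIZE = 32

sms = 16
fma_lanes_per_sm = 2 * 16                 # "2 x 16-way FMA SM"
total_fma_lanes = sms * fma_lanes_per_sm

address_bits = 40
address_space_tib = 2 ** address_bits / 2 ** 40   # bytes converted to TiB

threads_per_sm = 1536                     # "1.5K threads per SM", read as 1536
warps_per_sm = threads_per_sm // WARP_SIZE

registers_per_sm = 32 * 1024              # "32K FP32 registers per SM"
regs_per_thread_full = registers_per_sm / threads_per_sm

print(total_fma_lanes)                    # 512
print(address_space_tib)                  # 1.0
print(warps_per_sm)                       # 48
print(round(regs_per_thread_full, 2))     # 21.33
```

So a 40-bit address does give 1 TiB, and with every thread slot occupied each thread would get roughly 21 registers, assuming the register file is split evenly.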

DP is half rate as mentioned, and it's an FMA too. All memories the chip talks to, from registers up, are ECC protected (potentially, nobody ships ECC GDDR5, and I think the chip will address 'PC' DDR3 for that in the end). Not sure what scheme or penalty.

New generation of PTX, CUDA 3.0. C++ in CUDA because of the unified address space.

Some new predication support, although it's really not clear how the hardware makes it happen. Seems you can predicate any instruction.

New atomic performance. Seems like it'll coalesce atomic ops in a warp and won't hit DRAM if the update fails, instead using L2 (GT200 replayed the transaction at DRAM hundreds of clocks later). The whitepaper explanation is wrong.

Seems RF per SM has enough ports (256) and support from the operand fetch hardware to sustain full FMA rate across the chip.

It can run multiple CUDA kernels now at the same time. Limit is 16 per chip (one per SM), but I think that'll be capped at 8.

I think the tessellator is a software pipe with very little hardware support, too.

Anyway, that's from memory, more later when I'm free.



If you want more, dkanter's ready with his (and it's excellent) here.
__________________
Mr. Popples!

Last edited by Rys; 01-Oct-2009 at 21:25. Reason: 8 pixels/clock address and filter
Old 30-Sep-2009, 20:59   #2
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,702
Default

http://www.nvidia.com/content/PDF/fe...Whitepaper.pdf
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Old 30-Sep-2009, 21:00   #3
Rys
Tiled
 
Join Date: Oct 2003
Location: Abbots Langley, UK
Posts: 2,779
Default

Can we keep this thread free of crap, please (other than my initial post!). Post links to the webcast so people can keep up, and other Fermi pieces from around the web when they pop up. Old thread got a bit silly at times, less of that if poss.
Old 30-Sep-2009, 21:18   #4
homerdog
hardly a Senior Member
 
Join Date: Jul 2008
Location: still camping with a mauler
Posts: 4,519
NVIDIA's Fermi: Architected for Tesla, 3 Billion Transistors in 2010

http://www.anandtech.com/video/showdoc.aspx?i=3651
Old 30-Sep-2009, 21:20   #5
McElvis
Member
 
Join Date: Apr 2002
Location: London
Posts: 269
Default

Some info from that anandtech article:

"The price is a valid concern. Fermi is a 40nm GPU just like RV870 but it has a 40% higher transistor count. Both are built at TSMC, so you can expect that Fermi will cost NVIDIA more to make than ATI's Radeon HD 5870.

Then timing is just as valid, because while Fermi currently exists on paper, it's not a product yet. Fermi is late. Clock speeds, configurations and price points have yet to be finalized. NVIDIA just recently got working chips back and it's going to be at least two months before I see the first samples. Widespread availability won't be until at least Q1 2010.

I asked two people at NVIDIA why Fermi is late; NVIDIA's VP of Product Marketing, Ujesh Desai and NVIDIA's VP of GPU Engineering, Jonah Alben. Ujesh responded: because designing GPUs this big is "fucking hard"."
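
The "40% higher transistor count" figure can be checked roughly as below. This is only a sketch: the 3.0B number is Rys's figure from the first post, while the 2.15B number for RV870 is AMD's published count and does not appear in this thread.

```python
# Rough check of the "40% higher transistor count" claim.
# Assumption: RV870 at 2.15 billion transistors (AMD's published figure,
# not stated in this thread); Fermi at 3.0 billion as per the first post.
fermi_transistors = 3.0e9
rv870_transistors = 2.15e9

extra_percent = (fermi_transistors / rv870_transistors - 1) * 100
print(round(extra_percent, 1))   # 39.5
```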
Old 30-Sep-2009, 21:52   #6
Bouncing Zabaglione Bros.
Regular
 
Join Date: Jun 2003
Posts: 6,359
Default

There's no hardware out there yet, and it's going to be expensive and damn difficult to make. I hope Nvidia don't have to compromise the design and give us something less than all these slides promise.

I don't want to raise the spectre of NV30, but the last time we saw this kind of forward publicity from Nvidia was when they were under pressure to produce an extremely ambitious design on a process that wasn't ready for such a complex design, that was running late, and something had to be put out as a spoiler against ATI's recently launched and very successful R300. We got a lot of promises that didn't translate into the finished product.

One thing that will work both for and against GF100 is that they seem not to be focussing on the gaming side of things, but are sidestepping into the GPGPU realm. Obviously AMD and Intel may not follow them there, as they have CPUs to sell, but for Nvidia, it might make sense to make this new chip something other than a CPU or a GPU and effectively carve out a new market for themselves.

The only problem will be if gamers no longer see this as a gaming product, and don't go for it. I'm not sure OEMs will want it at what's got to be a higher price than competing products, when it seems to be aimed at the GPGPU segment rather than gaming or general purpose use.

It seems to be an amazing product if it lives up to the hype, but in the same way a Bugatti Veyron is an amazing thing - but it's not one I am likely to buy except for its gaming/video applications. Where's all the gaming stuff or is Nvidia moving away from that market?
Old 30-Sep-2009, 22:03   #7
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,032
Default

__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 30-Sep-2009, 22:09   #8
ShaidarHaran
hardware monkey
 
Join Date: Mar 2007
Posts: 3,910
Default

So the 16 SMs are on the "north" and "south" sides of the chip w/PCI-e and GDDR5 interfaces along the borders, any guesses as to what's in the center? Especially the very center. Scheduling?
Old 30-Sep-2009, 22:14   #9
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,702
Default

Rys,

16 pixels/clock address and setup/SM? Are you sure 256 TMUs aren't way too much overkill for that kind of bandwidth?

Also when you state 8Z/8C samples /clock for the ROPs, I assume it's either/or as in today's GPUs?
Old 30-Sep-2009, 22:17   #10
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,032
Default

So, the addition of ECC to the GDDR interface would definitely reflect on the chip's perimeter occupancy -- 64+8 bits per channel, for a grand total of a 432-bit data bus!?
And it looks like there will be a third revision of the NVIO companion ASIC for the thing.
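
The 432-bit figure works out as follows. This is only a sketch of the arithmetic: it assumes the 384-bit interface is six 64-bit channels and that ECC would add 8 check bits per 64 data bits (the familiar 72-bit layout of ECC DIMMs). Whether Fermi actually does it that way is not known here.

```python
# Arithmetic behind the 432-bit guess.
# Assumptions: six 64-bit channels make up the 384-bit interface, and ECC
# adds 8 check bits per 64 data bits. Fermi's real ECC scheme is unknown.
DATA_BUS_BITS = 384
CHANNEL_DATA_BITS = 64
ECC_BITS_PER_CHANNEL = 8

channels = DATA_BUS_BITS // CHANNEL_DATA_BITS
total_bus_bits = channels * (CHANNEL_DATA_BITS + ECC_BITS_PER_CHANNEL)
overhead = ECC_BITS_PER_CHANNEL / CHANNEL_DATA_BITS

print(channels, total_bus_bits, overhead)   # 6 432 0.125
```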
Old 30-Sep-2009, 22:21   #11
Groo The Wanderer
Member
 
Join Date: Jan 2007
Posts: 334
Default

Quote:
Originally Posted by Rys View Post
The big highlights (some is guesswork, NV won't talk about graphics transistors or clocks today, so beware I might be wrong in places there):

3.0B transistors @ TSMC, 40nm
2 x 16-way FMA SM, IEEE754-2008, 16 SMs
Each SM has four SFUs
384-bit GDDR5
@1.5GHz/6GHz, but that may only be the current ones.

Quote:
Originally Posted by Rys View Post
~650/1700/4200MHz (base/hot/mem)
Target of 750, I doubt they will be able to do it. Then again, Dear Leader might be flogging the troops until morale improves, and is gunning for higher, but that will likely mean only more delays. See G200 - the world's first .933TF GPU for more on this.

Quote:
Originally Posted by Rys View Post
DP is half rate as mentioned, and it's an FMA too. All memories the chip talks to, from registers up, are ECC protected (potentially, nobody ships ECC GDDR5, and I think the chip will address 'PC' DDR3 for that in the end). Not sure what scheme or penalty.
2:1 ratio, the targets are 1.5TF SP, 768GF DP, but again with the caveat of clocks willing. I have reason to believe they won't be unless you are in the press.
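
For reference, those targets line up with the lane count and a 1.5GHz hot clock. A minimal sketch, assuming 512 FMA lanes (16 SMs x 32), an FMA counted as 2 FLOPs, and DP at half the SP rate; `peak_gflops` is just an illustrative helper name.

```python
# Peak throughput from lane count and hot clock.
# Assumptions: 512 FMA lanes (16 SMs x 32), one FMA = 2 FLOPs,
# DP at half the SP rate, as described in the thread.
FMA_LANES = 16 * 32
FLOPS_PER_FMA = 2

def peak_gflops(hot_clock_ghz, dp=False):
    """Theoretical peak in GFLOPS for a given hot clock in GHz."""
    sp = FMA_LANES * FLOPS_PER_FMA * hot_clock_ghz
    return sp / 2 if dp else sp

print(peak_gflops(1.5))            # 1536.0  (the "1.5TF SP" target)
print(peak_gflops(1.5, dp=True))   # 768.0   (the "768GF DP" target)
print(peak_gflops(1.7))            # 1740.8  (Rys's ~1700MHz hot clock guess)
```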

Quote:
Originally Posted by Rys View Post
I think the tessellator is a software pipe with very little hardware support, too.
Gosh, really? Who would have guessed?
http://www.theinquirer.net/inquirer/...0-architecture
Almost like I knew what I was talking about all those months ago. Who would have thought.

-Charlie
Old 30-Sep-2009, 22:27   #12
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,582
Default

Quote:
Originally Posted by ShaidarHaran View Post
So the 16 SMs are on the "north" and "south" sides of the chip w/PCI-e and GDDR5 interfaces along the borders, any guesses as to what's in the center?
4 humongous piles of L2.
Old 30-Sep-2009, 22:30   #13
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,934
Default

Nice copy-pasting. I mean by NVIDIA's synthesis team, not by you, of course.

Overall, I quite like the SM design - I was expecting the dual-MADD layout for a number of reasons (hint: GT200 didn't expose the full 1024 threads, so I knew it was going to jump to 1536, which meant 6 virtual RF read ports), although I'm surprised they've gone for dual-warp instead of dual-instruction; pleasantly surprised, mind you. I'm not pleasantly surprised by the fact 99% of your execution hardware is taking a nap when doing, say, basic integer operations which are quite important to me. Oh well - you can't please everybody! Even in terms of MUL/ADDs for graphics programs though, it seems rather inefficient.

The SMs, TMUs, and MC-linked blocks are all easy to notice on the die shot. In the bottom left of the central block lies all the 'unique' stuff, conveniently quite near to the PCI Express analogue. What I find interesting, however, is that the MC-linked block is so huge. Seems like a lot of formerly "central" functionality was moved to the MC-linked blocks; I wonder if that includes input assembly and all of its little friends later in the pipeline! (also I really should go on IRC sometime!)
Old 30-Sep-2009, 22:34   #14
jaredpace
Member
 
Join Date: Sep 2009
Posts: 157
Default

Old 30-Sep-2009, 22:41   #15
jaredpace
Member
 
Join Date: Sep 2009
Posts: 157
Default



fellix beat me to it
Old 30-Sep-2009, 22:43   #16
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 5,002
Default

It will be huge, more than 40% bigger than an HD 58xx. Just by the look of it, it doesn't look as tight as ATI's design.
__________________
Sebbbi about virtual texturing
The Law, by Frederic Bastiat
'The more corrupt the state, the more numerous the laws'.
- Tacitus
Old 30-Sep-2009, 23:02   #17
bowman
Member
 
Join Date: Apr 2008
Posts: 141
Default

Oh great, there's a die shot of this newfangled chip but AMD has yet to supply a die shot of Evergreen. Gaaah!

The presentation on NVIDIA's site reminds me of Intel's Nehalem and QPI presentations.

http://www.overclock.net/7295727-post8.html

Tesla AIB unveiled, don't know if it's functional or just a mockup to boost confidence though..
Old 30-Sep-2009, 23:05   #18
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,955
Default

Anyone seen any TMUs?

Jawed
__________________
Can it play WoW?
Old 30-Sep-2009, 23:07   #19
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,032
Default

Some fun at the perimeter:

Old 30-Sep-2009, 23:08   #20
Bob
Member
 
Join Date: Apr 2004
Posts: 421
Default

Quote:
Originally Posted by Arun
the fact 99% of your execution hardware is taking a nap when doing, say, basic integer operations
How so?
__________________
Vincent: G80 is designed for time to market, whereas the R600 is specialized in the rich feature.
Old 30-Sep-2009, 23:10   #21
Dave Baumann
Gamerscore Wh...
 
Join Date: Jan 2002
Posts: 13,588
Default

Quote:
Originally Posted by fellix View Post
Some fun at the perimeter:


Damn, must make a plea to the planners and engineers to put a "cookie monster" in our ASICs!
__________________
Radeon is Gaming
Tweet Tweet!
Old 30-Sep-2009, 23:10   #22
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 8,702
Default

Quote:
Originally Posted by Jawed View Post
Anyone seen any TMUs?

Jawed
I've already asked Rys where the 16 pixels/clock address & setup per SM come from, but I'm still waiting for his answer. I don't think the 16 load/store units have anything to do with it.
Old 30-Sep-2009, 23:12   #23
Jawed
Regular
 
Join Date: Oct 2004
Location: London
Posts: 9,955
Default

http://www.nvidia.com/object/pr_oakridge_093009.html

Quote:
SANTA CLARA, Calif. —Sep. 30, 2009—Oak Ridge National Laboratory (ORNL) announced plans today for a new supercomputer that will use NVIDIA®’s next generation CUDA™ GPU architecture, codenamed “Fermi”. Used to pursue research in areas such as energy and climate change, ORNL’s supercomputer is expected to be 10-times more powerful than today’s fastest supercomputer.

Jeff Nichols, ORNL associate lab director for Computing and Computational Sciences, joined NVIDIA co-founder and CEO Jen-Hsun Huang on stage during his keynote at NVIDIA’s GPU Technology Conference. He told the audience of 1,400 researchers and developers that “Fermi” would enable substantial scientific breakthroughs that would be impossible without the new technology.

“This would be the first co-processing architecture that Oak Ridge has deployed for open science, and we are extremely excited about the opportunities it creates to solve huge scientific challenges,” Nichols said. “With the help of NVIDIA technology, Oak Ridge proposes to create a computing platform that will deliver exascale computing within ten years.”

ORNL also announced it will be creating the Hybrid Multicore Consortium. The goals of this consortium are to work with the developers of major scientific codes to prepare those applications to run on the next generation of supercomputers built using GPUs.

“The first two generations of the CUDA GPU architecture enabled NVIDIA to make real in-roads into the scientific computing space, delivering dramatic performance increases across a broad spectrum of applications,” said Bill Dally, chief scientist at NVIDIA. “The ‘Fermi’ architecture is a true engine of science and with the support of national research facilities such as ORNL, the possibilities are endless.”
Groovy.

Jawed
Old 30-Sep-2009, 23:13   #24
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 3,032
Default

Quote:
Originally Posted by Dave Baumann View Post


Damn, must make a plea to the planners and engineers to put a "cookie monster" in our ASICs!
No, no.. the right words are: give me a wallpaper sized RV870 die shot, now!!!1!1one

On topic: Fermi board snapped ...sort of.
Last edited by fellix; 30-Sep-2009 at 23:18.
Old 30-Sep-2009, 23:34   #25
3dilettante
Regular
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 5,485
Default

The design put forth really hammers some low-hanging fruit that earlier Nvidia GPUs (and others) lacked.
The multiple kernels, the closed/semi-closed write/read loop, 1/2 DP throughput.

The other stuff is downright crazy to see: indirection, exceptions, IEEE compliance, ECC, simplified addressing.

The mapping of separate memory spaces to lie within the global address space is an elegant way to have the benefits of hardware peculiarity in a more specialized instance without having it impinge the general computation case.

I had sort of thought of a design using special page table bits that would allow hardware to route to special on-chip storage if enabled, and easily forgettable if not.
This isn't quite the same, but the idea of using the target memory location to delineate special things you want done with it is a rather nice touch.

The size of the chip shows the price of generality, though. FLOP density is not likely to be anywhere near Cypress, and I'd be curious to know if Larrabee's final clocks will mean even the x86 will have an advantage.

I don't know how it will fare in gaming, or how many other problems there may be, but I have to give Nvidia credit: this design took balls.

As far as DP is concerned, the quality of this implementation is enough to make Cypress appear as useful as its botanical namesake in HPC.

Physical and economic realities may intrude on this (it doesn't exist on a store shelf), but as a topic of discussion, I find this architecture much more interesting.
The posited tool sets and initiatives are such that this is the first time I've ever thought a GPU designer took serious computation seriously.
__________________
Dreaming of a .065 micron etch-a-sketch.
