NVIDIA Fermi: Architecture discussion

Rys

Graphics @ AMD
Moderator
Veteran
Supporter
I don't have anything serious ready right now (no discrete article I can finish in time, anyway), so I think I'm going to post some details and just talk about it on the forums until I've got something formal ready. HD 5870 needs finishing first really anyway.

The big highlights (some is guesswork, NV won't talk about graphics transistors or clocks today, so beware I might be wrong in places there):

3.0B transistors @ TSMC, 40nm
2 x 16-way FMA SM, IEEE754-2008, 16 SMs
Each SM has four SFUs
384-bit GDDR5
~650/1700/4200MHz (base/hot/mem)
8 pixels/clock address and filter per SM
48 ROPs, 8Z/C clock
64KiB L1/smem per SM (48/16 or 16/48 per clock config, not user programmable as far as I know, at least not yet)
Unified 768 KiB L2 (not partitioned now, so a write to L2 from any SM is visible to all others immediately)
Unified memory space (hardware TLB, 1TiB address, 40-bit if my brain's working)

Each SM dual-issues per clock on two half warps, for two clocks. Instructions can be mixed, so FP+INT, or FP+FP, or SFU+FP, etc. If DP instructions are running, nothing else runs. Although I don't think that's quite right, need to run some CUDA on a chip to test.
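To make those issue rules concrete, here's a toy Python sketch of the constraint as I understand it - up to two warps co-issued per scheduler clock, any mix of instruction classes allowed except DP, which runs alone. The `Warp` record and op names are invented for illustration; the real scheduler is obviously far more involved, and the DP restriction still needs testing on real silicon.

```python
from collections import namedtuple

# Hypothetical warp record: an id plus the instruction class it wants to issue.
Warp = namedtuple("Warp", ["id", "op"])

def issue(ready_warps):
    """Pick up to two warps to co-issue this scheduler clock.
    Mixes like FP+INT, FP+FP or SFU+FP are fine, but a DP
    instruction issues alone (nothing else runs alongside it)."""
    if not ready_warps:
        return []
    first = ready_warps[0]
    if first.op == "DP":
        return [first]                # DP blocks dual-issue entirely
    for other in ready_warps[1:]:
        if other.op != "DP":
            return [first, other]     # any non-DP pairing is allowed
    return [first]
```

So `issue([Warp(0, "FP"), Warp(1, "INT")])` co-issues both, while a DP warp at the head of the queue goes out on its own.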

1.5K threads per SM in flight (1K in GT200), 32K FP32 registers per SM (up from 16K in GT200).
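Quick back-of-the-envelope on what those limits imply, using the figures above (my numbers, so treat accordingly):

```python
threads_per_sm = 1536       # 1.5K threads in flight (up from 1K on GT200)
regs_per_sm = 32 * 1024     # 32K FP32 registers (up from 16K on GT200)

# At full occupancy that's ~21 registers per thread, a bit more
# headroom than GT200's 16K / 1024 = 16 registers per thread.
regs_per_thread = regs_per_sm // threads_per_sm
print(regs_per_thread)      # 21
```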

DP is half rate as mentioned, and it's a FMA too. All memories the chip talks to, from registers up, are ECC protected (potentially, nobody ships ECC GDDR5, and I think the chip will address 'PC' DDR3 for that in the end). Not sure what scheme or penalty.
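Nothing confirmed on the scheme, but the usual suspect for this sort of protection is SECDED (single-error-correct, double-error-detect). A minimal Python sketch of the idea using a toy extended Hamming(8,4) code - purely illustrative, not a claim about what the chip actually implements:

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (SECDED)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]  # codeword positions 1..7
    overall = 0
    for b in bits:
        overall ^= b                              # overall parity bit
    return bits + [overall]

def decode(code):
    """Return (data, status): correct single-bit errors, flag double-bit ones."""
    bits = code[:7]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]    # checks positions 1,3,5,7
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]    # checks positions 2,3,6,7
    s3 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]    # checks positions 4,5,6,7
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    parity_ok = (sum(code) % 2) == 0
    if syndrome and parity_ok:
        return None, "double-bit error detected"  # detectable, not correctable
    if syndrome:
        bits[syndrome - 1] ^= 1                   # syndrome points at the bad bit
    data = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return data, ("corrected" if syndrome else "ok")
```

The real thing would presumably be a (72,64)-style code over wider words, which is where the familiar 8 extra bits per 64 on ECC DIMMs come from.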

New generation of PTX, CUDA 3.0. C++ in CUDA because of the unified address space.

Some new predication support, although it's really not clear how the hardware makes it happen. Seems you can predicate any instruction.

New atomic performance. Seems like it'll coalesce atomic ops in a warp and won't hit DRAM if the update fails, instead using L2 (GT200 replayed the transaction at DRAM hundreds of clocks later). The whitepaper explanation is wrong.
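My reading of that behaviour, sketched in Python: combine the atomic requests from a warp's lanes first, then do one read-modify-write per distinct address at the L2, instead of serialising 32 trips to DRAM. The function and data layout here are invented for illustration only.

```python
from collections import Counter

def warp_atomic_add(memory, requests):
    """Coalesce per-lane atomicAdd requests within one warp.
    `requests` is a list of (address, value) pairs, one per lane;
    returns the number of memory transactions actually performed."""
    combined = Counter()
    for addr, val in requests:
        combined[addr] += val                       # intra-warp reduction
    for addr, total in combined.items():
        memory[addr] = memory.get(addr, 0) + total  # single RMW at the L2
    return len(combined)
```

A full warp hammering one counter then costs a single transaction rather than 32.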

Seems RF per SM has enough ports (256) and support from the operand fetch hardware to sustain full FMA rate across the chip.

It can run multiple CUDA kernels now at the same time. Limit is 16 per chip (one per SM), but I think that'll be capped at 8.

I think the tesselator is a software pipe with very little hardware support, too.

Anyway, that's from memory, more later when I'm free.



If you want more, dkanter's ready with his (and it's excellent) here.
 
Can we keep this thread free of crap, please (other than my initial post!). Post links to the webcast so people can keep up, and other Fermi pieces from around the web when they pop up. Old thread got a bit silly at times, less of that if poss.
 
Some info from that anandtech article:

"The price is a valid concern. Fermi is a 40nm GPU just like RV870 but it has a 40% higher transistor count. Both are built at TSMC, so you can expect that Fermi will cost NVIDIA more to make than ATI's Radeon HD 5870.

Then timing is just as valid, because while Fermi currently exists on paper, it's not a product yet. Fermi is late. Clock speeds, configurations and price points have yet to be finalized. NVIDIA just recently got working chips back and it's going to be at least two months before I see the first samples. Widespread availability won't be until at least Q1 2010.

I asked two people at NVIDIA why Fermi is late; NVIDIA's VP of Product Marketing, Ujesh Desai and NVIDIA's VP of GPU Engineering, Jonah Alben. Ujesh responded: because designing GPUs this big is "fucking hard"."
 
There's no hardware out there yet, and it's going to be expensive and damn difficult to make. I hope Nvidia don't have to compromise the design and give us something less than all these slides promise.

I don't want to raise the spectre of NV30, but the last time we saw this kind of forward publicity from Nvidia was when they were under pressure to produce an extremely ambitious design on a process that wasn't ready for such complexity. That design was running late, and something had to be put out as a spoiler against ATI's recently launched and very successful R300. We got a lot of promises that didn't translate into the finished product.

One thing that will work both for and against GF100 is that they seem not to be focussing on the gaming side of things, but are sidestepping into the GPGPU realm. Obviously AMD and Intel may not follow them there, as they have CPUs to sell, but for Nvidia, it might make sense to make this new chip something other than a CPU or a GPU and effectively carve out a new market for themselves.

The only problem will be if gamers no longer see this as a gaming product, and don't go for it. I'm not sure OEMs will want it at what's got to be a higher price than competing products, when it seems to be aimed at the GPGPU segment rather than gaming or general purpose use.

It seems to be an amazing product if it lives up to the hype, but in the same way a Bugatti Veyron is an amazing thing - it's not one I'm likely to buy except for its gaming/video applications. Where's all the gaming stuff, or is Nvidia moving away from that market?
 
(attached image: Fermi die shot - dieshot1s.jpg)
 
So the 16 SMs are on the "north" and "south" sides of the chip w/PCI-e and GDDR5 interfaces along the borders, any guesses as to what's in the center? Especially the very center. Scheduling?
 
Rys,

16 pixels/clock address and setup/SM? Are you sure 256 TMUs aren't way too much overkill for that kind of bandwidth?

Also when you state 8Z/8C samples/clock for the ROPs, I assume it's either/or as in today's GPUs?
 
So, the addition of ECC to the GDDR interface would definitely reflect on the chip's perimeter occupancy -- 64+8 bits per channel, for a grand total of a 432-bit data bus!?
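The arithmetic behind that figure, for anyone checking at home (the 8-bit sideband per 64-bit channel is an assumption, borrowed from how ECC DIMMs lay it out - nobody ships ECC GDDR5):

```python
channels = 6                   # 384-bit bus = 6 x 64-bit GDDR5 channels
data_bits, ecc_bits = 64, 8    # (72,64) SECDED-style sideband per channel
bus_width = channels * (data_bits + ecc_bits)
print(bus_width)               # 432
```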
And looks like there will be third revision of NVIO companion ASIC for the thing. :D
 
The big highlights (some is guesswork, NV won't talk about graphics transistors or clocks today, so beware I might be wrong in places there):

3.0B transistors @ TSMC, 40nm
2 x 16-way FMA SM, IEEE754-2008, 16 SMs
Each SM has four SFUs
384-bit GDDR5

@1.5GHz/6GHz, but that may only be the current ones.

~650/1700/4200MHz (base/hot/mem)

Target of 750, I doubt they will be able to do it. Then again, Dear Leader might be flogging the troops until morale improves, and is gunning for higher, but that will likely mean only more delays. See G200 - the world's first .933TF GPU - for more on this.

DP is half rate as mentioned, and it's a FMA too. All memories the chip talks to, from registers up, are ECC protected (potentially, nobody ships ECC GDDR5, and I think the chip will address 'PC' DDR3 for that in the end). Not sure what scheme or penalty.

2:1 ratio, the targets are 1.5TF SP, 768GF DP, but again with the caveat of clocks willing. I have reason to believe they won't be unless you are in the press.
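For what it's worth, those targets do line up with a 1.5GHz hot clock (my arithmetic, with the same clocks-willing caveat):

```python
cores = 16 * 2 * 16           # 16 SMs, each with two 16-wide FMA blocks
hot_clock_ghz = 1.5           # rumoured shader clock
sp_gflops = cores * 2 * hot_clock_ghz   # an FMA counts as 2 flops
dp_gflops = sp_gflops / 2               # half-rate DP
print(sp_gflops, dp_gflops)   # 1536.0 768.0
```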

I think the tesselator is a software pipe with very little hardware support, too.

Gosh, really? Who would have guessed?
http://www.theinquirer.net/inquirer/news/1137331/a-look-nvidia-gt300-architecture
Almost like I knew what I was talking about all those months ago. Who would have thought. :)

-Charlie
 
Nice copy-pasting. I mean by NVIDIA's synthesis team, not by you, of course.

Overall, I quite like the SM design - I was expecting the dual-MADD layout for a number of reasons (hint: GT200 didn't expose the full 1024 threads, so I knew it was going to jump to 1536, which meant 6 virtual RF read ports), although I'm surprised they've gone for dual-warp instead of dual-instruction; pleasantly surprised, mind you. I'm not pleasantly surprised by the fact 99% of your execution hardware is taking a nap when doing, say, basic integer operations which are quite important to me. Oh well - you can't please everybody! Even in terms of MUL/ADDs for graphics programs though, it seems rather inefficient.

The SMs, TMUs, and MC-linked blocks are all easy to notice on the die shot. In the bottom left of the central block lies all the 'unique' stuff, conveniently quite near to the PCI Express analogue. What I find interesting, however, is that the MC-linked block is so huge. Seems like a lot of formerly "central" functionality was moved to the MC-linked blocks; I wonder if that includes input assembly and all of its little friends later in the pipeline! (also I really should go on IRC sometime!)
 
It will be huge - more than 40% bigger than an HD 58xx. Just by the look of it, the design doesn't seem as tight as ATI's.
 
Oh great, there's a die shot of this newfangled chip but AMD has yet to supply a die shot of Evergreen. Gaaah! :devilish:

The presentation on NVIDIA's site reminds me of Intel's Nehalem and QPI presentations.

http://www.overclock.net/7295727-post8.html

Tesla AIB unveiled; don't know if it's functional or just a mockup to boost confidence, though...
 