NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Rys

    Rys Tiled
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    3,893
    Location:
    Beyond3D HQ
    I don't have anything serious ready yet (no discrete article, anyway), so I think I'm going to post some details and just talk about it on the forums until I've got something formal. HD 5870 needs finishing first really anyway.

    The big highlights (some is guesswork, NV won't talk about graphics transistors or clocks today, so beware I might be wrong in places there):

    3.0B transistors @ TSMC, 40nm
    2 x 16-way FMA SM, IEEE754-2008, 16 SMs
    Each SM has four SFUs
    384-bit GDDR5
    ~650/1700/4200MHz (base/hot/mem)
    8 pixels/clock address and filter per SM
    48 ROPs, 8 Z/C per clock
    64KiB L1/smem per SM (48/16 or 16/48 per clock config, not user programmable as far as I know, at least not yet)
    Unified 768 KiB L2 (not partitioned now, so a write to L2 from any SM is visible to all others immediately)
    Unified memory space (hardware TLB, 1TiB address, 40-bit if my brain's working)
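
    As a rough sanity check on the list above, the headline peaks fall out of simple arithmetic. This is only a sketch: the clocks are the post's own guesses, and reading the ~4200MHz memory figure as an effective per-pin transfer rate is an assumption made here, not something NVIDIA has stated.

```python
# Back-of-envelope figures from the spec list above.
# Clocks are the post's own guesses, so treat every result as provisional.
SM_COUNT = 16
FMA_LANES_PER_SM = 2 * 16      # "2 x 16-way FMA SM"
FLOPS_PER_FMA = 2              # one multiply plus one add
HOT_CLOCK_MHZ = 1700           # guessed shader ("hot") clock
BUS_WIDTH_BITS = 384
MEM_RATE_MTPS = 4200           # assumption: ~4200MHz read as effective MT/s per pin
TEX_PER_SM_PER_CLOCK = 8       # pixels/clock address and filter per SM
ADDRESS_BITS = 40

fma_lanes = SM_COUNT * FMA_LANES_PER_SM
peak_sp_gflops = fma_lanes * FLOPS_PER_FMA * HOT_CLOCK_MHZ / 1000
mem_bandwidth_gbs = (BUS_WIDTH_BITS // 8) * MEM_RATE_MTPS / 1000
tex_per_clock = SM_COUNT * TEX_PER_SM_PER_CLOCK
address_space_tib = 2 ** ADDRESS_BITS // 2 ** 40

print(f"FMA lanes:         {fma_lanes}")
print(f"Peak FP32:         {peak_sp_gflops:.1f} GFLOPS")
print(f"Memory bandwidth:  {mem_bandwidth_gbs:.1f} GB/s")
print(f"Texels per clock:  {tex_per_clock}")
print(f"Address space:     {address_space_tib} TiB")
```

    With those inputs the chip has 512 FMA lanes and the 40-bit address works out to exactly 1TiB, which matches the list.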

    Each SM dual-issues per clock on two half warps, for two clocks. Instructions can be mixed, so FP+INT, or FP+FP, or SFU+FP, etc. If DP instructions are running, nothing else runs. Although I don't think that's quite right, need to run some CUDA on a chip to test.

    1.5K threads per SM in flight (1K in GT200), 32K FP32 registers per SM (up from 16K in GT200).
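
    To put those occupancy numbers in per-thread terms, here is a small sketch. It assumes "1.5K" means 1536 threads, "32K"/"16K" mean 32768/16384 registers, and the usual CUDA warp size of 32.

```python
# Per-SM occupancy arithmetic, Fermi vs GT200, using the figures quoted above.
WARP_SIZE = 32  # standard CUDA warp size (assumed unchanged)

chips = {
    # name: (threads in flight per SM, FP32 registers per SM)
    "Fermi": (1536, 32 * 1024),
    "GT200": (1024, 16 * 1024),
}

summary = {}
for name, (threads, registers) in chips.items():
    warps = threads // WARP_SIZE
    regs_per_thread = registers / threads  # budget when every thread slot is used
    summary[name] = (warps, regs_per_thread)
    print(f"{name}: {warps} warps per SM, "
          f"{regs_per_thread:.2f} registers per thread at full occupancy")
```

    On these assumptions the register file grows faster than the thread count, so a fully occupied SM has a slightly larger per-thread register budget than GT200 did.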

    DP is half rate as mentioned, and it's a FMA too. All memories the chip talks to, from registers up, are ECC protected (potentially, nobody ships ECC GDDR5, and I think the chip will address 'PC' DDR3 for that in the end). Not sure what scheme or penalty.

    New generation of PTX, CUDA 3.0. C++ in CUDA because of the unified address space.

    Some new predication support, although it's really not clear how the hardware makes it happen. Seems you can predicate any instruction.

    New atomic performance. Seems like it'll coalesce atomic ops in a warp and won't hit DRAM if the update fails, instead using L2 (GT200 replayed the transaction at DRAM hundreds of clocks later). The whitepaper explanation is wrong.

    Seems RF per SM has enough ports (256) and support from the operand fetch hardware to sustain full FMA rate across the chip.

    It can run multiple CUDA kernels now at the same time. Limit is 16 per chip (one per SM), but I think that'll be capped at 8.

    I think the tessellator is a software pipe with very little hardware support, too.

    Anyway, that's from memory, more later when I'm free.

    [IMG]

    If you want more, dkanter's ready with his (and it's excellent) here.
     
    #1 Rys, Sep 30, 2009
    Last edited by a moderator: Oct 1, 2009
  2. Rys

    Rys Tiled
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    3,893
    Location:
    Beyond3D HQ
    Can we keep this thread free of crap, please (other than my initial post!). Post links to the webcast so people can keep up, and other Fermi pieces from around the web when they pop up. Old thread got a bit silly at times, less of that if poss.
     
  3. homerdog

    homerdog donator of the year
    Legend Veteran Subscriber

    Joined:
    Jul 25, 2008
    Messages:
    5,559
    Location:
    still camping with a mauler
  4. McElvis

    Regular

    Joined:
    Apr 15, 2002
    Messages:
    269
    Location:
    London
    Some info from that anandtech article:

    "The price is a valid concern. Fermi is a 40nm GPU just like RV870 but it has a 40% higher transistor count. Both are built at TSMC, so you can expect that Fermi will cost NVIDIA more to make than ATI's Radeon HD 5870.

    Then timing is just as valid, because while Fermi currently exists on paper, it's not a product yet. Fermi is late. Clock speeds, configurations and price points have yet to be finalized. NVIDIA just recently got working chips back and it's going to be at least two months before I see the first samples. Widespread availability won't be until at least Q1 2010.

    I asked two people at NVIDIA why Fermi is late; NVIDIA's VP of Product Marketing, Ujesh Desai and NVIDIA's VP of GPU Engineering, Jonah Alben. Ujesh responded: because designing GPUs this big is "fucking hard"."
     
  5. Bouncing Zabaglione Bros.

    Legend

    Joined:
    Jun 24, 2003
    Messages:
    6,363
    There's no hardware out there yet, and it's going to be expensive and damn difficult to make. I hope Nvidia don't have to compromise the design and give us something less than all these slides promise.

    I don't want to raise the spectre of NV30, but the last time we saw this kind of forward publicity from Nvidia was when they were under pressure to produce an extremely ambitious design on a process that wasn't ready for such a complex design, that was running late, and something had to be put out as a spoiler against ATI's recently launched and very successful R300. We got a lot of promises that didn't translate into the finished product.

    One thing that will work both for and against GF100 is that they seem not to be focussing on the gaming side of things, but are sidestepping into the GPGPU realm. Obviously AMD and Intel may not follow them there, as they have CPUs to sell, but for Nvidia, it might make sense to make this new chip something other than a CPU or a GPU and effectively carve out a new market for themselves.

    The only problem will be if gamers no longer see this as a gaming product, and don't go for it. I'm not sure OEMs will want it at what's got to be a higher price than competing products, when it seems to be aimed at the GPGPU segment rather than gaming or general purpose use.

    It seems to be an amazing product if it lives up to the hype, but in the same way a Bugatti Veyron is an amazing thing - but it's not one I am likely to buy except for its gaming/video applications. Where's all the gaming stuff or is Nvidia moving away from that market?
     
  6. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
  7. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,941
    So the 16 SMs are on the "north" and "south" sides of the chip w/PCI-e and GDDR5 interfaces along the borders, any guesses as to what's in the center? Especially the very center. Scheduling?
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,251
    Location:
    Chania
    Rys,

    16 pixels/clock address and setup/SM? Are you sure 256 TMUs aren't way too much overkill for that kind of bandwidth?

    Also when you state 8Z/8C samples /clock for the ROPs, I assume it's either/or as in today's GPUs?
     
  9. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    So, the addition of ECC to the GDDR interface would definitely reflect on the chip's perimeter occupancy -- 64+8 bits per channel, for a grand total of a 432-bit data bus!?
    And it looks like there will be a third revision of the NVIO companion ASIC for the thing. :D
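
    The 432-bit figure follows if the 384-bit interface is split into 64-bit channels and each channel gains 8 extra ECC bits. That layout is the post's speculation, not a confirmed design; the sketch below just checks the arithmetic.

```python
# Bus width if every 64-bit memory channel carried 8 additional ECC bits.
# The 64+8 layout is speculative; only the 384-bit total comes from the spec list.
BUS_WIDTH_BITS = 384
CHANNEL_DATA_BITS = 64
CHANNEL_ECC_BITS = 8

channels = BUS_WIDTH_BITS // CHANNEL_DATA_BITS
total_width_bits = channels * (CHANNEL_DATA_BITS + CHANNEL_ECC_BITS)
extra_pins_fraction = (total_width_bits - BUS_WIDTH_BITS) / BUS_WIDTH_BITS

print(f"{channels} channels x {CHANNEL_DATA_BITS + CHANNEL_ECC_BITS} bits "
      f"= {total_width_bits}-bit bus ({extra_pins_fraction:.1%} wider)")
```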
     
  10. Groo The Wanderer

    Regular

    Joined:
    Jan 23, 2007
    Messages:
    334
    @1.5GHz/6GHz, but that may only be the current ones.

    Target of 750, I doubt they will be able to do it. Then again, Dear Leader might be flogging the troops until morale improves, and is gunning for higher, but that will likely mean only more delays. See G200 - the world's first .933TF GPU for more on this.

    2:1 ratio, the targets are 1.5TF SP, 768GF DP, but again with the caveat of clocks willing. I have reason to believe they won't be unless you are in the press.
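
    Those targets are consistent with the 1.5GHz figure above if the chip has 512 FMA lanes (16 SMs x 2 x 16, per the opening post) and DP runs at half rate. A quick sketch, with the lane count as the stated assumption:

```python
# Check the quoted 1.5TF SP / 768GF DP targets against a 1.5GHz shader clock.
# Assumes 512 FMA lanes (16 SMs x 2 x 16-way) as listed in the opening post.
FMA_LANES = 16 * 2 * 16
FLOPS_PER_FMA = 2
SHADER_CLOCK_GHZ = 1.5
SP_TO_DP_RATIO = 2  # "2:1 ratio"

target_sp_gflops = FMA_LANES * FLOPS_PER_FMA * SHADER_CLOCK_GHZ
target_dp_gflops = target_sp_gflops / SP_TO_DP_RATIO

print(f"SP: {target_sp_gflops:.0f} GFLOPS, DP: {target_dp_gflops:.0f} GFLOPS")
```

    Any shortfall in shader clock scales both numbers down linearly, which is why the targets come "with the caveat of clocks willing".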

    Gosh, really? Who would have guessed?
    http://www.theinquirer.net/inquirer/news/1137331/a-look-nvidia-gt300-architecture
    Almost like I knew what I was talking about all those months ago. Who would have thought. :)

    -Charlie
     
  11. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    5,726
    4 humongous piles of L2.
     
  12. Arun

    Arun Unknown.
    Moderator Veteran

    Joined:
    Aug 28, 2002
    Messages:
    4,971
    Location:
    UK
    Nice copy-pasting. I mean by NVIDIA's synthesis team, not by you, of course.

    Overall, I quite like the SM design - I was expecting the dual-MADD layout for a number of reasons (hint: GT200 didn't expose the full 1024 threads, so I knew it was going to jump to 1536, which meant 6 virtual RF read ports), although I'm surprised they've gone for dual-warp instead of dual-instruction; pleasantly surprised, mind you. I'm not pleasantly surprised by the fact 99% of your execution hardware is taking a nap when doing, say, basic integer operations which are quite important to me. Oh well - you can't please everybody! Even in terms of MUL/ADDs for graphics programs though, it seems rather inefficient.

    The SMs, TMUs, and MC-linked blocks are all easy to notice on the die shot. In the bottom left of the central block lies all the 'unique' stuff, conveniently quite near to the PCI Express analogue. What I find interesting, however, is that the MC-linked block is so huge. Seems like a lot of formerly "central" functionality was moved to the MC-linked blocks; I wonder if that includes input assembly and all of its little friends later in the pipeline! (also I really should go on IRC sometime!)
     
  13. jaredpace

    Newcomer

    Joined:
    Sep 28, 2009
    Messages:
    157
  14. jaredpace

    Newcomer

    Joined:
    Sep 28, 2009
    Messages:
    157
    [IMG]

    fellix beat me to it :)
     
  15. liolio

    liolio French frog
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,567
    Location:
    Bx, France
    It will be huge, more than 40% bigger than an HD 58xx. Just from the look of it, it doesn't look as tight as ATI's design.
     
  16. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Oh great, there's a die shot of this newfangled chip but AMD has yet to supply a die shot of Evergreen. Gaaah! :evil:

    The presentation on NVIDIA's site reminds me of Intel's Nehalem and QPI presentations.

    http://www.overclock.net/7295727-post8.html

    Tesla AIB unveiled, don't know if it's functional or just a mockup to boost confidence though..
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,661
    Location:
    London
    Anyone seen any TMUs? :razz:

    Jawed
     
  18. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    Some fun at the perimeter:

    [IMG]
     
  19. Bob

    Bob
    Regular Subscriber

    Joined:
    Apr 22, 2004
    Messages:
    424
    How so?
     
