NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    :lol:

    Damn, must make a plea to the planners and engineers to put a "cookie monster" in our ASIC's! :D
     
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
I've already asked Rys where the 16 pixels/clock address & setup per SM come from, but I'm still waiting for his answer. I don't think the 16 load/store units have anything to do with it.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    http://www.nvidia.com/object/pr_oakridge_093009.html

    Groovy.

    Jawed
     
  4. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    No, no.. the right words are: give me a wallpaper sized RV870 die shot, now!!!1!1one :twisted:

    On topic: Fermi board snapped ...sort of.
     
    #24 fellix, Sep 30, 2009
    Last edited by a moderator: Sep 30, 2009
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
The design put forth really picks off some low-hanging fruit that earlier Nvidia GPUs (and others) lacked:
multiple concurrent kernels, the closed/semi-closed write/read loop, half-rate DP throughput.

    The other stuff is downright crazy to see: indirection, exceptions, IEEE compliance, ECC, simplified addressing.

The mapping of separate memory spaces to lie within the global address space is an elegant way to get the benefits of hardware peculiarity in a more specialized instance without having it impinge on the general computation case.

I had sort of thought of a design using special page table bits that would allow hardware to route to special on-chip storage if enabled, and to easily forget it if not.
This isn't quite the same, but the idea of using the target memory location to delineate the special things you want done with it is a rather nice touch.
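The idea described above, letting the target address itself select the special behaviour rather than using special opcodes, can be sketched in a few lines. This is a toy model, not Fermi's actual decode logic; the window bases and sizes below are invented for illustration:

```python
# Toy sketch: route an access purely from where the address falls in one
# flat address space. All constants here are made up, not Fermi's.

SHARED_BASE, SHARED_SIZE = 0x0100_0000, 48 * 1024   # hypothetical on-chip window
LOCAL_BASE,  LOCAL_SIZE  = 0x0200_0000, 512 * 1024  # hypothetical per-thread window

def route(addr):
    """Pick a backing store from the address alone -- no cache hints or
    separate load/store instructions needed, which is the point above."""
    if SHARED_BASE <= addr < SHARED_BASE + SHARED_SIZE:
        return "on-chip shared"
    if LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE:
        return "local"
    return "global DRAM"

print(route(0x0100_0010))  # inside the shared window -> "on-chip shared"
print(route(0xDEAD_BEEF))  # anywhere else -> "global DRAM"
```

The nice property is that ordinary pointer-chasing code works unchanged; only the allocation decides which storage services the access.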

    The size of the chip shows the price of generality, though. FLOP density is not likely to be anywhere near Cypress, and I'd be curious to know if Larrabee's final clocks will mean even the x86 will have an advantage.

    I don't know how it will fare in gaming, or how many other problems there may be, but I have to give Nvidia credit: this design took balls.

    As far as DP is concerned, the quality of this implementation is enough to make Cypress appear as useful as its botanical namesake in HPC.

Physical and economic realities may yet intrude on all this (it doesn't exist on a store shelf), but as a topic of discussion I find this architecture much more interesting.
    The posited tool sets and initiatives are such that this is the first time I've ever thought a GPU designer took serious computation seriously.
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
I am waiting to see how many people will start to complain about the introduction of some sort of semi-coherent cache.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
You mean, 'cos coherency was too damn difficult? :lol: Do we have to wait for Einstein for full coherency?

    Jawed
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I'm curious about the bandwidth available for transfers through the L2.
    Larrabee has the ring bus, while Cypress has that read-only crossbar.

The neat part is that it is fully possible to write code that writes to non-coherent space.
No cache hints, no separate load instructions, just the allocation in shared memory.

It can be done either way. It seems to be the best of both worlds (coherence is much more relaxed than Larrabee's, though stricter than Cypress's nothing).
I wonder what kind of overheads are involved, and what protocol is used.
     
  9. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
Oh, I obviously meant 'execution' as in 'execution units'; the scheduling hardware is still very much there and busy. If you do need a large amount of both FP (especially @DP) and cheap INT work, then the 'total' overhead is much larger than it 'needs' to be, but not too awful (I'll admit to not fully knowing what the branching hardware can do on its own, though, if much of anything). This is not an unusual case, although as I said it is still (less) relevant to the many cases where you've got more MULs/ADDs than MADDs.

The decision to run cheaper operations faster must obviously be weighed against the cost of higher instruction issue and, critically, virtual RF ports. Given the many key architectural details I hadn't been expecting (exceptions, SP denorms, and the list goes on and on), it's very clear even to me that this approach makes good sense (especially given the usual workloads). In other architectures (one example idea: a 3-way VLIW that shares 6 RF ports), making cheaper operations faster would obviously require less extra overhead and would make more sense. You just can't have your cake and eat it too.
     
  10. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    You need to give the Nvidia engineers a little more credit :)
     
  11. compres

    Regular

    Joined:
    Jun 16, 2003
    Messages:
    553
    Likes Received:
    3
    Location:
    Germany
So it will be something like 10 times Jaguar or Roadrunner, I guess.

How many GPUs roughly would that be? 10 PFLOPS with maybe around 5,000 nodes?

Edit: I was discussing this a couple of months ago, because the NEC supercomputers seemed low in FLOPS when compared to GPUs, and the issue of available bandwidth between nodes (in the GPU case, whereas the NEC machines have plenty) was big for some applications. I hope they solve this problem in some interesting way with this new super, unless they specialize in very parallel workloads only.
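The "10 PFLOPS, maybe ~5,000 nodes" guess can be sanity-checked with back-of-envelope arithmetic. The per-GPU figure below is an assumption, not a published Fermi spec (512 ALUs, ~1.3 GHz shader clock, FMA counted as 2 flops, halved for double precision):

```python
# Back-of-envelope: how many GPUs for a 10 PFLOPS (double precision) machine?
# All hardware numbers here are assumptions for illustration.

alus = 512
clock_hz = 1.3e9                              # assumed shader clock
dp_flops_per_gpu = alus * 2 * clock_hz / 2    # half-rate DP ~ 0.67 TFLOPS

target = 10e15                                # 10 PFLOPS
gpus_needed = target / dp_flops_per_gpu

print(f"{dp_flops_per_gpu / 1e12:.2f} DP TFLOPS per GPU")
print(f"~{gpus_needed:.0f} GPUs for 10 PFLOPS")
```

Under these assumptions that's roughly 15,000 GPUs, which lines up with ~5,000 nodes only if each node carries about three of them.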
     
  12. Fusion

    Newcomer

    Joined:
    Mar 28, 2009
    Messages:
    29
    Likes Received:
    0
  13. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Are the transistors really separated like that though? I would have thought that the execution units weren't quite as discrete as shown in the block diagram. int math and mantissa fp math share a lot in common, no?
     
  14. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Wait, are you implying there's something I'm missing about the architecture? :) I assume you can't say, but if not and you're just saying I should be more enthusiastic, then don't get me wrong! This is a very very impressive solution for HPC, and from that point of view it's also a very exciting architecture with lots of nice things. The dual-scheduler approach isn't what I was expecting but it's definitely elegant. All this doesn't mean it's the best architecture for all possible purposes (nothing could ever be) and I was just pointing out one potential case where its weaknesses might be especially pronounced *if* I understood the architecture correctly. Here's hoping I didn't...

    dnavas: I don't know if they're separate like that, but one GT200 diagram at the Tesla Editor Day clearly indicated separate INT units and then an engineer told me outright that was only marketing when I asked. Maybe it's the same this time around, or maybe it isn't. Heck, maybe that's what Bob is implying here! (i.e. there are cases where the units can actually both be used at the same time)
     
  15. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
The L2 is coherent with itself; there is a single L2 slice for each memory bus ... there can only ever be one copy of a line.

That's not a cache coherency scheme, that is simply caching.
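The point that one L2 slice per memory partition leaves nothing to keep coherent can be illustrated with a simple address-interleaving sketch. The line size and slice count are assumptions for illustration, not disclosed Fermi parameters:

```python
# Each cache line maps to exactly one L2 slice, so no second copy can
# ever exist -- caching, not cache coherency. Constants are assumed.

LINE_BYTES = 128   # assumed cache-line size
NUM_SLICES = 6     # assumed: one L2 slice per memory partition

def slice_for(addr):
    """Deterministic address-to-slice mapping: every SM's access to a
    given line lands on the same, single slice."""
    return (addr // LINE_BYTES) % NUM_SLICES

# Two accesses to the same line, from any two SMs, hit the same slice:
print(slice_for(0x1000))      # -> 2 with these constants
print(slice_for(0x1000 + 4))  # same line, same slice -> 2
```

Compare a CPU-style private-cache hierarchy, where the same line can live in several L1s at once and a protocol has to reconcile the copies.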
     
  16. R300King!

    Newcomer

    Joined:
    Aug 4, 2002
    Messages:
    231
    Likes Received:
    5
    Looks like a ring bus!
    *runs* :D
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    In graphics I expect the 32-bit integer units, with all those bit manipulation capabilities, will be doing texturing while the floating point units are doing shader arithmetic - unless of course you have some integer shader math to do, in which case that'll get its turn.

    Jawed
     
  18. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    I somehow have to believe that, if the architecture under very special circumstances might be able to perform simultaneous integer and floating point operations, marketing would find a way to say "3 trillion ops" rather than "1.5 trillion flops". Not that they've [missing] ever done [mul] something like that before....

    Just saying :)

    Is that really 256 TUs? 16 bilerps AND 16 fetches per? That seems somewhat insane.

    -Dave
     
  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
Hmm, since the units are now FMA, there wouldn't be a missing MUL or an underutilized MUL anymore. Right?
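A quick way to see this point in numbers: a GT200-style peak counted a co-issued MUL on top of each MAD, which real code rarely sustained, while an FMA pipe counts a flat 2 flops per clock with nothing left to strand. The unit count and clock below are placeholders, not official figures:

```python
# Peak-flops accounting: MAD + co-issued MUL vs. a single FMA pipe.
# Unit count and clock are placeholder assumptions.

alus = 512
clock_hz = 1.3e9   # placeholder shader clock

# GT200-style marketing peak: MAD (2 flops) + the "missing" MUL (1 flop)
mad_plus_mul_peak = alus * 3 * clock_hz
# Fermi-style: one FMA per ALU per clock, always 2 flops when issued
fma_peak = alus * 2 * clock_hz

print(f"MAD+MUL peak: {mad_plus_mul_peak / 1e12:.2f} TFLOPS (co-issue dependent)")
print(f"FMA peak:     {fma_peak / 1e12:.2f} TFLOPS (no separate MUL to strand)")
```

The headline number drops, but the achievable fraction of it no longer depends on finding an independent MUL to pair with every MAD.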
     
  20. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
Does D3D11 compliance require hardware support? Sorry for the dumb question, Rys. :razz:
     
