NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm still not sure if you really understand it. The compiler is not telling the GPU what to load, it's telling it when to load.

    For example, it knows that it can't load values needed in ALU instructions 17 and 28 until ALU instructions 15 and 16 determine the addresses for those loads. Hence there is a tex clause consisting of two loads before ALU instruction 17. After instruction 16, the batch is put aside until the two loads arrive, and then it continues. Basically the compiler makes a big dependency graph and groups together loads when it can. The average group size effectively multiplies latency hiding.

    More complex hardware would issue the loads at instructions 15 and 16, put the batch aside until one of them gets back, and then have the option of either waiting until the next load arrives or executing up to instruction 27 and then waiting. This flexibility is important if you don't have enough threads to saturate either the ALU or TEX throughput with the previous method, but if you do have enough threads then it's overkill.

    I made a little GPU simulation program for Jawed about a year ago to illustrate how all this affects latency hiding and thus efficiency for any given program, but it was based on the simple scheduler. Maybe I'll add a more complicated scheduler to see what kind of difference it makes.
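    For readers without that program, here's a minimal round-robin sketch in Python of the idea (all numbers and names are illustrative, not from any real GPU or from the original tool): each batch runs ALU instructions until it hits a clause of loads, then sleeps for a fixed latency while the other batches keep the ALU busy.

```python
# Toy clause scheduler: a batch's program is a list of ('alu', n) and
# ('tex',) items; consecutive 'tex' items form one clause and stall the
# batch only once. LOAD_LATENCY is an illustrative number, not a real
# GPU figure.

LOAD_LATENCY = 100  # cycles until a clause of loads returns

def expand(program):
    """Flatten the program into single-cycle 'alu' ops and 'tex' markers."""
    ops = []
    for op in program:
        if op[0] == 'alu':
            ops.extend(['alu'] * op[1])
        else:
            ops.append('tex')
    return ops

def alu_utilization(num_batches, program, cycles=10_000):
    """Fraction of cycles the ALU is busy, with each batch looping the program."""
    ops = expand(program)
    pc = [0] * num_batches    # per-batch position in the op stream
    wake = [0] * num_batches  # cycle at which the batch is ready again
    busy = 0
    for now in range(cycles):
        for b in range(num_batches):
            if wake[b] > now:
                continue
            # Entering a clause: skip all its loads, stall once for the group.
            while ops[pc[b] % len(ops)] == 'tex':
                pc[b] += 1
                wake[b] = now + LOAD_LATENCY
            if wake[b] > now:
                continue
            pc[b] += 1  # execute one ALU instruction this cycle
            busy += 1
            break       # one ALU issue per cycle
    return busy / cycles

# Grouped clause (2xTEX-9xALU) vs split clauses (TEX-5xALU-TEX-4xALU):
grouped = [('tex',), ('tex',), ('alu', 9)]
split = [('tex',), ('alu', 5), ('tex',), ('alu', 4)]
```

    With 13 batches the grouped program nearly saturates the ALU, while the split one needs roughly twice as many batches to do the same: the "average group size multiplies latency hiding" effect.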
     
  2. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    Since I come from an image processing background, I can think of a lot of (not yet) common desktop applications which could use a lot of computation power. For example, good image denoising is one of them. Right now, a nice profile-based image denoising algorithm takes at least a few seconds on the fastest CPU with all cores utilized; applied to video, it takes even more time. Another application is real-time face tracking and face recognition. Fast image search is also greatly needed by many, but currently it's very hard to do any meaningful classification on personal pictures. Or even some crazier applications, such as real-time eye tracking for head-position-dependent 3D rendering, or gesture-based user interfaces. We don't have these applications today because current computers are just not fast enough.
     
  3. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Maybe you misunderstood what I was saying. The more complicated method that I mentioned is indeed what NVidia does, AFAIK. They can replicate claused behaviour, but why would they? Their hardware is capable of fancier scheduling, which is better in low thread situations.

    Tiny clauses are not good; the point of clauses is to group loads together, since bigger groups can only help latency hiding. TEX-5xALU-TEX-4xALU is worse than 2xTEX-9xALU.
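    A back-of-envelope version of that comparison, under assumed numbers (a 100-cycle load latency, one ALU instruction per cycle, round-robin batches): each clause costs one full latency wait per loop iteration, so splitting the loads into two clauses roughly doubles the number of batches needed to keep the ALUs busy.

```python
import math

LOAD_LATENCY = 100  # cycles; illustrative, not a real GPU figure

def batches_to_hide_latency(alu_instructions, num_clauses):
    # Per loop iteration: alu_instructions busy cycles plus one
    # LOAD_LATENCY stall per clause; other batches must cover the stalls.
    round_cycles = alu_instructions + num_clauses * LOAD_LATENCY
    return math.ceil(round_cycles / alu_instructions)

grouped = batches_to_hide_latency(9, 1)  # 2xTEX-9xALU         -> 13 batches
split = batches_to_hide_latency(9, 2)    # TEX-5xALU-TEX-4xALU -> 24 batches
```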
     
  4. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    If GPGPU starts to move towards more OO-style programming, clauses may get smaller, as OO programs tend to feature more indirection (polymorphic dispatch) and more "heap"-style (external loads) vs "stack"-style accesses. I've always thought that the ideal language for GPUs is pure-functional, not OO, since in functional programming mutating external state is the exception rather than the rule, and the lack of aliasing and mutation permits equational reasoning in the compiler, which can convert many record updates into local allocations.
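    A tiny illustration (in Python, with hypothetical names) of the equational reasoning being described: if a function is pure, a compiler can soundly replace repeated calls with one call and reuse the result; a hidden side effect makes that rewrite observable.

```python
# Purity enables common-subexpression elimination: f(x) + f(x) == 2*f(x)
# only holds if f touches no external state.

def pure_f(x):
    return x * x  # no external state touched

counter = {"n": 0}

def impure_f(x):
    counter["n"] += 1  # mutates external state: calls are observable
    return x * x

def twice_naive(f, x):
    return f(x) + f(x)

def twice_cse(f, x):
    t = f(x)  # "optimized": call once, reuse the result
    return t + t
```

    For `pure_f` the two versions are indistinguishable; for `impure_f` the optimization changes how many times the counter is bumped, so the compiler must be conservative.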
     
  5. darkblu

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,642
    Likes Received:
    22
    taking on from your last sentence, i don't think you can say that. AMD and NV (intel are yet to) have been catering to a market that was/is quite viable, but it has reached its saturation point. in this regard, nvidia are acting quite logically - they realize there's only this much room in the gaming enthusiast market. GPGPU computing is not only a fashionable development, GPU chip/IP companies actually depend on its wide adoption for their long-term survival. to return to your statement: consumers do not need GPGPU, but they want the benefits of the technology, just like they did not need motion-controlled gaming, but discovered they enjoyed it once a consumer product offered that.
     
  6. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    I wonder what the scheduling rate is on ATI HW: clauses would seem to allow ATI significantly more time (vs. NVidia) to look over a set of wavefronts and determine which one should be scheduled next. That, in combination with no instruction level dependency checking, should translate to a smaller, more power efficient scheduler on the ATI side.
     
  7. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    My point is more subtle than that. If you shipped consoles with TFLOPS of power, like an XBox720 or PS4, developers would find ways to utilize it. The PC desktop market is so fragmented that games end up using HW very inefficiently by comparison. You could ship PCs with a 2-TFLOPS GPU and 8-16 CPU cores, but no game is going to be designed with content that targets this. Scaling resolution up is a poor substitute for utilizing this power.
     
  8. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I think that's one of the reasons why Nvidia is pushing 3D Vision and PhysX as hard as they do.
     
  9. babcat

    Regular

    Joined:
    Sep 24, 2006
    Messages:
    656
    Likes Received:
    45
    Have we learned anything more about Fermi in regards to graphics?
     
  10. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    You mean like a Vantage Extreme score of 10,000 at mediocre clock speeds? No.
     
  11. babcat

    Regular

    Joined:
    Sep 24, 2006
    Messages:
    656
    Likes Received:
    45
    I mean like any hardware improvements that are specifically meant to enhance graphics. In one of the tech papers on Nvidia's site it states such improvements have been made.
     
  12. LunchBox

    Regular

    Joined:
    Mar 13, 2002
    Messages:
    901
    Likes Received:
    8
    Location:
    California
    my humble 2 cents.

    Well, from what I can gather from this thread, it looks like NVidia made Fermi with the supercomputing business in mind; that's most likely a given, since they are so hell-bent on putting double-precision capabilities into it. That market carries a very nice price premium, from what I've seen, and their solution is comparatively cheaper than AMD's or Intel's. And the reports/rumors about their ability to disable parts of the chip point to their gaming line. I don't know; it seems to make sense to me that way.
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    What if it's GT214? Maybe the shenanigans with the varying GDDR5 scalings (seen when comparing "GT215" and GF100) is part of the reason for the failure of GT214?

    Jawed
     
  14. KonKort

    Newcomer

    Joined:
    Dec 29, 2008
    Messages:
    89
    Likes Received:
    0
    Location:
    Germany, Ennepetal
    No, I wrote "MIMD-similar" units, and you will understand it when Nvidia launches the card.

    I reported it on May 15th; the tape-out was of course before that (in which timeframe exactly, I do not know).

    I am confused these days. Could it be that 2.4B transistors and 512 bit were planned for the desktop card, and the reported 3.0B transistors and 384 bit are based on the Tesla card with its many double-precision units?
    2547 GFLOPS is wrong. I speculate that the next-generation chip will still have a MADD and a MUL per core.
     
  15. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    With AMD not playing that game, they can't afford to cripple double precision for consumer cards, at least not performance-wise ... they could disable exceptions or something, I guess.
     
  16. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,182
    Likes Received:
    1,579
    Location:
    Beyond3D HQ
    There's nothing "MIMD-similar" about Fermi at all. You're a willing part of the misinformation and speculation, and some of your sources about hardware are clearly wrong or playing you. Double check your facts (you know, the basic journalism bit of your endeavour) before publishing.
     
  17. CouldntResist

    Regular

    Joined:
    Aug 16, 2004
    Messages:
    264
    Likes Received:
    7
    Tons of research has been done on how either of these can be optimized out. The only price is giving up the ability to load classes at runtime (as in JVM/.NET/Ruby/Python etc.).
    I have no intention of acting as the programming-paradigm-definition police, but the three language traits mentioned (purity, functional, and OO) can be (and are!) mixed in any combination. Choosing the OO paradigm doesn't bind you to side effects, just like choosing the functional paradigm doesn't bind you to a Hindley-Milner type system (despite the fact that in most existing cases the opposite holds).
     
  18. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Yet the only thing about which Nvidia has been unusually open is the compute architecture. Read David Kanter's article: there's absolutely nothing MIMD about Fermi.
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I don't think so. If you can run 16 kernels in parallel on Fermi, then how is it not MIMD-like? Perhaps it would be more appropriate to say that it is a transition from SPMD to MPMD.
     
  20. flynn

    Regular

    Joined:
    Jan 8, 2009
    Messages:
    400
    Likes Received:
    0
    You mean cheaper than using AMD/Intel CPUs, right?
     