NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    But that only works if you can predict latencies at compile time. How does that work when you have no idea what data you will be loading (and from where) at runtime?
     
  2. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London
    http://insidehpc.com/2009/10/06/interview-with-hpc-pioneer-justin-rattner/

    GPUs, exascale, and the HPC of tomorrow
    Although Intel stopped producing commercial supercomputers in 1996, it hasn’t stopped shaping the industry. The architectures pioneered by the company dominate today’s Top500 lists, and chips made by Intel were in 79.8% of all systems in the most recent Top500 list (Jun 2009). One of the challengers to the hegemony of Intel’s processors in HPC today is the GP-GPU. Rattner doesn’t see the GPU as a long-term solution, however. “For straight vector graphics GPUs work great, but next generation visual algorithms won’t compute efficiently” on the SIMD architecture of the GPU, he says, drawing a parallel between the era of vector computers and the Touchstone system and today’s processor/GPU debate. As he sees it the GPU restricts the choice of algorithm to one which matches the architecture, not necessarily the one that provides the best solution to the problem at hand. “The goal of our next generation Larrabee is to take a MIMD approach to visual computing,” he says (That’s a Larrabee die pictured at right being held by Intel’s Pat Gelsinger). Part of Intel’s motivation for this decision is that the platform scales from mobile devices all the way up to supercomputers. And they have early performance results that will be presented at an IEEE conference later this year that show that the Larrabee outperforms both the Nehalem and NVIDIA’s GT280 on volumetric rendering problems.
     
    #642 Vincent, Oct 9, 2009
    Last edited by a moderator: Oct 9, 2009
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    At this point I'm modestly interested in how one can say Larrabee is somehow free of SIMD baggage without some intellectual dishonesty.
    Maybe I'm getting hung up on the 512-bit SIMD unit per core.

    With Fermi allowing 16 kernels to operate on-chip at the same time, how non-MIMD would it be compared to Larrabee?
     
  4. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Maybe if you are running your own cluster and this is an HPC scenario. As I said before, this is super-niche. GPGPU scenarios are not going to make NV much money. And even if you allow that this is the target, the enterprise computing market (say, big pharma) doesn't make big purchases on cost/performance alone, otherwise no one would have bought Sun E10000 servers. You're thinking like an engineer and not a middle manager.
     
  5. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Making the hardware deal efficiently with lots of memory accesses distributed randomly throughout the program is expensive, clauses don't make it any more expensive (it's just a couple of bits in the end). It's just a hint which helps make life easier on the hardware for when life can be easy on the hardware ...
     
  6. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Then why is NVidia going down this road? AMD only has to worry about GPGPU performance and development tools if it does indeed become a big market that approaches the size of the GPU market. Unless I'm mistaken, that assumption is a prerequisite to this whole line of discussion.
     
  7. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Why do you need to predict latencies? Clauses are just a way to have a simple scheduling system that increases latency-hiding ability compared to fixed round-robin scheduling (like R100-R400, NV20, and probably NV30 & G70). It works like this:
    1. Run as many ALU instructions as you can without more data
    2. Issue loads for as much data as you can without needing more ALU instructions, and wait for the data to arrive. Note that some of this may only be needed later in the program stream.
    3. Repeat
    Batches simply alternate between 1 and 2. Hopefully you have enough batches to make sure that you are maximizing either ALU or fetch throughput. If so, then you're running as fast as possible and hiding all latency.

    Abandoning clauses for a more sophisticated scheduling system (e.g. one where you don't wait for loads to complete until you actually need them) will help in some extreme cases, but increasing the register file often achieves the same thing by allowing more threads, so whether it's worth doing the former, the latter, or neither depends on the workload.
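    The alternation in steps 1-3 above can be sketched as a toy model. This is illustrative only: the clause encoding, the single 1-op/cycle ALU, and the fixed fetch latency are assumptions for the sketch, not ATI's actual hardware.

```python
# Toy model of clause-based round-robin scheduling: batches alternate
# between ALU clauses and fetch clauses, and a batch waiting on a fetch
# is parked so another batch can use the ALU.
from collections import deque

def run(batches, fetch_latency=4):
    """batches: lists of clauses, each ('alu', n_instrs) or ('fetch', n_loads).
    Returns total cycles on a single 1-op/cycle ALU."""
    ready = deque((b, 0) for b in batches)      # (clause list, next clause index)
    pending = []                                # (cycle data arrives, batch, idx)
    cycle = 0
    while ready or pending:
        if not ready:                           # nothing runnable: stall until
            cycle = max(cycle, min(t for t, _, _ in pending))  # a fetch returns
        for item in [p for p in pending if p[0] <= cycle]:
            pending.remove(item)                # fetch done: batch runnable again
            ready.append((item[1], item[2]))
        if not ready:
            continue
        batch, idx = ready.popleft()
        kind, size = batch[idx]
        if kind == 'alu':
            cycle += size                       # ALU clause occupies the ALU
            if idx + 1 < len(batch):
                ready.append((batch, idx + 1))
        elif idx + 1 < len(batch):              # fetch clause: issue the loads
            pending.append((cycle + fetch_latency, batch, idx + 1))  # park batch
    return cycle
```

    With a single batch the fetch latency shows up as stall cycles; with enough batches the ALU stays busy while fetches are outstanding, which is the latency hiding the post describes.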
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I agree with Mintmaster. Clauses make *a lot* of sense in a data parallel world when you don't need some super clever scheduling mechanism to hide all sorts of latencies and I don't see this drastically changing any time soon.
     
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Yes, I understand the benefits of clauses, but this snippet above is what I don't get. Isn't the predictability of loads going to fall dramatically as workloads become less neat and uniform than they are with graphics today? How can the compiler construct load clauses when it doesn't know what the program will need to load at runtime? I'm not talking about static texture lookups in a shader here.

    Also, what would a scheduler look like if it has to support both clauses and instruction level dependency checks?
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I think the point here is that clauses don't hurt. They just help for workloads where the access pattern is statically predictable. They obviously can't help when workloads are irregular.
     
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Hence my followup question, how do you do both?
     
  12. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    I was assuming that the purpose of Fermi is still the consumer's desktop, not the cloud. It doesn't make a whole lot of sense for NVidia to target the cloud with GPGPUs, any more than it makes sense with LRB. There's a market for that kind of stuff, but it is dwarfed by the consumer market.

    If you add up the largest clusters in the world today (Microsoft, Google, Yahoo, Amazon), it might total 3-5 million machines. Everyone else is collectively dwarfed by them. NVidia would have to sell GPUs at $500 per chip to replace their current revenues. At that price, they are uncompetitive with commodity x86 clusters.

    This is an ancillary revenue stream; it cannot be the primary one. A special use case might exist for renderfarms, but look, Pixar rendered Cars on a 3,000-CPU farm IIRC. If NVidia's proposition is "buy Fermi and you only need 300 machines instead of 3,000", that would constitute only $3 million in revenue even at $10k per GPU.

    I would roughly equate the size of the 'pure GPGPU' (no graphics) market on par with "workstation" segment, where people in bioinformatics, finance, et al might like to run some kernels. In the special effects industry, real-time scene preview would be very valuable.

    No, the real win, if there is one, is to find killer applications for GPGPU on the desktop, ones where AMD is at a disadvantage. Silicon Graphics showed the folly of trying to address the server market only.
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    Why would it? It can simply flag individual loads as clauses.
     
  14. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    Exactly. Seems like the scheduler would just be issuing clauses rather than single instructions and an instruction depending on the results of a load would terminate a clause. I'm not sure how variable latency of loads would be a problem unless ATI is doing round-robin scheduling over clauses. I guess statically scheduling a clause could be a problem if arithmetic instructions have data-dependent latency; e.g. I believe divides on Intel CPUs are faster when divisors are small - is there anything like that in GPU land?
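    The demarcation rule psurge describes ("an instruction depending on the results of a load would terminate a clause") can be sketched as a compiler pass. The IR here is invented for illustration, and real ATI hardware separates ALU and TEX clauses more strictly than this simplified version does.

```python
# Sketch of clause demarcation over a linear instruction stream: a load
# leaves its destination register "in flight"; the first ALU instruction
# that reads an in-flight register closes the current clause, so the
# hardware has a natural point at which to wait for the data.
def demarcate(instrs):
    """instrs: ('load', dst) or ('alu', dst, src1, src2).
    Returns a list of clauses (each a list of instructions)."""
    clauses, current, in_flight = [], [], set()
    for ins in instrs:
        if ins[0] == 'load':
            in_flight.add(ins[1])
            current.append(ins)
        else:
            _, dst, *srcs = ins
            if in_flight & set(srcs):           # consumes a pending load:
                clauses.append(current)         # clause boundary goes here
                current, in_flight = [], set()
            current.append(ins)
    if current:
        clauses.append(current)
    return clauses
```

    Note how independent ALU work issued after the load stays in the same clause, so the fetch latency overlaps with it for free; only the first dependent instruction forces a boundary. Heavier branching or denser dependencies just produce shorter clauses, as MfA says below.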
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    I'm drawing parallels to one of my systems where we invoke various processes that manipulate the data based on attributes of each data element. Those processes can be all arithmetic or can hit a remote datasource. So you don't know in advance which instructions will be executed or which loads will be needed, so how can you define clauses around things you don't know? Clauses are fine for current workloads where nearly everything is static, but you guys seem to be saying you can use clauses in a more unpredictable environment too. I just don't see how that works, since clause demarcation happens in the compiler.
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Okay, so you aren't talking about niche applications then.

    If you're talking about making software that runs on an already-installed hardware base, then how does it help NVidia to sacrifice their GPU competitiveness? A software dev will definitely want to double their market by making it work, and work well, on AMD hardware. If these applications become prevalent enough to affect purchasing decisions, then ATI will adjust their hardware when necessary.

    When I was talking about a product, I wasn't talking about replacing existing farms from Microsoft, Google, etc. I was talking about a killer HPC product with mass-market appeal. Imagine automated driving, for example. I thought this was the type of GPGPU market explosion that you were talking about. I don't really see any common desktop application needing that kind of power, because we don't even have much use for 100 GFlops on top CPUs today, let alone 3 TFlops of OpenCL power.
     
  17. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
    It's just groups of instructions ... all heavier branching does is make clauses shorter.
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    There must be some overhead associated with clause processing such that the cost of, for example, 5 1-instruction clauses is higher than 5 ungrouped instructions.
    If there is such a cost, then there is a threshold where clauses don't make sense.

    I think there is some number of cycles that have to be hidden by other work, but I don't think the number is disclosed.
     
    #658 3dilettante, Oct 9, 2009
    Last edited by a moderator: Oct 9, 2009
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Yep, but demarcation is just the first part of it, and that happens in the compiler. What does the scheduler gain by juggling lots of tiny clauses? And on the flip side, why can't a fine-grained compiler/scheduler replicate claused behavior by simply moving loads to the most convenient place in the instruction stream (which I assume is happening anyway)? If I understand Nvidia's architecture correctly, multiple concurrent memory fetches can be in progress from the same warp.
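    The load-hoisting idea in the post above can be sketched as a simple compiler pass: move each load as early in the stream as its own dependencies allow, so a fine-grained scoreboard sees the same early-issue pattern a clause boundary would create. The instruction encoding is invented for illustration; this is not any vendor's actual compiler.

```python
# Sketch of load hoisting: each load bubbles upward past instructions it
# does not depend on (no one writes its address register, and no one
# reads or writes its destination register above the insertion point).
def hoist_loads(instrs):
    """instrs: ('load', dst, addr) or ('alu', dst, src1, src2)."""
    out = []
    for ins in instrs:
        if ins[0] != 'load':
            out.append(ins)
            continue
        _, dst, addr = ins
        pos = len(out)
        while pos > 0:
            prev = out[pos - 1]
            writes, reads = prev[1], set(prev[2:])
            if writes == addr or dst in reads or writes == dst:
                break                    # genuine dependency: stop hoisting
            pos -= 1
        out.insert(pos, ins)
    return out
```

    After hoisting, the distance between a load and its first consumer is maximized, which is exactly the latency-hiding slack a clause hands the hardware explicitly; the difference is that here the scoreboard discovers it per instruction at runtime instead of getting it as a compiler-marked group.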
     
  20. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    But this assumes that adjusting their hardware preserves all of their current benefits in terms of cost. In other words, you assume that NVidia can't really make their implementation more space efficient, but ATI can add all of the benefits of NVidia's architecture for the class of application we're theorizing, yet not suffer from it.

    HPC for me means scientific applications in computing clusters. I've never heard the term HPC applied to embedded computing. Typically HPC clusters work on problems like n-body simulation, computational fluid dynamics, weather simulation, molecular biology, etc.

    Sure, if NVidia could get every Toyota to use a Fermi for visual processing, that would be a large contract, but I was assuming that there is still a large market for media applications, be it games or photo processing. If Nvidia could demonstrate sheer superiority in physics with a killer app, or say, vastly accelerated OpenCL-based face recognition for photo libraries, it would be a good use case.

    I mean, if you really want to be pessimistic, there is simply no need, currently, for people on the desktop to have 2 TFlops of computing power, be it Fermi or R8xx. Most of it goes underutilized; no one is truly taking advantage of all that power on the desktop, and frankly, outside of games, no one has demonstrated much of a need for that kind of power. On consoles, it might be a different story, but today, Intel, Nvidia, and AMD are building products that consumers don't really need.
     