NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,866
    Location:
    Well within 3d
    Full MIMD would mean larger instruction caches and multiplied resources at the front end. Expanding a physical SIMD of width 16 into 16 MIMD units would require 16x the decoders, issue ports, and scheduling.
    It's not necessarily 16x the hardware, because each scalar unit is potentially simpler than the more complex SIMD unit.
    Regardless, Fermi is already plenty big.

    The primary argument for DWF was that Nvidia's scheduling and register hardware was already oddly complex for what it was doing, and DWF was an incremental increase that could yield throughput decently close to what MIMD could offer for the workloads targeted.

    This came up in the old G300 speculation thread. It's quite a trip down memory lane to go back there.
    A lot of what Fermi turned out to be reflected a lot of the grumblings at the time, and the apparent die size reflected some of the fears.
     
  2. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Even from a purely software perspective, though, the proverbial "crack in the armor" comes with shared memory and barrier synchronization, which, given the limits on how they can be used in the current programming models, give a greatly restricted model of "threads" compared to the usual definition. Even a simple producer/consumer pattern can't be modeled directly, since all "threads" in a group are forced to converge at every barrier. So while they can predicate and do other SIMD-like things to appear to execute different code, they cannot go off on arbitrary control flow graphs - at least not with the ability to share data inside that control flow (which rather limits the utility, not to mention the abstraction...). Perhaps this limitation will be lifted with Fermi though - we'll see.
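
    To make the contrast concrete, here is a minimal sketch (ordinary Python threads, purely illustrative, not CUDA) of the producer/consumer pattern described above. The two threads follow completely different control flow and synchronize only at the handoff points; in the converge-at-every-barrier model this can't be expressed directly, since a barrier would force both sides to the same program point.

```python
import queue
import threading

# A true producer/consumer: each thread runs its own control flow and
# they synchronize only through the shared queue, never at a global
# "everyone must arrive" barrier.
q = queue.Queue(maxsize=2)
results = []

def producer():
    for i in range(4):
        q.put(i * i)   # blocks only the producer when the queue is full
    q.put(None)        # sentinel: tell the consumer to stop

def consumer():
    while True:
        item = q.get() # blocks only the consumer when the queue is empty
        if item is None:
            break
        results.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 4, 9]
```

    The point is that neither thread ever has to know where the other one is in its code, which is exactly the independence the group-wide barrier model gives up.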
     
  3. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Even in regular CPU models, you end up with some of these problems. Look at the Java Memory Model spec for example. There, you flush around synchronization primitives, or you go with optimistic concurrency and validate/retry (with the extreme being software transactional memory, just added to C#)
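
    The validate/retry pattern mentioned above can be sketched in a few lines of Python. Everything here is illustrative: the `VersionedCell` class is a made-up name, and a lock stands in for a hardware compare-and-swap, which Python does not expose directly.

```python
import threading

class VersionedCell:
    """Optimistic concurrency: readers take a versioned snapshot, compute
    off to the side, then validate-and-swap; on conflict they retry."""
    def __init__(self, value=0):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # stands in for a hardware CAS

    def read(self):
        with self._lock:
            return self._value, self._version

    def compare_and_set(self, expected_version, new_value):
        with self._lock:
            if self._version != expected_version:
                return False           # another writer committed first
            self._value = new_value
            self._version += 1
            return True

def optimistic_increment(cell):
    while True:                        # the validate/retry loop
        value, version = cell.read()
        if cell.compare_and_set(version, value + 1):
            return

cell = VersionedCell()
threads = [threading.Thread(target=optimistic_increment, args=(cell,))
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.read()[0])  # 8
```

    Software transactional memory generalizes the same idea from one cell to arbitrary read/write sets, validating the whole set at commit time.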
     
  4. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Right, but the existence of both compiler- and processor-level memory barriers is not what's interesting here, as you need them in almost every imperative language that supports multi-threading. The problem is that if you restrict where these barriers can appear with respect to control flow, and further limit them to - say - a global scope (or at least global with respect to shared memory), then you remove a huge amount of expressiveness that even the Java model still has. You also completely remove the illusion that your "threads" are executing "independently", since they can effectively only make useful progress when running in conceptual lock-step (with predication).

    This is standard for SIMD, but non-standard for the use of the term "threads", hence the question of terminology choice. Now I will give NVIDIA et al. props for constantly trying to increase the expressiveness towards the goal of making these things operate as if they truly were independent threads by the traditional definition, but we're still a ways off. Fermi will undoubtedly bring us closer but it remains to be seen by how much.
     
  5. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Can atomics be used to implement barrier synchronization of threads that have arbitrarily diverged? If so, then wouldn't the illusion breaking __sync be an implementation detail that is exposed to allow improved performance in some cases, rather than a reason to claim that NV is misusing the term thread?
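
    In principle yes: a barrier can be built from nothing but an atomic counter, and threads that reached it along completely different paths will still rendezvous. A sketch of a sense-reversing barrier in Python (the lock emulates an atomic fetch-and-add; `SenseBarrier` is a made-up name):

```python
import threading
import time

class SenseBarrier:
    """Barrier built from a single fetch-and-add style counter (emulated
    here with a lock). Threads arriving from arbitrarily divergent code
    paths all meet at wait()."""
    def __init__(self, n):
        self.n = n
        self._count = 0
        self._sense = False
        self._lock = threading.Lock()

    def wait(self):
        with self._lock:                 # atomic {count += 1; snapshot}
            self._count += 1
            arrived = self._count
            local_sense = not self._sense
        if arrived == self.n:            # last arrival resets and releases
            self._count = 0
            self._sense = local_sense
        else:
            while self._sense != local_sense:
                time.sleep(0)            # spin-wait, yielding the GIL

log = []
barrier = SenseBarrier(2)

def worker(tid):
    if tid == 0:
        sum(i * i for i in range(1000))  # one thread crunches numbers...
        log.append(("pre", tid))
    else:
        time.sleep(0.01)                 # ...the other takes a different path
        log.append(("pre", tid))
    barrier.wait()                       # divergent threads still meet here
    log.append(("post", tid))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```

    Whether this performs acceptably on a GPU is another question: spinning warps still occupy scheduler slots, which is presumably why the fast, convergence-restricted __syncthreads is what gets exposed.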
     
  6. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Making sense of threads/vectors

    To me, the most consistent and accurate terminology seems to be that

    warp/wavefront size := vector lanes

    and

    in CUDA speak, (threads/block)/(warp size) := number of threads

    • G80
      has at most 24 hw threads per core, (16 cores overall), where each thread executes 32 wide simd instructions
    • GT200
      has at most 32 hw threads per core, (30 cores overall), where each thread executes 32 wide simd instructions
    • Larrabee
      has at most 4 hw threads per core, (?? cores overall), where each thread executes 16 wide simd instructions. LRB also implements multiple sw threads per hw thread for additional latency hiding
    • Cypress
      Cypress has at most ?? hw threads per core, (20 cores overall), where each thread executes 64 wide simd instructions. Needs a lot of ILP in code to reach peak performance.
    • Cell
      and cell has at most 1 hw threads per core, (8 cores overall), where each thread executes 4 wide simd instructions (counting only spe's)
    • Nehalem
      has at most 2 hw threads per core, (4 cores overall), where each thread executes 4 wide simd instructions. Needs a lot of ILP in code to reach peak performance.
    • Fermi
      has at most 48 hw threads per core, (16 cores overall), where each thread executes 32 wide simd instructions

    ALL chips above allow simd divergence to be handled, but with some performance penalty. Programming models are of course different depending upon vendor.

    Chips traditionally called CPUs expose a SIMD ISA to the programmer, leaving it to them to write SIMD code (or to autovectorizing compilers).

    Chips traditionally called GPUs do not expose a SIMD ISA to the programmer. Programmers usually write scalar code which is vectorized in hardware.

    Some chips (like Cypress, Nehalem, etc.) need a lot of ILP in code to reach peak performance.

    Vectors vs threads don't have to be mutually exclusive.
    What do the specialists of B3D think? :razz:
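
    Under that terminology the arithmetic is simple enough to spell out. A throwaway sketch (the per-chip numbers are copied from the list above, omitting Larrabee and Cypress where the counts are unknown; the derived totals are mine):

```python
chips = {
    # name: (max hw threads per core, cores, simd width per instruction)
    "G80":     (24, 16, 32),
    "GT200":   (32, 30, 32),
    "Cell":    (1,   8,  4),
    "Nehalem": (2,   4,  4),
    "Fermi":   (48, 16, 32),
}

def totals(threads_per_core, cores, simd_width):
    resident = threads_per_core * cores  # vector threads in flight chip-wide
    lanes = resident * simd_width        # scalar work-items those threads span
    return resident, lanes

for name, spec in chips.items():
    resident, lanes = totals(*spec)
    print(f"{name:8s} {resident:4d} resident vector threads -> {lanes:6d} scalar work-items")
```

    In CUDA speak the last column is what marketing counts as "threads"; the middle column is what a CPU person would call hardware thread contexts.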
     
    #786 rpg.314, Oct 15, 2009
    Last edited by a moderator: Oct 15, 2009
  7. spacemonkey

    Newcomer

    Joined:
    Jul 16, 2008
    Messages:
    163
    Likes Received:
    0
    I got an idea - let's call NVIDIA's threads "thNeads". That should clear up the confusion :mrgreen:
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,416
    Likes Received:
    178
    Location:
    Chania
    Not bad; it's just a tad too hard to pronounce. How about nThreads(tm)? :cool:
     
  9. Karoshi

    Newcomer

    Joined:
    Aug 31, 2005
    Messages:
    181
    Likes Received:
    0
    Location:
    Mars
    cudaThreads, obviously? As in:
    How many cudaThreads does your CPU (cuda processing unit) support?
    Further:
    ROPs->COPs
    TMUs->CMUs
    Cache->cudaches
    MC->CudamemController
    Jen-Hsun Huang->CUDA-Hsun Huang
     
  10. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I like this one...
     
  11. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,319
    Likes Received:
    23
    Location:
    msk.ru/spb.ru
    But we'll need to rename most of Cypress to Stream-something then. SBEs, SMUs, SMCs etc.
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    No, that would be wavey-something... :)
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,866
    Location:
    Well within 3d
    There are kernel threads, userspace threads, pthreads, etc.

    How about:

    GPUthreads
    ASICthreads
    Slavespace threads
    Evanescent threads
    Datumthreads
    Pseudothreads
    and so on threads
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    B3D threads... :)
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,120
    Likes Received:
    2,866
    Location:
    Well within 3d
    I think that makes sense to me as a general rule. I wouldn't see it being a problem as long as the exposed features exist in addition to the standard functionality and are not required to be used.
     
  16. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,485
    Likes Received:
    396
    Location:
    Varna, Bulgaria
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,418
    Likes Received:
    411
    Location:
    New York
    I love how Charlie slips in these zingers with no proof, source or corroborating evidence. Journalism at its finest :D
     
  18. Rangers

    Legend

    Joined:
    Aug 4, 2006
    Messages:
    12,322
    Likes Received:
    1,120
    But everything else he's right about, including that they EOL'd the GT200 parts, something you steadfastly denied across various forums until you couldn't do it anymore.
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Well this year so far, charlie has been mostly accurate :)
     
  20. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,319
    Likes Received:
    23
    Location:
    msk.ru/spb.ru
    Really? That's interesting. And the confirmation came from Charlie too?
     