NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Jawed - So ATI HW supports interleaved execution of 8 kernels on a single "core"? Does each core get to run 8 different kernels, or is it 8 across the entire chip? Regarding your comments on the NV scheduler - I don't understand the difference between scoreboarding threads and scoreboarding instructions. Do you mean entire warps are now scoreboarded? I'm having a hard time imagining why they would ever have scoreboarded at the level of a single "thread"; can you provide links to any documentation verifying this?
     
  2. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    AMD L3 isn't exclusive...nor is it fully inclusive. It's "special" :)
     
  3. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I absolutely am thinking about the hardware implications. We do not need 4X the ports. I may need fewer, larger banks with my design, but probably not, depending on what the current design is. 64-pixel batches with FP32 granularity mean there are only 256 different locations in each register file channel.
    DOA? Right now there are 64-pixel batches, and a 32-bit float is fetched per cycle in each of 4 register file channels (xyzw). That's 1kB/cycle. After 3 cycles, you get 3 reads per channel, or 12 reads per instruction group.
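
    Spelling that arithmetic out, purely as a sanity check on my own numbers (the figures are the ones above, nothing new):

```python
# Register fetch rate for a 64-pixel batch at FP32 granularity,
# with one 32-bit read per cycle from each of the 4 channels (x, y, z, w).
batch_size = 64            # pixels per batch
bytes_per_fp32 = 4
channels = 4               # register file channels

bytes_per_cycle = batch_size * bytes_per_fp32 * channels
print(bytes_per_cycle)     # 1024 -> 1kB/cycle

issue_cycles = 3           # cycles per instruction group
reads_per_group = issue_cycles * channels
print(reads_per_group)     # 12 reads per instruction group
```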

    I do have to confess one oversight that I wish someone would have pointed out earlier. ATI's GPUs allow up to 128 vec4 registers per thread. With 64-pixel batches, that works out to only two batches per SIMD. Thus my eight active batch system can't work. :oops:
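
    The occupancy side works out like this, assuming the commonly quoted 256KB of register file per SIMD (that figure is my assumption, it isn't stated in this thread):

```python
# Back-of-the-envelope register file occupancy per SIMD.
regfile_per_simd = 256 * 1024      # bytes per SIMD (assumed figure)
channels = 4
batch_size = 64                    # pixels
bytes_per_fp32 = 4

# One FP32 per pixel per channel is a 256-byte row, so each 64KB channel
# holds 256 distinct row addresses - the "256 different locations" above.
rows_per_channel = (regfile_per_simd // channels) // (batch_size * bytes_per_fp32)
print(rows_per_channel)            # 256

# 128 vec4 registers x 16 bytes x 64 pixels = 128KB per batch,
# so only two fully-allocated batches fit per SIMD.
bytes_per_batch = 128 * 16 * batch_size
print(regfile_per_simd // bytes_per_batch)   # 2
```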

    But come on, do we really need to run programs with 512 live floats without spilling? Aren't NVidia GPUs limited to far fewer? Still, it probably won't be hard for a scalar design to have two modes of operation, with the first as I described, and the second working on two active batches but requiring non-dependent instruction groups as is currently the case.

    Sorry, I couldn't find it.
     
  4. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Finding a way to prevent your hardware from being commodified is genius. We all know that eventually there'll be something like DirectPhysics, but open standards don't materialize out of thin air. First, independent actors innovate with proprietary solutions and refine them via market feedback. As competitors enter the space, the fragmentation causes issues for developers, until finally a standard, usually synthesized from the existing proprietary solutions, arrives.
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    I guess CUDA is a perfect example of what you've just described.
     
  6. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Just about all "standards" I'm aware of take this route. Those that start out as standards first, without having gone through a proprietary stage, usually fail because they are created on paper by bureaucracy before being tested. TCP/IP is one of the exceptions, although you have to ignore prior research.
     
  7. Panajev2001a

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,187
    Likes Received:
    8
    This is what makes me think OpenCL has a chance :) (given its quite apparent CUDA roots).
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    There is hope in this regard. The thing with OpenGL was that it drew its abstraction line way too high above the silicon, and as technology progressed, hardware evolution diverged from the abstraction OpenGL was supposed to provide. And the bad guys made sure that OpenGL would never be able to recover :evil:, witness 3Dlabs' efforts and the nuking of the object/template-based OpenGL 3.

    OpenCL, on the other hand, is following hardware evolution. OpenCL 1.0 is pretty much G80, and Cypress is a natural, small extension of that, say OpenCL 1.1 (which is due out in 2H09/1H10, btw). Rapid evolution along with the hardware is the best shot OpenCL has at surviving the deathmatch that is the extension race.

    Yet I am not discounting the deleterious effects of politics. There is hope, but caution is necessary.
     
  9. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    I think HTML5 WHATWG vs W3C is informative here. The W3C is like ARB, an industry consortium of many players, many of which don't even implement anything. The 'too many cooks' syndrome often acted to significantly delay specs, add useless extensions, or veto others. Progress slowed to a standstill. A parallel 'industry' HTML5 effort was born, with basically Mozilla, Apple, Google, and Opera, the four actors with horses in the race (MS recently joined). As a result, in just 2 years, browser capability has massively improved. Why? Because each vendor extends HTML5 with new features all the time, and then they quickly harmonize their implementations.

    Extensions are a good thing; they allow quick experimentation and feedback. The breakdown occurs when the players cannot quickly agree on harmonization. Realistically, NVidia, Intel, AMD, Apple, and perhaps Microsoft should handle this, keeping the group small and limited to people who can execute implementation changes quickly. Maybe I could see Sony/IBM in the mix, but the group should be kept small. Extensions to OpenCL shouldn't be feared, as long as a short list of players (who are relevant to the market) can sync up extensions quickly.

    They're bad when no one can agree, and the platform fragments.
     
  10. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    So you're saying a somehow artificially increased demand for graphics horsepower, be it PhysX, be it Eyefinity, be it 3D stereo, fragments the market even more by doing exactly what?

    I agree, all that stuff doesn't help alleviate the basic problem, but it helps both vendors to find some justification for increased graphics performance - which is basically what their business is all about.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I'm mostly interested in the vector portion of each core, to make a broad comparison of implementation cost.

    I'm basing the terminology on Intel's, used in describing Larrabee, the only sane terminology out there.

    A core does instruction decoding; a thread has a program counter; and each thread consists of strands that populate the SIMD-ALU lanes with work, so a thread takes multiple issue cycles to run basic instructions (e.g. a single-precision ADD is 1 issue cycle on Larrabee, 2 on GF100 and 4 on R800).
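
    To make the issue-cycle figures concrete (the per-thread strand counts are just the usual warp/wavefront widths, which I'm taking as given):

```python
# Issue cycles per basic instruction = strands per thread / SIMD width,
# with all three designs treated as 16 lanes wide.
simd_width = 16
strands_per_thread = {"Larrabee": 16, "GF100": 32, "R800": 64}

for chip, strands in strands_per_thread.items():
    print(chip, strands // simd_width, "issue cycle(s)")
# Larrabee 1, GF100 2, R800 4
```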

    A kernel is an entire, independent set of instructions. There's a terminological problem relating to scope of execution here, though: Larrabee seemingly supports 128 kernels (32 cores x 4 kernels per core), NVidia supports 16 kernels (1 per core), and ATI probably supports 8 kernels, with any subset of those 8 (1 to 8) per core.

    There's also a doubt in my mind over whether ATI truly supports multiple compute kernels per core. R600 supports 8 render states per GPU, where each render state can have a distinct VS, GS, PS etc. It'd be logical that up to 8 compute kernels can be scheduled on a core in R800, but there's only a very vague statement of multiple kernel support so far.


    1. Instruction issue in ATI is VLIW (variable-length VLIW, at that). The SIMD is 16 lanes wide. For some reason I forgot to mention that detail. Though it turns out all 3 architectures are 16-wide, so that in itself isn't a point of distinction.
    2. Calls are clearly described in the ISA document - they're static and support recursion through the Sequencer's stack.
    3. The 128KB cache per controller is outside the cores and, since it isn't read/write, doesn't really have meaningful functionality from the point of view of the core.
    4. I'm not interested in "wavefront", it's just a stupid name for thread.
    5. Shared memory is smaller than in GF100 (up to 48KB per 32 threads/1024 strands). There are coding implications in the way LDS operates in comparison with shared memory in NVidia, but I didn't want to go into those.
    1. The SIMD is 16 wide and there are 16 strands per thread, so one issue cycle per instruction.
    2. Lines can only be locked in L2, so L1 isn't truly shared memory. GF100 is providing locking in L1 (though granularity of locking is not very exciting).
    3. This is similar to how shared memory works.
    4. Rather meaningless statement. In terms of the vector unit, they are the same in all meaningful senses. Intel merely has only 4. In order to improve latency hiding, programmers are forced to use fibres (a purely software construct, so arbitrary in number) for sharing a thread's execution allocation.
    I could also have more explicitly compared the various forms of gather and scatter.

    Intel has two gather paths (texture units and direct, though the texturing path effectively overloads the cache system/ring-bus - it's unclear if the TUs are useful for non-texturing fetches), while it appears that the gather path in ATI and NVidia is shared with a core-dedicated texture unit (i.e. fetches without filtering). All rates are 16 scalars (32-bit) per clock - though we're waiting to see the LSU clock speed in GF100.

    Scatter appears to be 16 per clock in both Intel and NVidia, while in ATI it appears to be 64 across the entire GPU (i.e. only 2 cores can scatter at any time, at the rate of 32 scalars per clock per core). ATI caching appears to be only cursory here (only at the MCs for coalescing - though there's a question mark over how global atomics are implemented, and over the read-back of global read/write resources in general), whereas Intel and NVidia have dedicated caching per core. Clearly there's not enough off-die bandwidth for all Intel and NVidia cores to scatter into memory simultaneously.
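
    Aggregating those per-core rates chip-wide, with the core counts I'm using (32/16/20, the last being my assumption for R800), gives a rough picture:

```python
# Rough aggregate gather/scatter rates implied by the per-core figures above.
cores = {"Larrabee": 32, "GF100": 16, "R800": 20}   # R800 = 20 is assumed
gather_per_core = 16                                # 32-bit scalars per clock

for chip, n in cores.items():
    print(chip, "gather:", n * gather_per_core, "scalars/clk")
# Larrabee 512, GF100 256, R800 320

# Scatter: 16/clk per core for Larrabee and GF100, 64/clk for the whole of R800.
print("Larrabee scatter:", 32 * 16)   # 512
print("GF100 scatter:", 16 * 16)      # 256
print("R800 scatter:", 64)
```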

    I could have included constant and instruction caches in my comparison, but I decided they're probably too small and aren't key comparison points when comparing the implementation cost.

    Jawed
     
  13. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,806
    Likes Received:
    473
    L1 on NVIDIA is presumably banked (because it's also used as shared memory), so on random scatters which hit the cache it would have about 3 times higher throughput (I think; my probability theory is rusty).
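
    One crude way to sanity-check that intuition is a little Monte Carlo: throw random scatter addresses at a banked array, serialize conflicts, and see how many accesses get serviced per cycle. The bank counts and the 16-wide access here are illustrative assumptions, not NVIDIA figures, and the factor you get depends heavily on them and on what baseline you compare against.

```python
# Toy model: random 32-bit scatters into a banked array, conflicts serialized.
import random
from collections import Counter

def accesses_per_cycle(accesses=16, banks=16, trials=50_000):
    total_cycles = 0
    for _ in range(trials):
        load = Counter(random.randrange(banks) for _ in range(accesses))
        total_cycles += max(load.values())   # the busiest bank sets the cycle count
    return accesses * trials / total_cycles

for banks in (16, 32):
    print(banks, "banks:", round(accesses_per_cycle(banks=banks), 1), "per cycle")
# Several accesses per cycle, versus 1/cycle for an unbanked single-ported array.
```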

    PS. Using the Johnny-come-lately's definitions over the established ones (AMD/NVIDIA agree on what threads are) makes little sense to me and will generally just cause confusion. Also, I don't think Intel's chosen definitions make a lot of semantic sense to begin with.
     
    #733 MfA, Oct 12, 2009
    Last edited by a moderator: Oct 12, 2009
  14. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Do 512 CUDA cores and 1600 stream processors make 'semantic sense'?
     
  15. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,806
    Likes Received:
    473
    It's merely curious ... as opposed to counter-intuitive. A fiber made of strands ... really?
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    From a software standpoint, my impression is that Larrabee appears to be 128 cores, Nvidia 512, and AMD 320.
    The GPUs, through their dedicated scheduling hardware, allow divergent behavior for work units or pixels, each of which, as far as the software is concerned, is an individual core.
    Larrabee's emulation of similar behavior through strands and fibers is software-driven, so it is software-visible.

    At a hardware level, I'd say it's 32, 16, and 20 cores.
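
    For my own bookkeeping, this is how I read those two sets of numbers; the per-chip breakdowns are my assumption about what's being counted, not anything official:

```python
# "Software" view: contexts that can diverge independently as far as software sees.
software_view = {
    "Larrabee": 32 * 4,    # 32 cores x 4 hardware threads = 128
    "Fermi":    16 * 32,   # 16 cores x 32 lanes           = 512
    "Cypress":  20 * 16,   # 20 cores x 16 VLIW units      = 320
}
# "Hardware" view: units that actually fetch and decode instructions.
hardware_view = {"Larrabee": 32, "Fermi": 16, "Cypress": 20}
print(software_view)
print(hardware_view)
```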

    The trade-offs look to be interesting.

    In terms of operand bandwidth, AMD can load 3840 32-bit operands from registers per clock, at peak.
    This comes from over 5 MiB of register file.
    The LDS adds an extra aggregate 640 KiB.
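
    The Cypress numbers decompose like this; the per-SIMD figures (12 operands per VLIW unit per clock, 256KB of registers and 32KB of LDS per SIMD) are my assumptions about the hardware, chosen so the totals roughly match the figures above:

```python
# Cypress operand bandwidth and on-chip storage, per the assumed breakdown.
simds = 20
vliw_units_per_simd = 16
operands_per_vliw_per_clk = 12      # 32-bit register reads feeding each VLIW bundle

print(simds * vliw_units_per_simd * operands_per_vliw_per_clk)  # 3840 operands/clk

regfile_per_simd_kib = 256
lds_per_simd_kib = 32
print(simds * regfile_per_simd_kib / 1024, "MiB of register file")  # 5.0 MiB
print(simds * lds_per_simd_kib, "KiB of LDS")                       # 640 KiB
```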

    Larrabee has the capacity for 1536 such operands per clock. It clocks much higher, and it would take around 2 GHz to roughly equal Cypress.
    The register and L1 resources are small compared to Cypress, but they are backed up by the massive L2.
    In aggregate read/write programmable storage, Cypress has something over half the storage of Larrabee, while also being something over half the die size.
    At least as far as these two designs go, the differences are in the arrangement rather than the proportion of SRAM to logic.
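
    Same exercise for Larrabee, plus the clock it would need to match Cypress on raw operand rate; 3 operands per lane per clock (a three-source op) and an 850MHz Cypress clock are my assumptions:

```python
# Larrabee operand rate per clock, and the clock needed to match Cypress.
cores, lanes, operands_per_lane = 32, 16, 3
larrabee_per_clk = cores * lanes * operands_per_lane
print(larrabee_per_clk)                    # 1536 operands/clk

cypress_per_clk, cypress_ghz = 3840, 0.85
print(cypress_per_clk * cypress_ghz / larrabee_per_clk)   # ~2.1 GHz to break even
```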

    Cypress has a huge amount of register file, which offers high bandwidths within the registers, but each SIMD winds up sucking data through a straw when it comes to any transfers beyond that, and writeback in general has a much longer path to take than Larrabee's R/W cache.
    Larrabee has lower bandwidth between (edit: within) the L1/reg file area, but has a much more balanced L2 to compute ratio.

    I'm still poring over what Fermi has to offer. I am not sure what the aggregate amount of register file is for the entire chip; is it 128K per core or in total?

    Given the size of the chip, this design seems to lean more heavily on the logic side.
    The operand bandwidth per clock is what Larrabee would get.
    The L1/shared mem bandwidth with 16 L/S units is in aggregate half that of Larrabee because there are only 16 cores.
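
    On the same accounting (again, 3 operands per lane per clock and one L1 access lane per Larrabee SIMD lane are my assumptions), Fermi comes out as:

```python
# Fermi operand rate and aggregate L1/shared-memory ports vs Larrabee.
fermi_cores, fermi_lanes = 16, 32
print(fermi_cores * fermi_lanes * 3)       # 1536 operands/clk, same as Larrabee

fermi_ls_units = 16                        # load/store units per core
print(fermi_cores * fermi_ls_units, "vs", 32 * 16)   # 256 vs 512 -> half of Larrabee
```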

    Per core:
    The Fermi arrangement may be the most flexible of the three, though this depends on these L/S units being able to perform independent access (before taking bank conflicts into consideration) for 64 bytes (edit: in total from) 16 different accesses.
    Larrabee works on the granularity of cache lines, with 64 bytes per access.
    Cypress can get 4 point-sampled values, though it seems like 64 bytes is not to be expected without fetch4, which may lie between Larrabee and Fermi in flexibility.

    The global path to memory for Fermi is a little vague.
    Larrabee has the 1024 bit ring-bus of unverified implementation.
    Cypress has a 1024 bit crossbar for the texture path. The b3d article seems to hint with the 32-byte cache line sizes that the best case bandwidth is a cache line being accessed from each L2 quadrant.
    Writes go their own way, apparently.

    Without knowing more about how Larrabee implements its bus, I'm not certain about the math, but it looks like Larrabee has much higher and more flexible internal bandwidth than Cypress, given the higher clocks and generalized traffic.
    The ring bus might inject its own quirks, however.
     
    #736 3dilettante, Oct 12, 2009
    Last edited by a moderator: Oct 12, 2009
  17. flynn

    Regular

    Joined:
    Jan 8, 2009
    Messages:
    400
    Likes Received:
    0
    Nothing. Why would they be interested in promoting an open standard when they have their own proprietary and incompatible version (DirectCompute) they'd rather you used?
     
  18. Groo The Wanderer

    Regular

    Joined:
    Jan 23, 2007
    Messages:
    334
    Likes Received:
    2
    Increasing demand is good if done in a standards-compliant way. Doing it in a way that can only serve a small fraction of the market means a dev must decide whether to do the extra work to code for a minority. It may increase demand for GPU power, but only if people code for it.

    Proprietary APIs like this make developers say "We will gladly do it if you gladly hand us bags of cash". At least that is what I have seen. :)

    -Charlie
     
  19. nutball

    Veteran Subscriber

    Joined:
    Jan 10, 2003
    Messages:
    2,152
    Likes Received:
    482
    Location:
    en.gb.uk
    Half of whom (at the brand-name level at least) shouldn't really be involved in the development of any API related to high-performance computing.
     
  20. Rys

    Rys PowerVR
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,156
    Likes Received:
    1,433
    Location:
    Beyond3D HQ
    "high-performance computing" covers tens of billions of dollars of revenue.
     