Compare & Contrast Architectures: Nvidia Fermi and AMD GCN

Discussion in 'Architecture and Products' started by Acert93, Dec 24, 2011.

  1. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    I have read the Fermi and GCN architecture articles, but I am sending up a beacon to my B3D friends to offer a high-level comparison and contrast between the GCN and Fermi architectures (not so much implementations): namely a look at the various elements and how they are coordinated, how they differ, and the relative strengths/weaknesses of the various choices.

    I know this is a very broad request, but I am curious about why AMD/Nvidia have made different choices, where Kepler may go in a direction more like AMD's, and/or cases where NV is going a different direction. It seems they have some fundamental differences in scheduling, ALUs, rasterizers, etc. I'd also like to get into some of the finer details, like why AMD has stuck to 2 verts a clock whereas NV has moved to 8, and how architectural decisions play into this and so forth.

    What I am not looking for is a comparison of physical implementations, or random guesses/conjectures/fan-driven noise. If I wanted that I would be offering my own comparisons and contrasts, but I know I cannot offer anything beyond the superficial.

    Part of this thread was prompted by discussions about scheduling and how NV and AMD are going about it differently, and by obvious things like the scalar and SIMD ALU designs; the other part was Rys' comment (not sure if he was serious or not) that:

    Ok, for argument's sake let's say that is an accurate comment: why (beyond being new, DX11.1, etc.) is GCN the best desktop graphics architecture? In what ways has GCN gone ahead of Fermi, where is Fermi's design still better, and in what ways is the direction Fermi is heading a better/worse route?
     
  2. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    What still baffles me is that a GTX580 looks incredibly poor in the targeted synthetic benchmarks on ixbt.com and in pure specs (texture units, bandwidth, GFLOPS, etc.), yet still performs acceptably in the gaming benchmarks.

    There is obviously something that makes one architecture more efficient overall than the other, but it looks like this factor is much larger than one would intuitively think. It's almost as if you have to multiply the pure specs of a GF110 by a certain value higher than 1 (say, 1.4 or so) before you can start to compare the numbers with an AMD chip.
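
    To put rough numbers on that (public board specs, so treat the ratios as approximate): comparing the GTX 580 against its contemporary, the HD 6970, on paper gives

    $$\frac{2703}{1581}\approx 1.7\ \text{(GFLOPS)},\qquad \frac{84.5}{49.4}\approx 1.7\ \text{(GTex/s)},\qquad \frac{176}{192}\approx 0.9\ \text{(GB/s)}$$

    so on ALU and texturing theoreticals Cayman is roughly 1.7x ahead on paper, while bandwidth is the one line where GF110 actually leads, yet the two boards trade blows in games. Some such fudge factor clearly exists, whatever its exact value.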

    It'd be incredibly interesting to understand what makes a particular architecture more efficient than the other, but I don't know if it's at all possible. It may quite literally be hundreds of different things (FIFO sizings and the like) that each make a difference of 0.1% but accumulate into something really meaningful.

    (Please, don't start throwing perf/mm2 and perf/W numbers at me, that's not what I'm talking about.)
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    GCN is decidedly better IMO too, barring a few rough edges. Comparing it to Fermi is a bad idea since they are not contemporaries.
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    It's curious that there's so much talk about nVidia following in AMD's footsteps when it's GCN that has actually taken a huge step toward the Fermi way of doing things. Dropping VLIW is a pretty big deal and there's the introduction of proper caching, ECC etc. The massive improvements in some compute workloads totally justify the move.

    I honestly don't see much else that's different now that wasn't different last generation too. AMD still handles geometry in dedicated units outside of the main shader core and instruction scheduling relies on simple (but smart) wavefront level scoreboarding. The GDS lives on as well.

    GCN seems to be a very well balanced architecture with no obvious weak points just yet. It has completely shed the VLIW albatross. Fermi's biggest failing was power consumption but it's a pretty well balanced arch as well - more so than Evergreen or NI. If Kepler addresses that particular issue it may be in the running for Rys' "best architecture ever" award :)
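
    To make the VLIW point concrete, here is a toy sketch (written as generic CUDA-style code purely for illustration; the same shader logic applies on either vendor): a serially dependent chain of operations gives a VLIW4 packer nothing independent to co-issue within a bundle, while a scalar-per-lane design like Fermi's SM or a GCN SIMD simply issues it back to back at full utilization.

    ```cuda
    // Toy kernel: every statement depends on the previous result, so there is
    // no instruction-level parallelism for a VLIW4 compiler to pack per lane.
    // On Cayman-style hardware roughly 3 of the 4 slots in each bundle would
    // sit idle for this chain; on a scalar-per-lane machine it simply issues
    // one op per cycle per lane with nothing wasted.
    __global__ void dependent_chain(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float x = in[i];
        x = x * 1.5f + 0.25f;   // depends on x
        x = x * x + 1.0f;       // depends on the previous line
        x = sqrtf(x);           // depends on the previous line
        x = x * 0.5f - 0.125f;  // depends on the previous line
        out[i] = x;
    }
    ```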
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    For graphics it certainly appears that GCN's approach to geometry is more practical and effective. Fermi's distributed geometry processing hasn't really been fruitful.

    On the flip side Fermi does a whole lot more with a whole lot less. Just look at arithmetic, texture and ROP theoreticals. As silent_guy said, there's a lot of secret sauce at work that isn't apparent in press deck diagrams. There are fewer differences this generation so it should actually get easier to spot the strengths and weaknesses of each arch.

    The potential differences this round could be:

    - Distributed vs centralized geometry processing
    - Scalar unit vs no scalar unit
    - Special functions running on dedicated vs general hardware
    - Batch size of 64 vs 32
    - Number of registers per thread (GCN is now around Fermi levels I think)
    - Handling of atomics
    - ILP (GCN abandons this completely)
    - On-chip bandwidth
    - Memory/ALU co-issue
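
    On the batch size point above, a quick sketch of why 64 vs 32 can matter (plain CUDA used purely as an illustration; the only parameter that differs between the designs is the width of the batch sharing one program counter): whenever lanes within a batch disagree on a data-dependent branch, the whole batch executes both paths with inactive lanes masked, and the chance of disagreement grows with batch width, so a 64-wide wavefront diverges at least as often as a 32-wide warp on the same shader.

    ```cuda
    // A data-dependent branch: any warp (32 lanes on Fermi) or wavefront
    // (64 lanes on GCN) whose lanes disagree on the condition runs BOTH
    // paths with inactive lanes masked off. Wider batches are statistically
    // more likely to contain at least one disagreeing lane.
    __global__ void divergent(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float v = in[i];
        if (v > 0.0f)
            v = expf(v);        // path A
        else
            v = -0.5f * v;      // path B: a diverged batch pays for A and B
        out[i] = v;
    }
    ```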
     
  6. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,431
    Likes Received:
    261
    GTX 580's peak rate is 4 prims/clock, not 8.
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,095
    Likes Received:
    2,814
    Location:
    Well within 3d
    On a global level, Fermi has a scheduler that builds kernels and farms work out to each SM, which is probably the closest equivalent to a CU.

    The arrangement for Tahiti involves more units: AMD seems to put the command processor, a CS pipe, a work distributor, primitive pipes, and pixel pipes in a box labeled Scalable Graphics Engine.
    That block is probably responsible for maintaining the heavier graphics context, and I think this includes among other things the primitive pipes and ROPs.
    Alongside the graphics engine are two ACE blocks, which skip trying to maintain the API abstraction and just do compute.
    Fermi's global scheduler can theoretically track 16 different kernels; the GCN slides indicate multiple tasks can operate on the front end, but no number is given.
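
    (On the Fermi side that concurrency is what the CUDA stream API exposes; a minimal sketch, with the kernel and buffer setup here purely illustrative:)

    ```cuda
    #include <cuda_runtime.h>

    // Trivial kernel; launching independent instances of it into separate
    // streams is what allows Fermi's global scheduler to keep multiple
    // kernels (up to 16 on GF100/GF110) in flight, resources permitting.
    __global__ void scale(float* data, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= k;
    }

    void launch_concurrent(float* d_buf[4], int n)
    {
        cudaStream_t s[4];
        for (int i = 0; i < 4; ++i) cudaStreamCreate(&s[i]);

        for (int i = 0; i < 4; ++i)   // independent work per stream
            scale<<<(n + 255) / 256, 256, 0, s[i]>>>(d_buf[i], 2.0f, n);

        for (int i = 0; i < 4; ++i) {
            cudaStreamSynchronize(s[i]);
            cudaStreamDestroy(s[i]);
        }
    }
    ```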

    The front end of Tahiti versus Cayman has at least one simplification, since the formerly global clause scheduling hardware has now been distributed to the CUs.


    A CU versus SM comparison shows a number of differences, such as whether there is a hot clock and the differing number of SIMD blocks per unit.
    The division of resources for read and write is different. Fermi has a 64 KB pool of memory that is both the L1 and LDS. There is a small texture cache off to the side as well.
    Tahiti has an L1 data/texture cache, and the LDS is off to the side.
    There doesn't seem to be an explicitly separate SFU for Tahiti, but it does have an explicit scalar unit with a shared read-only cache supporting it.

    Fermi has two schedulers per SM, each of which does dependence tracking and handles variable-cycle operations thanks to the different units it can address.
    Tahiti has four schedulers, but a more uniform operation latency and conservative issue logic mean it does not track dependences, just whether the last instruction completed (with a few software-guided runahead cases).
    Fermi's ALUs work with one 128 KB register file, while Tahiti has split its register file into four 64 KB files, one local to each SIMD.
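
    As an aside, the split register file makes residency easy to reason about (using the commonly cited GCN limits; the per-work-item register count below is just an illustrative assumption): a 64 KB file gives each of the 64 lanes 256 32-bit VGPRs, and GCN caps residency at 10 wavefronts per SIMD, so

    $$\text{wavefronts per SIMD} = \min\!\left(10,\ \left\lfloor\frac{65536}{64\times 4\times R}\right\rfloor\right) = \min\!\left(10,\ \left\lfloor\frac{256}{R}\right\rfloor\right),\qquad R=32\ \Rightarrow\ 8$$

    where R is the number of VGPRs the kernel uses per work item.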

    The philosophical difference between Fermi and Tahiti is probably stronger than the physical differences. Introducing/exposing the scalar unit and dispensing with the clause system means Tahiti has given up on the SIMT pretense. The architecture is explicitly SIMD, whereas prior chips tried to maintain a leaky abstraction of thread = SIMD lane.
    There is a lot less hidden state, though I believe AMD has indicated that there is still some with the texture path.

    Tahiti's write capability brings it closer in line with Fermi. If a cache line is to be kept coherent, it seems Tahiti is more aggressive at writing back values at wavefront boundaries, while Fermi will not force full writeback until a kernel completes.
    Fermi's export functionality seemed to piggyback on the cache hierarchy, with the ROPs using the L2 instead of having their own caches. Its GDS is global memory.
    GCN does not seem to do this. There are export instructions and a GDS. The ROPs are drawn as being distinct from the L2 and not ganged to a memory channel like the cache is. There is an export bus distinct from the L2 datapath.
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Apparently GCN was made to play games whose engine works like Serious Sam 3 BFE:

    http://www.pcgameshardware.de/aid,8...xpress-30-und-28-nm/Grafikkarte/Test/?page=13

    Or maybe I should say this game seems to suit AMD cards. Or maybe the game just has the kind of complexity that AMD's drivers can cope with.

    I'm afraid to say that these days there seems little point in trying to understand why, because there's simply not enough raw data to pick apart the interesting parts of these architectures.
     
  9. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,478
    Likes Received:
    383
    Location:
    Varna, Bulgaria
    Well, the game's engine, being DX9 and stuffed with a ton of effects, is probably very heavy on pixel/texture fill-rate and certainly can utilize the generous bandwidth upgrade and raw sampling throughput.
     
  10. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,702
    Likes Received:
    2,430
    SS3 runs better on the 6970 than the GTX 580, so it generally runs better on AMD hardware.
     
  11. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    I'm curious how much memory bandwidth plays into this - the HD7970 has 'only' 37.5% more memory bandwidth than the GTX 580 (and considering how GDDR5 works, the real usable difference might even be slightly lower than that). It would be very interesting to see how these two GPUs compare with either a severely underclocked memory clock (to test bandwidth efficiency) or a severely underclocked core clock (to test performance with little bandwidth limitation).
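
    (For reference, that figure drops straight out of the board specs, both cards having a 384-bit bus:)

    $$\frac{384\,\text{bit}\times 5.5\,\text{Gbps}/8}{384\,\text{bit}\times 4.008\,\text{Gbps}/8}=\frac{264\ \text{GB/s}}{192.4\ \text{GB/s}}\approx 1.37$$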
     
  12. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,011
    Likes Received:
    110
    Incidentally, computerbase ran this benchmark on the 7970 with the memory bandwidth of the 6970: http://www.computerbase.de/artikel/...deon-hd-7970/20/#abschnitt_384_bit_in_spielen - judging by these results, bandwidth isn't really important in that benchmark; a 10% increase for 50% more memory bandwidth is not much. This is without SSAA, though; the results could potentially be quite different (or not...) otherwise.
     
  13. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    In the bring up and optimisation process we ran Tahiti in an 8CH configuration as well as 12CH to compare how it was doing against Cayman. It was common to see 20%-30% performance improvements for the 12 channel case.
     
  14. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    Very nice - so it does benefit significantly from the higher bandwidth, but not so much that it's really bandwidth starved. I assume 12CH vs 8CH is a bigger difference than the 2/3rd memory clock computerbase tested because of how the memory controllers and GDDR5 error correction work. If you look at the article, some games barely scale at all with memory bandwidth (2-3% at most) while many others scale between 20 and 25%, so the average for games that do scale with it is pretty good.

    So the good news there is that if future drivers improve shader core performance a lot then there'll be enough bandwidth to keep it fed. However there is still one thing I do not understand about the computerbase article: why do games barely scale more with 4xAA/16xAF than without it? There are plenty of TMUs so presumably it's not being too limited by AF filtering performance. Is it being bottlenecked by the 32 ROPs? I don't think it makes sense to that extent but I don't understand what else it could be... Surely your framebuffer compression algorithms aren't so good that MSAA doesn't increase bandwidth at all! :)
     
  15. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,478
    Likes Received:
    383
    Location:
    Varna, Bulgaria
    I think that GCN is definitely limited somehow at the ROP level -- the alpha blending rates make it clear that BW is not the problem in this case. Regarding AF, I don't think the count of texturing units alone is a decisive factor for the performance hit -- raw texturing throughput doesn't diminish the AF sampling latency, so the relative performance hit is more or less a constant here. One thing that is different is that the texture L1 is now completely reworked in how it operates compared to the previous architectures going back to RV770. On top of that, access to this L1 R/W cache is now shared by the TMUs and the ALUs, while Fermi kept the dedicated texture streaming cache alongside its new L1, which is mostly used for register spills.
     
  16. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    My opinion here is that we're walking quite a few miles along the complexity path while frequently ignoring the obvious answer to this question - hardware is little to nothing without software. NVIDIA gets two rather important things very right: investment in their own SW, which translates into a very solid SW stack across the board, from their graphics/gaming drivers to the compute side, and investment in devrel, which in this business more often than not means they get to do the work and simply ship finished code that hooks into some codebase.

    Neither games nor the early compute efforts are pedal-to-the-metal, hyper-optimised efforts on any level (for various reasons that would sit well in a separate discussion), so having a worse driver stack and less optimisation and qualification done on your HW is not something that can easily be undone through sheer throughput on ATI's side. The red herring of ATI VLIW utilisation and whatnot will be gone from future debates, I guess, but the underlying SW weakness will remain for the foreseeable future. Directed tests take away almost all of the burden from the software stack, since it gets fed simplistic and often optimised code, so it's far harder for it to flounder, which is why they end up tracking hardware differences more accurately.
     
  17. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
  18. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    I do to a significant extent, especially given what those tests are. I think overestimating the prowess of ATI's drivers for intricate work or the maturity of their CL stack whilst at the same time underestimating the impact of not being the lead development platform is a risky business.
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I am curious about how the int32 multiplication is handled. Is it still slower than int24?
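
    (For anyone unfamiliar with the distinction being asked about, a minimal CUDA sketch; the 24-bit path is what the __mul24/__umul24 intrinsics expose, trading operand range for a potentially cheaper multiplier, and OpenCL has the equivalent mul24() built-in:)

    ```cuda
    // __mul24() uses only the low 24 bits of each operand; on hardware with a
    // dedicated 24-bit integer multiplier it can be cheaper than a full
    // 32-bit multiply. The difference below is zero whenever both operands
    // fit in 24 bits.
    __global__ void mul_compare(int* out, int a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int full = i * a;            // full 32-bit multiply
        int m24  = __mul24(i, a);    // 24-bit multiply intrinsic
        out[i]   = full - m24;       // zero while i and a stay below 2^24
    }
    ```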
     
  20. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,011
    Likes Received:
    110
    Yes, that sounds more like it. The cb numbers looked lower than I expected on average too, being only ~20% (the highest was 27%, which is more what I expected as the average). Maybe the mix of games (or the settings used) there just isn't very bandwidth sensitive.
     