Compare & Contrast Architectures: Nvidia Fermi and AMD GCN

Acert93

I have read the Fermi and GCN architecture articles, but I am appealing to my B3D friends to offer a high-level comparison and contrast of the GCN and Fermi architectures (not so much the implementations). Namely, a look at the various elements and how they are coordinated, how they differ, and the relative strengths/weaknesses of the various choices.

I know this is a very broad request, but I am curious about why AMD/Nvidia have made different choices, where Kepler may go in a direction more like AMD's, and/or cases where NV is heading a different way. It seems they have some fundamental differences in scheduling, ALUs, rasterizers, etc. Likewise, getting into some of the finer details of things like why AMD has stuck to 2 verts a clock whereas NV has moved to 8, and how architectural decisions play into this and so forth.

What I am not looking for is comparisons of physical implementations, or random guesses/conjecture/fan-driven noise. If I wanted that, I would be offering my own comparisons and contrasts, but I know I cannot offer anything beyond the superficial.

Part of this thread was prompted by discussions about scheduling and how NV and AMD are going about it differently (obvious things like the scalar and SIMD ALU designs), and the other part was Rys' comment (not sure if he was serious or not) that:

This is the best desktop graphics architecture and physical implementation ever. Some rough edges, but that's the long and short of it.

Ok, for argument's sake let's say that is an accurate comment: why (beyond being new, DX11.1, etc.) is GCN the best desktop graphics architecture? In what ways has GCN gone ahead of Fermi, where is Fermi's design still better, and in what ways is the direction Fermi is going a better/worse route?
 
What still baffles me is that a GTX 580 compares incredibly poorly in the targeted synthetic benchmarks on ixbt.com and in pure specs (texture units, bandwidth, GFLOPS, etc.), yet still performs acceptably in the gaming benchmarks.

There is obviously something that makes one architecture more efficient overall than the other, but it looks like this factor is much larger than one would intuitively think. It's almost as if you have to multiply the pure specs of a GF110 by some value higher than 1 (say, 1.4 or so) before you can start to compare the numbers with an AMD chip.

It'd be incredibly interesting to understand what makes a particular architecture more efficient than the other, but I don't know if it's at all possible. It may be quite literally hundreds of different things, FIFO sizings etc, that each make a difference of 0.1% but accumulate into something really meaningful.

(Please, don't start throwing perf/mm2 and perf/W numbers at me, that's not what I'm talking about.)
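To make that "multiply the pure specs" intuition concrete, here is a minimal sketch, with purely illustrative numbers rather than measured data, of how one could back such a factor out of a spec ratio and an observed performance ratio:

```python
# Back-of-envelope for the "efficiency multiplier" idea above. All numbers
# are illustrative placeholders, not benchmark results.

def implied_efficiency(spec_a, spec_b, perf_a, perf_b):
    """Factor by which chip A's paper spec must be scaled so that the
    spec ratio matches the observed performance ratio."""
    spec_ratio = spec_a / spec_b
    perf_ratio = perf_a / perf_b
    return perf_ratio / spec_ratio

# Hypothetical case: chip B lists ~1.8x the theoretical throughput of
# chip A, but only delivers ~1.25x the frame rate in a given game.
factor = implied_efficiency(spec_a=1.0, spec_b=1.8, perf_a=1.0, perf_b=1.25)
print(f"Chip A behaves as if its specs were x{factor:.2f}")   # ~1.44
```

Of course this only restates the observation as a single number per game; it says nothing about which of the hundreds of small factors produces it.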
 
GCN is decidedly better IMO too, barring a few rough edges. Comparing it to Fermi is a bad idea since they are not contemporaries.
 
It's curious that there's so much talk about nVidia following in AMD's footsteps when it's GCN that has actually taken a huge step toward the Fermi way of doing things. Dropping VLIW is a pretty big deal and there's the introduction of proper caching, ECC etc. The massive improvements in some compute workloads totally justify the move.

I honestly don't see much else that's different now that wasn't different last generation too. AMD still handles geometry in dedicated units outside of the main shader core and instruction scheduling relies on simple (but smart) wavefront level scoreboarding. The GDS lives on as well.

GCN seems to be a very well balanced architecture with no obvious weak points just yet. It has completely shed the VLIW albatross. Fermi's biggest failing was power consumption but it's a pretty well balanced arch as well - more so than Evergreen or NI. If Kepler addresses that particular issue it may be in the running for Rys' "best architecture ever" award :)
 
Ok, for argument's sake let's say that is an accurate comment: why (beyond being new, DX11.1, etc.) is GCN the best desktop graphics architecture? In what ways has GCN gone ahead of Fermi, where is Fermi's design still better, and in what ways is the direction Fermi is going a better/worse route?

For graphics it certainly appears that GCN's approach to geometry is more practical and effective. Fermi's distributed geometry processing hasn't really been fruitful.

On the flip side Fermi does a whole lot more with a whole lot less. Just look at arithmetic, texture and ROP theoreticals. As silent guy said there's a lot of secret sauce at work that isn't apparent in press deck diagrams. There are fewer differences this generation so it should actually get easier to spot the strengths and weaknesses of each arch.

The potential differences this round could be:

- Distributed vs centralized geometry processing
- Scalar unit vs no scalar unit
- Special functions running on dedicated vs general hardware
- Batch size of 64 vs 32 (a toy divergence model is sketched after this list)
- Number of registers per thread (GCN is now around Fermi levels I think)
- Handling of atomics
- ILP (GCN abandons this completely)
- On-chip bandwidth
- Memory/ALU co-issue
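On the batch-size point, here is a toy divergence model in Python, under the simplifying and unrealistic assumption that each lane takes a branch independently with probability p (real shader branches are heavily correlated), showing why a 64-wide wavefront is somewhat more exposed to divergence than a 32-wide warp:

```python
# Toy model: a batch only avoids executing both sides of a branch when
# all of its lanes agree. Assumes independent per-lane branching, which
# overstates divergence for real, spatially coherent workloads.

def expected_paths(batch_size, p):
    """Expected number of branch sides executed by one batch (1..2)."""
    all_take = p ** batch_size
    none_take = (1 - p) ** batch_size
    return 2 - all_take - none_take

for p in (0.01, 0.05, 0.20):
    print(f"p={p:.2f}: 32-wide ~{expected_paths(32, p):.2f} paths, "
          f"64-wide ~{expected_paths(64, p):.2f} paths")
# The 64-wide batch is always at least as likely to diverge, though the
# gap is small for very rare or very common branches.
```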
 
On a global level, Fermi has a scheduler that builds kernels and farms work out to each SM, which is probably the closest equivalent to a CU.

The arrangement for Tahiti involves more units: AMD seems to put the command processor, a CS pipe, a work distributor, primitive pipes, and pixel pipes in a box labeled Scalable Graphics Engine.
That block is probably responsible for maintaining the heavier graphics context, and I think this includes among other things the primitive pipes and ROPs.
Alongside the graphics engine are two ACE blocks, which skip trying to maintain the API abstraction and just do compute.
Fermi's global scheduler can theoretically track 16 different kernels; the GCN slides indicate multiple tasks can operate on the front end, but no number is given.

The front end of Tahiti versus Cayman has at least one simplification, since the formerly global clause scheduling hardware has now been distributed to the CUs.


A CU versus SM comparison shows a number of differences, such as whether there is a hot clock and the differing number of SIMD blocks per unit.
The division of resources for read and write is different. Fermi has a 64 KB pool of memory that is both the L1 and LDS. There is a small texture cache off to the side as well.
Tahiti has an L1 data/texture cache, and the LDS is off to the side.
There doesn't seem to be an explicitly separate SFU for Tahiti, but it does have an explicit scalar unit with a shared read-only cache supporting it.

Fermi has two schedulers per SM, each of which does dependence tracking and handles the variable-latency operations that come from the different units it can address.
Tahiti has four schedulers, but more uniform operation latencies and conservative issue logic mean it does not track dependences, just whether the last instruction has completed (with a few software-guided run-ahead cases).
Fermi's ALUs work from one 128 KB register file, while Tahiti has split its register file into four 64 KB files, one local to each SIMD.
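As a rough check on the "registers per thread" bullet from earlier, here is the arithmetic using the commonly quoted figures, treated here as assumptions: Fermi with 128 KB of registers per SM and up to 1536 resident threads, GCN with 64 KB per SIMD and up to 10 wavefronts of 64 lanes resident per SIMD:

```python
# Registers per thread at maximum occupancy, given the figures assumed above.

def regs_per_thread(regfile_bytes, resident_threads):
    return (regfile_bytes // 4) / resident_threads   # 32-bit registers

fermi_sm = regs_per_thread(128 * 1024, 1536)        # Fermi SM
gcn_simd = regs_per_thread(64 * 1024, 10 * 64)      # one GCN SIMD

print(f"Fermi SM:  ~{fermi_sm:.0f} registers/thread at full occupancy")
print(f"GCN SIMD:  ~{gcn_simd:.0f} registers/thread at full occupancy")
# ~21 vs ~26: in the same ballpark, and on both architectures heavier
# register use is traded against the number of resident warps/wavefronts.
```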

The philosophical difference between Fermi and Tahiti is probably stronger than the physical differences. Introducing/exposing the scalar unit and dispensing with the clause system means Tahiti has given up on the SIMT pretense. The architecture is explicitly SIMD, whereas prior chips tried to maintain a leaky abstraction of thread = SIMD lane.
There is a lot less hidden state, though I believe AMD has indicated that there is still some with the texture path.

Tahiti's write capability brings it closer in line with Fermi. If a cache line is to be kept coherent, it seems Tahiti is more aggressive at writing back values at wavefront boundaries, while Fermi will not force full writeback until a kernel completes.
Fermi's export functionality seemed to piggyback on the cache hierarchy, with the ROPs using the L2 instead of having their own caches. Its GDS is global memory.
GCN does not seem to do this. There are export instructions and a GDS. The ROPs are drawn as being distinct from the L2 and not ganged to a memory channel like the cache is. There is an export bus distinct from the L2 datapath.
 
Apparently GCN was made to play games whose engine works like Serious Sam 3 BFE:

http://www.pcgameshardware.de/aid,8...xpress-30-und-28-nm/Grafikkarte/Test/?page=13

Or maybe I should say this game seems to suit AMD cards. Or maybe the game just has the kind of complexity that AMD's drivers can cope with.

I'm afraid to say these days there seems little point trying to understand why because there's simply not enough raw data to pick apart the interesting parts of architectures.
 
Well, the game's engine, being DX9 and stuffed with a ton of effects, is probably very heavy on pixel/texture fill-rate and certainly can utilize the generous bandwidth upgrade and raw sampling throughput.
 
I'm curious how much memory bandwidth plays into this - the HD7970 has 'only' 37.5% more memory bandwidth than the GTX 580 (and considering how GDDR5 works the real usable difference might even be slightly lower than that). It would be very interesting to see how these two GPUs compare with either severely underclocked memory clocks (to test bandwidth efficiency) or with severely underclocked core clocks (to test performance with little bandwidth limitation).
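For reference, the 37.5% figure follows straight from the paper specs, assuming reference memory clocks (1002 MHz on the GTX 580 and 1375 MHz on the HD 7970, both on a 384-bit GDDR5 bus):

```python
# Theoretical GDDR5 bandwidth from the paper specs (reference clocks assumed;
# GDDR5 transfers data four times per memory clock).

def gddr5_bandwidth_gbs(bus_width_bits, mem_clock_mhz):
    transfers_per_s = mem_clock_mhz * 1e6 * 4        # quad data rate
    return bus_width_bits / 8 * transfers_per_s / 1e9

gtx580 = gddr5_bandwidth_gbs(384, 1002)   # ~192 GB/s
hd7970 = gddr5_bandwidth_gbs(384, 1375)   # ~264 GB/s
print(f"GTX 580: {gtx580:.0f} GB/s, HD 7970: {hd7970:.0f} GB/s, "
      f"delta: {hd7970 / gtx580 - 1:.1%}")   # ~+37%, the gap discussed above
```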
 
I'm curious how much memory bandwidth plays into this - the HD7970 has 'only' 37.5% more memory bandwidth than the GTX 580 (and considering how GDDR5 works the real usable difference might even be slightly lower than that). It would be very interesting to see how these two GPUs compare with either severely underclocked memory clocks (to test bandwidth efficiency) or with severely underclocked core clocks (to test performance with little bandwidth limitation).
Incidentally, computerbase ran this benchmark on the 7970 with the memory bandwidth of the 6970. http://www.computerbase.de/artikel/...deon-hd-7970/20/#abschnitt_384_bit_in_spielen - judging by these results, bandwidth isn't really important in that benchmark; a 10% increase for 50% more memory bandwidth is not much. This is without SSAA, though, so I guess the results could potentially be quite different (or not...) otherwise.
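One crude way to read "+50% bandwidth gives +10% performance" is to assume a fixed fraction of frame time scales inversely with memory bandwidth and the rest does not; that is a gross simplification (real frames bounce between many bottlenecks), but it gives a feel for the numbers:

```python
# Naive two-component frame time model: a fraction f scales with 1/bandwidth,
# the rest is bandwidth-independent. Solve 1/speedup = (1 - f) + f/bw_scale.

def implied_bw_bound_fraction(bw_scale, speedup):
    return (1 - 1 / speedup) / (1 - 1 / bw_scale)

f = implied_bw_bound_fraction(bw_scale=1.5, speedup=1.10)
print(f"~{f:.0%} of frame time behaves as bandwidth-bound in this model")
# Roughly 27%: consistent with "bandwidth isn't really important" in that
# particular test, but far from negligible either.
```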
 
Incidentally, computerbase ran this benchmark on the 7970 with the memory bandwidth of the 6970. http://www.computerbase.de/artikel/...deon-hd-7970/20/#abschnitt_384_bit_in_spielen - judging by these results, bandwidth isn't really important in that benchmark; a 10% increase for 50% more memory bandwidth is not much. This is without SSAA, though, so I guess the results could potentially be quite different (or not...) otherwise.
In the bring-up and optimisation process we ran Tahiti in an 8CH configuration as well as 12 to compare how it was doing against Cayman. It was common to see 20%-30% performance improvements for the 12 channel case.
 
In the bring-up and optimisation process we ran Tahiti in an 8CH configuration as well as 12 to compare how it was doing against Cayman. It was common to see 20%-30% performance improvements for the 12 channel case.
Very nice - so it does benefit significantly from the higher bandwidth, but not so much that it's really bandwidth starved. I assume 12CH vs 8CH makes a bigger difference than the 2/3rds clock rate computerbase tested because of how the memory controllers and GDDR5 error correction work. If you look at the article, some games barely scale at all with memory bandwidth (2-3% at most) while many others scale between 20 and 25%, so the average for games that do scale with it at all is pretty good.

So the good news there is that if future drivers improve shader core performance a lot then there'll be enough bandwidth to keep it fed. However there is still one thing I do not understand about the computerbase article: why do games barely scale more with 4xAA/16xAF than without it? There are plenty of TMUs so presumably it's not being too limited by AF filtering performance. Is it being bottlenecked by the 32 ROPs? I don't think it makes sense to that extent but I don't understand what else it could be... Surely your framebuffer compression algorithms aren't so good that MSAA doesn't increase bandwidth at all! :)
 
So the good news there is that if future drivers improve shader core performance a lot then there'll be enough bandwidth to keep it fed. However there is still one thing I do not understand about the computerbase article: why do games barely scale more with 4xAA/16xAF than without it? There are plenty of TMUs so presumably it's not being too limited by AF filtering performance. Is it being bottlenecked by the 32 ROPs? I don't think it makes sense to that extent but I don't understand what else it could be... Surely your framebuffer compression algorithms aren't so good that MSAA doesn't increase bandwidth at all! :)
I think that GCN is definitely limited somehow at the ROP level; alpha blending rates make it clear that BW is not the problem in this case. Regarding AF, I don't think the count of the texturing units alone is a decisive factor for the performance hit; raw texturing throughput doesn't diminish the AF sampling latency, so the relative performance hit is more or less a constant here. One thing is different here: the texture L1 now operates completely differently from the previous architectures going back to RV770. On top of that, access to this L1 R/W cache is now shared by the TMUs and the ALUs, while Fermi kept the dedicated texture streaming cache alongside its new L1, which is mostly used for register spills.
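To put a rough number on the "blending isn't bandwidth limited" point: assuming the reference 925 MHz core clock, 32 ROPs, plain RGBA8 blending (one 4-byte read plus one 4-byte write per pixel) and no framebuffer compression, the bandwidth needed to sustain peak blend rate already fits within what the card has:

```python
# Back-of-envelope, under the assumptions stated above (no compression,
# RGBA8 blending, peak theoretical fill rate).

CORE_CLOCK_HZ = 925e6
ROPS = 32
BYTES_PER_BLENDED_PIXEL = 4 + 4        # read destination + write result

peak_fill_pix_s = ROPS * CORE_CLOCK_HZ
bandwidth_needed = peak_fill_pix_s * BYTES_PER_BLENDED_PIXEL / 1e9  # GB/s

print(f"Peak blend fill: {peak_fill_pix_s / 1e9:.1f} Gpix/s")
print(f"Bandwidth to sustain it: ~{bandwidth_needed:.0f} GB/s "
      f"(vs ~264 GB/s available)")
# ~237 GB/s needed vs ~264 GB/s on tap, so simple RGBA8 blending can hit the
# ROP limit before the memory bus does, in line with the observation above.
```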
 
What still baffles me is that a GTX 580 compares incredibly poorly in the targeted synthetic benchmarks on ixbt.com and in pure specs (texture units, bandwidth, GFLOPS, etc.), yet still performs acceptably in the gaming benchmarks.

My opinion here is that we're walking quite a few miles down the complexity path while frequently ignoring the obvious answer to this question: hardware is little to nothing without software. NVIDIA gets two rather important things very right: investment in their own SW, which translates into a very solid SW stack across the board, from their graphics/gaming drivers to the compute side, and investment in devrel, which in this business more often than not means they get to do the work and just ship finished code that hooks into some codebase.

Neither games nor the early compute efforts are pedal-to-the-metal, all-around hyper-optimised efforts on any level (for various reasons that would sit well in a separate discussion), so having a worse driver stack and less optimisation and qualification done on your HW is not something that can easily be undone through sheer throughput on ATI's side. The red herring of ATI VLIW utilisation and whatnot will be gone from future debates, I guess, but the underlying SW weakness will remain for the foreseeable future. Directed tests take away almost all of the burden from the software stack, since it gets fed simplistic and often optimised code, so it's far harder for it to flounder, hence why they end up more accurately following hardware differences.
 
http://www.anandtech.com/show/5261/amd-radeon-hd-7970-review/25

The 7970 is losing to the 580 by 10-30% in places. I don't think the drivers are so horrible, or the code so biased, as to explain why the 7970 is losing.

I do to a significant extent, especially given what those tests are. I think overestimating the prowess of ATI's drivers for intricate work or the maturity of their CL stack whilst at the same time underestimating the impact of not being the lead development platform is a risky business.
 
In the bring-up and optimisation process we ran Tahiti in an 8CH configuration as well as 12 to compare how it was doing against Cayman. It was common to see 20%-30% performance improvements for the 12 channel case.
Yes, that sounds more like it. The cb numbers, at only ~20%, looked lower than what I expected on average too (the highest was 27%, which is more like what I expected as the average). Maybe the mix of games (or the settings used) there just isn't very bandwidth sensitive.
 