NVIDIA Fermi: Architecture discussion

Brilliant post there, Jawed. One worthy of archiving.

:love: :love: :love:

One minor comment. Aren't the caches on Intel CPUs inclusive? That would make L1 a part of L2, so it would not be 32K L1 + 256K L2.
 
Only as long as your algorithm has absolutely no dependency on memory performance. If there is any dependence, of course it's a matter of how much marginal improvement there is to be had in vectorising the data accesses.

Memory is another problem, and here I think Fermi's advantage is even greater. Fermi has 128KB L2 cache per memory channel (64 bits), which should help a lot.

Yes, you can always wait a year or two for NVidia's performance to catch up; in the meantime you've got a cosy coding environment and all the other niceties of NVidia's solution. And with CUDA, specifically (in theory anyway), NVidia is going places that AMD won't be bothered about for a couple of years - and there's a decent chance those things will improve performance due to better algorithms, so your loss is likely lower if your problem is at all complex.

My point is, for some applications, NVIDIA may already have a performance advantage, so you don't have to wait a year or two.

Though right now I'm hard-pressed to name anything in Fermi that makes for better performance because it allows for more advanced algorithms (that's partly because I don't know if AMD has done lots of compute-specific tweaks - only hints and D3D11/OpenCL leave plenty of room). Gotta wait and see.

That's true. I hope when AMD release their OpenCL implementations, they can also release some hints about optimizing for RV870, like NVIDIA's performance guide. Anything will help.
 
Though right now I'm hard-pressed to name anything in Fermi that makes for better performance because it allows for more advanced algorithms
Why should it? With the addition of atomics GPUs became as general as needed IMO. It's just about performance now, better caching, lower cost fencing, lower cost atomics. The L2 should help a lot in that regard ... perhaps when constructing bounding volume hierarchies? Or building the data structure for an irregular Z-buffer?
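As a rough illustration (a hypothetical CUDA kernel, not taken from either vendor's documentation), this is the kind of atomic-heavy data-structure building - compacting a list of items into a global queue - where cached, lower-cost atomics and a shared L2 should pay off:

Code:
// Hypothetical sketch: append indices of "interesting" items into a global
// work queue with atomicAdd - the access pattern behind things like BVH
// construction or irregular Z-buffer building, where atomic cost dominates.
__global__ void appendItems(const float *keys, int *queue, int *count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && keys[i] > 0.0f) {
        int slot = atomicAdd(count, 1);   // contended global atomic
        queue[slot] = i;                  // scattered write, ideally serviced by L2
    }
}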
 
Theoretically this shouldn't cost anything at all, because AFAIK registers are accessed on a simple bus connecting memory locations together. But I could be wrong.
There's always a trade amongst capacity, granularity, bandwidth and latency.

Each of the 4 banks (X, Y, Z and W) in the existing design fetches 4 scalars (i.e. one scalar for each of 4 strands) from 3 different addresses, i.e. one address per clock (the fourth clock being for other operations, i.e. TEX/LDS and DMA).

e.g.

Code:
X - 2 3 7
Y - 3 4 8
Z - 1 5 9
W - 1 2 6

You are proposing fetching from 3 addresses per clock, i.e. like Larrabee. Larrabee's register file is small (32KB). Despite that, it's still quite costly to implement 3 address generators and 3 ports, rather than the single address generator and 3 virtual ports in ATI's design.

Jawed
 
Aren't the caches on Intel CPUs inclusive? That would make L1 a part of L2, so it would not be 32K L1 + 256K L2.
Larrabee's cache architecture is a bit of a murk at the moment. I expect GF100's cache is inclusive for texels at least. In ATI, the L2 and L1, which are dedicated to texels, are inclusive. Indeed, texels will appear in multiple L1s in normal rendering.

There's too much unknown about all 3 of these GPUs right now, even if R800 is the best known. There are supposed to be compute-specific tweaks in there. For instance AMD introduced atomics. GF100's atomics are "done right", i.e. cached (unlike the performance-cliff disaster-zone of GT200). There's no reason why ATI's aren't, either. But the bigger picture of ATI's design is another murk. And I can't help thinking that R900 is an overhaul.

Jawed
 
Larrabee's cache architecture is a bit of a murk at the moment. I expect GF100's cache is inclusive for texels at least. In ATI, the L2 and L1, which are dedicated to texels, are inclusive. Indeed, texels will appear in multiple L1s in normal rendering.

OTOH, I am pretty sure that LRB has inclusive caches. After all, Intel has almost always had inclusive caches on its CPUs.

And I can't help thinking that R900 is an overhaul.

Is it just the perf scaling disappointment from Cypress, or the rumours of an upcoming real DX11 GPU?
 
Memory is another problem, and here I think Fermi's advantage is even greater. Fermi has 128KB L2 cache per memory channel (64 bits), which should help a lot.
It appears that L2 is read-write shared by texels/fetches and render-targets/global-memory-resources.

HD5870 has 128KB L2 per MC dedicated to texels (read). The question is whether there's a cache for writes. The capability for the texture units to read the render target (for SSAA) appears to indicate that there is a cached write-read path on-die (which also supports fast global atomics?). But it might not be what we'd like to call a cache (hence my expectations for a proper overhaul in R900).

The key advantage that GF100 appears to have is that each core has a unified read/write L1 that caches global memory. Though some L1 capacity appears to always be lost to shared memory, as a minimum of 16KB of shared memory is always allocated (in that case this memory can be used for register spill).
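As a sketch of how that partitioning might be used (assuming the CUDA runtime's per-kernel cache-preference hint; the kernel itself is made up), a kernel dominated by scattered global reads would ask for the larger L1 split:

Code:
// Sketch only: hint the L1 vs shared-memory split per kernel. The 16KB
// shared-memory minimum is still allocated either way; the remaining
// on-chip storage serves as L1 for global-memory accesses.
#include <cuda_runtime.h>

__global__ void gatherKernel(const float *in, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];   // irregular reads benefit from a bigger L1
}

void configureForGather()
{
    // Prefer the larger L1 partition for this kernel (hypothetical usage).
    cudaFuncSetCacheConfig(gatherKernel, cudaFuncCachePreferL1);
}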

My point is, for some applications, NVIDIA may already have a performance advantage, so you don't have to wait a year or two.
I won't disagree, but it's still too early to compare.

That's true. I hope when AMD release their OpenCL implementations, they can also release some hints about optimizing for RV870, like NVIDIA's performance guide. Anything will help.
I'm pessimistic.

Also, I'm not sure what to make of the prospect of an OpenCL-extensions arms race, like the OpenGL arms race that's been going on for years now.

Jawed
 
Why should it? With the addition of atomics GPUs became as general as needed IMO. It's just about performance now, better caching, lower cost fencing, lower cost atomics. The L2 should help a lot in that regard ... perhaps when constructing bounding volume hierarchies? Or building the data structure for an irregular Z-buffer?
For example AMD added a SAD instruction, which "makes motion estimation faster in video-encoding". That's uber-specific.

I was hoping the Fermi whitepaper would mention something about Scan (the bitwise Scan I linked patent documents for a while back), and whether it's an intrinsic - it's much more interesting than SAD.

That would make a whole pile of stuff much faster, I would hope. (It might turn out that as an intrinsic required for append/consume, ATI does this too - who knows eh?)
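For reference, here's roughly what such an intrinsic would be replacing - a plain software block-wide inclusive scan in shared memory (a sketch only, nothing to do with the patented hardware Scan):

Code:
// Sketch: Hillis-Steele inclusive prefix sum over one block in shared memory.
// Launch with dynamic shared memory of blockDim.x * sizeof(int) bytes.
__global__ void blockInclusiveScan(const int *in, int *out, int n)
{
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    temp[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int addend = (tid >= offset) ? temp[tid - offset] : 0;
        __syncthreads();                   // finish reads before writing
        temp[tid] += addend;
        __syncthreads();                   // finish writes before next read
    }
    if (tid < n)
        out[tid] = temp[tid];
}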

It's also like how shared memory was a new kind of feature (moving away from the streaming model) that, in something like Folding@home (for small proteins at least), results in a radical performance advantage.

Fast register spills to/from memory (via cache) are theoretically explosive.

Jawed
 
I don't think you understand what I'm saying. I'm not saying you need 4 times the register file, nor 4 times the BW.

You're just saying you need 4X as many ports or banks on a given amount of register file space. Which is really really bad. I applaud you for looking into things like this as it's interesting to think about, but it would be helpful if you thought about the hardware implications.

How much extra area/power do you think this would chew up?

As an example, if you have 64 batches needing 4kB of register space each (16 floats per thread), then the current design will only have to possibly access from an 8 kB subset of the 256 kB register file during any 8 cycle period. My proposal will have to access from a 32 kB subset. Both designs fetch 1 kB of data per cycle.

It's not how much data you fetch, it's where you fetch from. If you want to fetch 1KB of data from a single register file...you're DOA. Fetching from 2 reg files is still awful...but fetching from 16 different ones might be reasonable.

Theoretically this shouldn't cost anything at all, because AFAIK registers are accessed on a simple bus connecting memory locations together. But I could be wrong.

I think it would behoove you to understand how register files are designed in GPUs. There is at least one patent that I referenced in my article.

David
 
Is it just the perf scaling disappointment from Cypress, or the rumours of an upcoming real DX11 GPU?
Going back through prior generations, lots of games don't scale so well with new GPUs, so I'm less bothered by the scaling. There are games scaling by 80%+ on HD5870 over HD4890.

Though things like edge-detect MSAA don't look so happy - I'd like to see a detailed investigation of that. It's interesting to read:

http://www.bungie.net/images/Inside/publications/siggraph/AMD/Yang_AMD_Siggraph2009.pptx

which describes the algorithm - it's quite hairy.

I'm thinking more generally about compute (which, with D3D11 alone, makes memory a first-class read-write resource) and the losing race that fixed-function GPUs are in against something like Larrabee.

Jawed
 
The P4 has 4 instruction streams (2 threads/core), and can simultaneously execute 128b SIMD ops from each instruction stream.
Ok, I see your point. I've been looking at it too much from a micro level.

How is Fermi MIMD in a way that Nehalem or a dual-core P4 isn't?
I was being facetious, sorry.

My issue is that calling a GPU MIMD outright hides the soul of the architecture. You can very well argue that RV870 or GT200 are MIMD too, but that helps little from a practical engineering point of view (SW or HW).

But, again, I see the point. I'll shut up for now. ;)
 
Only as long as your algorithm has absolutely no dependency on memory performance. If there is any dependence, of course it's a matter of how much marginal improvement there is to be had in vectorising the data accesses.

In double-precision, ATI's ALUs are scalar for MULs and MADs and vec2 for ADDs. GF100 at 1.5GHz will be slower for DP-ADD than HD5870.
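A quick back-of-the-envelope (assuming HD5870's 850MHz clock with vec2 DP-ADD across 320 VLIW units, and half-rate DP across 512 ALUs at the rumoured 1.5GHz hot clock for GF100): HD5870 manages 320 x 2 x 0.85GHz ≈ 544G DP-ADDs/s, while GF100 would manage 256 x 1.5GHz = 384G DP-ADDs/s.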


Yes, you can always wait a year or two for NVidia's performance to catch up; in the meantime you've got a cosy coding environment and all the other niceties of NVidia's solution. And with CUDA, specifically (in theory anyway), NVidia is going places that AMD won't be bothered about for a couple of years - and there's a decent chance those things will improve performance due to better algorithms, so your loss is likely lower if your problem is at all complex.

Though right now I'm hard-pressed to name anything in Fermi that makes for better performance because it allows for more advanced algorithms (that's partly because I don't know if AMD has done lots of compute-specific tweaks - only hints and D3D11/OpenCL leave plenty of room). Gotta wait and see.



Broad comparison of compute at the core level:
  • ATI (mostly ignoring control flow processor and high level command processor)
    • thread size 64
    • in-order issue 5-way VLIW
    • slow double-precision
    • "statically" allocated register file, with spill (very slow spill?) and strand-shared registers
    • large register file (256KB) + minimal shared memory (32KB) + small read-only L1 (8KB?)
    • high complexity register file accesses (simultaneous ALU, TU and DMA access), coupled with in-pipe registers
    • separate DMA in/out of registers instead of load-store addresses in instructions
    • stack-based predication (limited capacity of 32) for stall-less control flow (zero-overhead)
    • static calls, restricted recursion
    • 128 threads in flight
    • 8 (?) kernels
  • Intel (ignoring the scalar x86 part of the core)
    • thread size 16
    • in-order purely scalar-issue (no separate transcendental unit - but RCP, LOG2, EXP2 instructions)
    • half-throughput double-precision
    • entirely transient register file
    • small register file, large cache (256KB L2 + 32KB L1) (+ separate texture cache inaccessible by core), no dedicated shared memory
    • medium complexity register file (3 operands fetch, 1 resultant store)
    • branch prediction coupled with 16 predicate registers (zero-overhead apart from mis-predictions)
    • dynamic calls, arbitrary recursion
    • 4 threads in flight
    • 4 kernels
  • NVidia (unknown internal processor hierarchy)
    • thread size of 32
    • in-order superscalar issue across three-SIMD vector unit: 2x SP-MAD + special function unit (not "multi-function interpolator")
    • half-throughput double-precision
    • "statically" allocated register file, with spill (fast, cached?)
    • medium register file + medium-sized multi-functional cache/shared-memory
    • super-scalar register file accesses (for ALUs; TUs too?)
    • predicate-based stall-less branching (with dedicated branch evaluation?)
    • dynamic calls, arbitrary recursion
    • 32 threads in flight
    • 1 kernel
I don't think it's been commented on explicitly so far in this thread, but NVidia has got rid of the out-of-order instruction despatch, which scoreboarded each instruction (to assess dependency on prior instructions). Now NVidia is scoreboarding threads, which should save a few transistors, as operand-readiness is evaluated per thread and instructions are issued purely sequentially.

Jawed

A few comments and thoughts - first of all your terminology is very very confused, largely due to the fact that all the different GPU guys have bizarre and non-standard terminology. Trying to straighten that out is a pretty heroic effort.

ATI
  1. ATI's cores are not the VLIWs, they are the SIMDs.
  2. It's not really clear how the calls work
  3. ATI has 128KB cache/memory controller
  4. It's a wavefront, not a thread.
  5. The shared memory is ~same as NV, the difference is shmem/vector lane.

Intel
  1. There are 16 vector lanes, not threads
  2. The L1D is shared memory
  3. The L1D can be accessed every cycle by the vector pipe
  4. Intel threads != NV or ATI threads

Anyway, I think the list is a great starting point for comparison, but needs to be cleaned up a bit. Thanks for putting it together.


David
 
I'm not so sure that Larrabee is actually half the rate for double precision. If that's half the vector width, it doesn't sound like more than 1/4th in real time.
 
Comments:

ATI's cache hierarchy has a number of peculiarities. Please correct me as I may be wrong.

1. For ATI, there is also 64KB GDS which should be usable on RV8xx.

2. For RV770, AFAIK reads from the global buffer were not cached, and I certainly always got much less effective read bandwidth in my tests from the global buffer than from 2D resources.

3. David Kanter already posted about 128KB L2/MC = 512KB L2 on RV8xx. This is read-only.
 
Comments:

ATI's cache hierarchy has a number of peculiarities. Please correct me as I may be wrong.

1. For ATI, there is also 64KB GDS which should be usable on RV8xx.

Maybe, sometime in the future once it gets exposed. For the time being you can't explicitly leverage it.
 
I'm not so sure that Larrabee is actually half the rate for double precision. If that's half the vector width, it doesn't sound like more than 1/4th in real time.
No, some PowerPoint presentations Intel released, when turned into a WAV file and played backwards, prove without a doubt Larrabee is half rate.
 
You are proposing fetching from 3 addresses per clock, i.e. like Larrabee.
I was when I thought ATI's architecture already does this, but all along I've been proposing that we fetch operands at the same rate as now. All I'm asking for is the ability to execute serially dependent instructions.

Like I said before, compilers have a lot of freedom with scalar code. If you have a bunch of serially dependent instructions, you can allocate the registers so that their data is distributed across the channels, minimizing register port limitations.
 
Brilliant post there, Jawed. One worthy of archiving.

:love: :love: :love:

One minor comment. Aren't the caches on Intel CPUs inclusive? That would make L1 a part of L2, so it would not be 32K L1 + 256K L2.

Intel CPUs are inclusive; AMD's are exclusive (L1/L2, not sure about the new L3).
 