NVIDIA Fermi: Architecture discussion

There has been tons of research on how either of these can be optimized away. The only price is giving up the ability to load classes at runtime (as in JVM/.NET/Ruby/Python etc.).
Yes, you can optimize a lot away, but a true polymorphic call site is equivalent to a data-dependent branch. (My day job is working on devirtualization optimizations in a static compiler.)
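To make the "data-dependent branch" point concrete, here is a minimal C++ sketch (the class names are purely illustrative, not from anyone's compiler): a truly polymorphic call site has to load a function pointer out of the object's vtable and branch through it, while a devirtualized call, once the concrete type is proven, becomes a direct (and usually inlinable) call.

```cpp
#include <cstdio>

struct Shape {
    virtual ~Shape() = default;
    virtual float area() const = 0;   // the polymorphic call target
};

struct Circle : Shape {
    float r;
    explicit Circle(float r) : r(r) {}
    float area() const override { return 3.14159265f * r * r; }
};

// Truly polymorphic call site: the compiler must emit an indirect,
// data-dependent branch through s's vtable pointer.
float area_virtual(const Shape& s) { return s.area(); }

// Devirtualized form: once analysis proves the dynamic type is Circle,
// the call can be made direct (and typically inlined).
float area_devirtualized(const Circle& c) { return c.Circle::area(); }

int main() {
    Circle c(2.0f);
    std::printf("%f %f\n", area_virtual(c), area_devirtualized(c));
    return 0;
}
```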
I have no intention of acting as the programming-paradigm-definition police, but the three mentioned language traits (purity, functional and OO) can be (and are!) mixed in any combination. Choosing the OO paradigm doesn't bind you to side effects, just like choosing the functional paradigm doesn't bind you to a Hindley-Milner type system (despite the fact that in most existing cases the opposite is true).

It doesn't, but languages which combine the two tend to have excessively complex type systems. This is covered in detail in the original paper on Scala. FP languages with OO tend to go the route of typeclasses and avoid runtime types. Those that don't end up with a variant of CLOS.
 
I don't think so. If you can run 16 kernels in parallel on Fermi, then how is it not MIMD-like? Perhaps it would be more appropriate to say that it is a transition from SPMD to MPMD.

MIMD, SIMD, etc. are all just relatively vague categorizations from the earlier days of computing. They are basically meaningless as most people use them.

SSE is a SIMD instruction set because you cannot simultaneously execute different operations on different elements of a vector. The same is true of a warp in NV's architecture - you end up predicating.
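To illustrate what "you end up predicating" looks like, here is a toy host-side model (not vendor code; the width and values are made up) of a 4-wide vector/warp hitting an if/else: both sides are computed for every lane and a per-lane mask selects which result survives.

```cpp
#include <array>
#include <cstdio>

constexpr int WIDTH = 4;   // toy warp/vector width

int main() {
    std::array<float, WIDTH> x   = {1.0f, -2.0f, 3.0f, -4.0f};
    std::array<float, WIDTH> out = {};

    // Source-level branch:  out = (x > 0) ? x * 2 : -x;
    // A SIMD machine evaluates BOTH paths for all lanes and predicates:
    for (int lane = 0; lane < WIDTH; ++lane) {
        bool  pred      = x[lane] > 0.0f;   // per-lane predicate
        float taken     = x[lane] * 2.0f;   // "then" path, computed for every lane
        float not_taken = -x[lane];         // "else" path, computed for every lane
        out[lane] = pred ? taken : not_taken;   // per-lane select
    }

    for (float v : out) std::printf("%g ", v);
    std::printf("\n");
    return 0;
}
```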

Fermi itself might be considered MIMD, but then again so is a dual-core P4...

MIMD, SIMD, SISD and MISD are all meaningless unless you specify the granularity.

David
 
I don't think so. If you can run 16 kernels in parallel on Fermi, then how is it not MIMD-like? Perhaps it would be more appropriate to say that it is a transition from SPMD to MPMD.
David is correct that the terminology is vague and depends on the level at which you look at things. Both AMD and Nvidia GPUs are SIMD in the sense that a single instruction is executed for every thread in a wavefront/warp, and predication is used for situations where divergence occurs within a wavefront/warp.
 
At this point I'm modestly interested in how one can say Larrabee is somehow free of SIMD baggage without some intellectual dishonesty.
Maybe I'm getting hung up on the 512-bit SIMD unit per core.

With Fermi allowing 16 kernels to operate on-chip at the same time, how non-MIMD would it be compared to Larrabee?
LRB isn't MIMD either. It has 32 cores, each with a 16-wide SIMD unit.
 
Right, but as I understand it, each Larrabee core has 4 thread contexts, and those contexts are free to be executing different kernels. So for an equivalent number of cores, Larrabee can execute 4x as many different programs vs Fermi. This seems like it might be quite important for Larrabee, as the various kernels required for processing a single frame-buffer tile can be simultaneously executing on a single core; this is good for memory locality. On Fermi, if you wanted to keep a tile local to a core, it looks like you'd have to wait for each kernel to finish completely before starting the next one, which could really cut into the efficiency of the chip, especially for small tiles.
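For reference, this is roughly how "multiple kernels in flight" is expressed from the CUDA side: independent kernels launched into separate streams, which the hardware is then free to overlap (whether they actually run concurrently depends on the chip and on occupancy). The kernels and sizes below are made up for illustration; this is a sketch, not a claim about how Fermi schedules them internally.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Two trivially independent kernels standing in for different passes
// of a real workload (illustrative only).
__global__ void scale(float* d, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
}

__global__ void offset(float* d, float o, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += o;
}

int main() {
    const int n = 1 << 16;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Independent launches in different streams: on hardware that supports
    // concurrent kernels these may overlap on-chip.
    scale <<<(n + 255) / 256, 256, 0, s0>>>(a, 2.0f, n);
    offset<<<(n + 255) / 256, 256, 0, s1>>>(b, 1.0f, n);

    cudaDeviceSynchronize();
    std::printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```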
 
Basically the compiler makes a big dependency graph and groups together loads when it can. The average group size effectively multiplies latency hiding.

But in that simplistic scheme you are not necessarily optimizing latency hiding. I don't know how AMD's implementation works but if it is as you describe then clauses are made as big as possible with the expectation that all the latency in the running load clauses will be covered by available ALU clauses across all wavefronts. I can think of cases where that would be sub-optimal.

TEX-5xALU-TEX-4xALU is worse than 2xTEX-9xALU.

For example, you say here that running all the loads and then all the ALU stuff is better. But what if I have LD-20xALU-LD-2xALU but the second LD takes an inordinate amount of time? So 2xLD-22xALU will unnecessarily stall the ALU pipeline. Or is that still not a problem when you have an abundance of threads?
 
Fermi itself might be considered MIMD, but then again so is a dual-core P4...
If you're going to call a dual-core P4 MIMD, then, yeah, all meaning has been lost. But nobody with even a slight amount of HW architecture knowledge would do so without a very long disclaimer attached.

So Fermi is MIMD already. GT200 was MIMD also: after all, its different SMs are executing different instructions at the same time, and some SMs can stop executing while others are still running.

Still, I'm waiting anxiously to see exactly what kind of MIMD-like features are still being hidden from us.
 
Sorry to butt in on the pro discussions, I'd like to ask a noob question: has the estimated performance of GF100 been mentioned anywhere in this long thread?
Say with best-case redundancy all core units of the uarchitecture are enabled, the core is clocked like a GTX 275, and there's 1.5 GB of 4.8 GHz GDDR5 on a 384-bit bus... where would this GPU stand? Crysis at 60 frames per sec?
 
If you're going to call a dual-core P4 MIMD, then, yeah, all meaning has been lost. But nobody with even a slight amount of HW architecture knowledge would do so without a very long disclaimer attached.

How so? The P4 has 4 instruction streams (2 threads/core), and can simultaneously execute 128b SIMD ops from each instruction stream.
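A small host-side sketch of the point being made (purely illustrative): several independent instruction streams, each of which issues 128-bit SIMD instructions, i.e. "MIMD" at the thread/core granularity and SIMD at the instruction granularity.

```cpp
#include <emmintrin.h>   // SSE2, which the P4 supports
#include <cstdio>
#include <thread>

// Each instruction stream has its own scalar control flow but issues
// 128-bit SIMD ops: MIMD across threads, SIMD within each instruction.
void stream(float base, float* out) {
    __m128 v = _mm_set1_ps(base);
    for (int i = 0; i < 1000; ++i)
        v = _mm_add_ps(v, _mm_set1_ps(1.0f));
    _mm_storeu_ps(out, v);
}

int main() {
    float r0[4], r1[4], r2[4], r3[4];
    std::thread t0(stream, 0.0f, r0), t1(stream, 10.0f, r1),
                t2(stream, 20.0f, r2), t3(stream, 30.0f, r3);
    t0.join(); t1.join(); t2.join(); t3.join();
    std::printf("%g %g %g %g\n", r0[0], r1[0], r2[0], r3[0]);
    return 0;
}
```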

So Fermi is MIMD already. GT200 was MIMD also: after all, its different SMs are executing different instructions at the same time, and some SMs can stop executing while others are still running.

Still, I'm waiting anxiously to see exactly what kind of MIMD-like features are still being hidden from us.

How is Fermi MIMD in a way that Nehalem or a dual-core P4 isn't?

DK
 
No, I wrote MIMD-similar units, and you will understand it when Nvidia launches the card.


I reported it on May 15th; the tape-out was of course before that (exactly when, I do not know).


I am confused these days. Could it be that 2.4B transistors and 512-bit was planned for the desktop card, and the reported 3.0B transistors and 384-bit are based on the Tesla card with its many double-precision units?
2547 gigaflops is wrong. I speculate that the next-generation chip will furthermore have MADD and MUL per core.

I'm hoping the gaming version of Fermi will have the double-precision and supercomputing stuff yanked out. They need a version of this chip optimized to push graphics.
 
For example, you say here that running all the loads and then all the ALU stuff is better. But what if I have LD-20xALU-LD-2xALU but the second LD takes an inordinate amount of time? So 2xLD-22xALU will unnecessarily stall the ALU pipeline. Or is that still not a problem when you have an abundance of threads?
You're right on that last point. If you have a decent number of batches, it doesn't matter how the LD clauses are distributed in the program (unless, of course, you lump together so many loads that you need significantly more register space to store the values). All that matters for latency hiding is number of batches and average LD clause throughput. Practically speaking, though, tens of batches is statistically a bit insufficient, so a little bit of cleverness (or maybe just randomness) in scheduling helps to avoid what I guess could be called temporal batch aliasing.

In fact, because you grouped the loads together in your second scenario, that will definitely have better latency hiding. Avg LD clause throughput goes down with larger avg TEX clause size, so you get double the latency hiding for the same program.

Since you're showing such an interest in this, I'll send you a PM with my simulator.
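The poster's simulator isn't in the thread, so here is a deliberately crude stand-in (all structure and numbers are assumptions, not AMD's implementation): one shared ALU pipe, a fixed memory latency, and loads within a clause assumed pipelined so a single round-trip covers the whole clause. With few batches in flight the grouped 2xLD-22xALU shape hides latency much better; as the batch count grows, both shapes converge toward the ALU-bound limit.

```cpp
#include <cstdio>
#include <vector>

// Toy model only: one shared ALU pipe and a fixed memory latency.
struct Clause { int alu_cycles; int loads; };                   // loads > 0 => memory clause
struct Batch  { std::vector<Clause> prog; size_t pc; long ready_at; };

long simulate(const std::vector<Clause>& prog, int nbatches, int load_latency) {
    std::vector<Batch> b(nbatches, Batch{prog, 0, 0});
    long t = 0;
    for (;;) {
        bool any_left = false, issued = false;
        long next_ready = -1;
        for (Batch& batch : b) {
            if (batch.pc >= batch.prog.size()) continue;        // batch finished
            any_left = true;
            if (batch.ready_at > t) {                           // still waiting on memory
                if (next_ready < 0 || batch.ready_at < next_ready) next_ready = batch.ready_at;
                continue;
            }
            const Clause& c = batch.prog[batch.pc++];
            if (c.loads > 0) batch.ready_at = t + load_latency; // issue loads, go to sleep
            else             t += c.alu_cycles;                 // occupy the ALU
            issued = true;
            break;                                              // one scheduling decision per step
        }
        if (!any_left) return t;
        if (!issued)   t = next_ready;                          // ALU stall: nothing is ready
    }
}

int main() {
    const int LAT = 400;   // assumed memory latency in ALU cycles
    // The two shapes from the discussion: LD-20xALU-LD-2xALU vs 2xLD-22xALU.
    std::vector<Clause> split   = {{0, 1}, {20, 0}, {0, 1}, {2, 0}};
    std::vector<Clause> grouped = {{0, 2}, {22, 0}};
    const int counts[] = {1, 4, 16, 64};
    for (int n : counts)
        std::printf("batches=%2d  split=%5ld  grouped=%5ld cycles\n",
                    n, simulate(split, n, LAT), simulate(grouped, n, LAT));
    return 0;
}
```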
 
Well, that's one way of reducing the compute density.

Jawed
I don't think you understand what I'm saying. I'm not saying you need 4 times the register file, nor 4 times the BW.

As an example, if you have 64 batches needing 4kB of register space each (16 floats per thread), then the current design will only have to possibly access from a 8 kB subset of the 256 kB register file during any 8 cycle period. My proposal will have to access from a 32 kB subset. Both designs fetch 1 kB of data per cycle.

Theoretically this shouldn't cost anything at all, because AFAIK registers are accessed on a simple bus connecting memory locations together. But I could be wrong.
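Spelling out the arithmetic in that example (all figures come from the post above, treated here as assumptions rather than hardware documentation): 64 threads per batch times 16 floats each gives 4 kB per batch and 256 kB in total, while a 1 kB/cycle port over an 8-cycle window moves 8 kB regardless of how many batches are eligible to source it.

```cpp
#include <cstdio>

// Figures from the post above, treated as assumptions for illustration.
constexpr int threads_per_batch = 64;
constexpr int floats_per_thread = 16;
constexpr int bytes_per_float   = 4;
constexpr int batches           = 64;

constexpr int bytes_per_batch     = threads_per_batch * floats_per_thread * bytes_per_float; // 4 kB
constexpr int total_register_file = batches * bytes_per_batch;                               // 256 kB

constexpr int fetch_per_cycle = 1024;   // 1 kB of register reads per cycle
constexpr int window_cycles   = 8;

// Current scheme: ~2 batches eligible per 8-cycle window -> 8 kB subset.
// Proposed scheme: ~8 batches eligible per window        -> 32 kB subset.
constexpr int subset_current  = 2 * bytes_per_batch;
constexpr int subset_proposed = 8 * bytes_per_batch;

static_assert(bytes_per_batch == 4 * 1024,                 "4 kB of registers per batch");
static_assert(total_register_file == 256 * 1024,           "256 kB register file");
static_assert(fetch_per_cycle * window_cycles == 8 * 1024, "8 kB fetched per 8-cycle window");

int main() {
    std::printf("subset: current %d kB vs proposed %d kB, both reading %d kB per window\n",
                subset_current / 1024, subset_proposed / 1024,
                fetch_per_cycle * window_cycles / 1024);
    return 0;
}
```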
 
Presumably it was run on GT200 h/w.

Since he said it was I would presume that is correct.

The physics simulation showed a 5x speed-up in double precision, so it might be possible to estimate the speed-up here by comparing the double-precision penalty of the two architectures.
 
Well, the problem is that AMD's current design, although better in peak performance (density) and in many image-processing workloads, is not necessarily better elsewhere. It's much easier to write and optimize for a scalar model than for a vector model.
Only as long as your algorithm has absolutely no dependency on memory performance. If there is any dependence, of course it's a matter of how much marginal improvement there is to be had in vectorising the data accesses.

In double-precision, ATI's ALUs are scalar for MULs and MADs and vec2 for ADDs. GF100 at 1.5GHz will be slower for DP-ADD than HD5870.
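Rough peak numbers behind that DP-ADD comparison, using the figures assumed in this thread plus commonly quoted HD5870 specs (treat all of them as assumptions):

```cpp
#include <cstdio>

int main() {
    // Assumed figures: HD5870 = 320 VLIW units (1600 SPs / 5) at 850 MHz,
    // retiring a vec2 DP ADD per unit per clock; GF100 = 512 SP cores at a
    // speculated 1.5 GHz hot clock, with DP at half the SP rate.
    const double hd5870_units     = 320;
    const double hd5870_clock_ghz = 0.85;
    const double hd5870_dp_add    = hd5870_units * 2 * hd5870_clock_ghz;   // ~544 Gops/s

    const double gf100_cores      = 512;
    const double gf100_clock_ghz  = 1.5;
    const double gf100_dp_add     = gf100_cores * 0.5 * gf100_clock_ghz;   // ~384 Gops/s

    std::printf("HD5870 peak DP-ADD: %.0f Gops/s\n", hd5870_dp_add);
    std::printf("GF100  peak DP-ADD: %.0f Gops/s\n", gf100_dp_add);
    return 0;
}
```

On those assumptions the HD5870 does indeed come out ahead for pure DP adds, which is the point being made.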

In the end, it's possible that many GPGPU programs may run faster on NVIDIA's GPU than on AMD's, even though AMD's GPU has higher peak performance.
Yes, you can always wait a year or two for NVidia's performance to catch up; in the meantime you've got a cosy coding environment and all the other niceties of NVidia's solution. And with CUDA specifically (in theory, anyway), NVidia is going places that AMD won't be bothered about for a couple of years, and there's a decent chance those things will improve performance by enabling better algorithms, so your loss is likely lower if your problem is at all complex.

Though right now I'm hard-pressed to name anything in Fermi that makes for better performance because it allows for more advanced algorithms (that's partly because I don't know if AMD has done lots of compute-specific tweaks; there are only hints, and D3D11/OpenCL leave plenty of room). Gotta wait and see.

If this turns out to be true, there's no reason why NVIDIA should go AMD's route. Of course, AMD may try to go NVIDIA's route, but even so they are not going to have very similar architectures, at least in the near future.

Broad comparison of compute at the core level:
  • ATI (mostly ignoring control flow processor and high level command processor)
    • thread size 64
    • in-order issue 5-way VLIW
    • slow double-precision
    • "statically" allocated register file, with spill (very slow spill?) and strand-shared registers
    • large register file (256KB) + minimal shared memory (32KB) + small read-only L1 (8KB?)
    • high complexity register file accesses (simultaneous ALU, TU and DMA access), coupled with in-pipe registers
    • separate DMA in/out of registers instead of load-store addresses in instructions
    • stack-based predication (limited capacity of 32) for stall-less control flow (zero-overhead)
    • static calls, restricted recursion
    • 128 threads in flight
    • 8 (?) kernels
  • Intel (ignoring the scalar x86 part of the core)
    • thread size 16
    • in-order purely scalar-issue (no separate transcendental unit - but RCP, LOG2, EXP2 instructions)
    • half-throughput double-precision
    • entirely transient register file
    • small register file, large cache (256KB L2 + 32KB L1) (+ separate texture cache inaccessible by core), no dedicated shared memory
    • medium complexity register file (3 operands fetch, 1 resultant store)
    • branch prediction coupled with 16 predicate registers (zero-overhead apart from mis-predictions)
    • dynamic calls, arbitrary recursion
    • 4 threads in flight
    • 4 kernels
  • NVidia (unknown internal processor hierarchy)
    • thread size of 32
    • in-order superscalar issue across three-SIMD vector unit: 2x SP-MAD + special function unit (not "multi-function interpolator")
    • half-throughput double-precision
    • "statically" allocated register file, with spill (fast, cached?)
    • medium register file + medium-sized multi-functional cache/shared-memory
    • super-scalar register file accesses (for ALUs; TUs too?)
    • predicate-based stall-less branching (with dedicated branch evaluation?)
    • dynamic calls, arbitrary recursion
    • 32 threads in flight
    • 1 kernel
I don't think it's been commented on explicitly so far in this thread, but NVidia has got rid of the out-of-order instruction despatch, which scoreboarded each instruction (to assess dependency on prior instructions). Now NVidia is scoreboarding threads, which should save a few transistors, as operand-readiness is evaluated per thread and instructions are issued purely sequentially.
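A toy model of the difference (the details here are guesses for illustration, not NVidia's mechanism): instead of tracking operand readiness per in-flight instruction, the scheduler keeps one outstanding-dependency counter per warp and only ever asks whether a warp's next instruction, taken strictly in program order, can issue.

```cpp
#include <cstdio>
#include <queue>
#include <utility>
#include <vector>

// "Scoreboard the thread, not the instruction": one readiness counter per
// warp, instructions issued strictly in program order within a warp.
struct Warp {
    int pc = 0;
    int instructions = 8;
    int outstanding = 0;    // pending long-latency results for this warp
};

int main() {
    std::vector<Warp> warps(4);
    const int LOAD_LATENCY = 6;                    // assumed, in scheduler steps
    std::queue<std::pair<int, int>> completions;   // (finish cycle, warp id)

    for (int cycle = 0; cycle < 120; ++cycle) {
        // Retire loads whose latency has elapsed.
        while (!completions.empty() && completions.front().first <= cycle) {
            warps[completions.front().second].outstanding--;
            completions.pop();
        }
        // Issue from the first warp whose next in-order instruction is ready.
        for (std::size_t w = 0; w < warps.size(); ++w) {
            Warp& warp = warps[w];
            if (warp.pc >= warp.instructions || warp.outstanding > 0) continue;
            bool is_load = (warp.pc % 3 == 0);     // made-up instruction mix
            if (is_load) {
                warp.outstanding++;
                completions.push({cycle + LOAD_LATENCY, (int)w});
            }
            std::printf("cycle %3d: warp %zu issues instr %d%s\n",
                        cycle, w, warp.pc, is_load ? " (load)" : "");
            warp.pc++;
            break;                                  // one issue slot per cycle
        }
    }
    return 0;
}
```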

Jawed
 