NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    :lol:

    Damn, must make a plea to the planners and engineers to put a "cookie monster" in our ASIC's! :D
     
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
I've already asked Rys where the 16 pixels/clock address & setup per SM come from, but I'm still waiting for his answer. I don't think the 16 load/store units have anything to do with it.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    http://www.nvidia.com/object/pr_oakridge_093009.html

    Groovy.

    Jawed
     
  4. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    No, no.. the right words are: give me a wallpaper sized RV870 die shot, now!!!1!1one :twisted:

    On topic: Fermi board snapped ...sort of.
     
    #24 fellix, Sep 30, 2009
    Last edited by a moderator: Sep 30, 2009
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
The design put forth really picks off some low-hanging fruit that earlier Nvidia GPUs (and others) lacked:
multiple concurrent kernels, the closed/semi-closed write/read loop, half-rate DP throughput.

    The other stuff is downright crazy to see: indirection, exceptions, IEEE compliance, ECC, simplified addressing.

The mapping of separate memory spaces to lie within the global address space is an elegant way to get the benefits of hardware peculiarity in a more specialized instance without having it impinge on the general computation case.

I had sort of thought of a design using special page table bits that would allow hardware to route to special on-chip storage if enabled, and to easily forget it if not.
This isn't quite the same, but the idea of using the target memory location to delineate the special things you want done with it is a rather nice touch.
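The idea described above, letting the target address itself select the special behaviour rather than using special opcodes, can be sketched in a few lines. This is a toy model, not Fermi's actual decode logic; the window bases and sizes below are invented for illustration:

```python
# Toy sketch: route an access purely from where the address falls in one
# flat address space. All constants here are made up, not Fermi's.

SHARED_BASE, SHARED_SIZE = 0x0100_0000, 48 * 1024   # hypothetical on-chip window
LOCAL_BASE,  LOCAL_SIZE  = 0x0200_0000, 512 * 1024  # hypothetical per-thread window

def route(addr):
    """Pick a backing store from the address alone -- no cache hints or
    separate load/store instructions needed, which is the point above."""
    if SHARED_BASE <= addr < SHARED_BASE + SHARED_SIZE:
        return "on-chip shared"
    if LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE:
        return "local"
    return "global DRAM"

print(route(0x0100_0010))  # inside the shared window -> "on-chip shared"
print(route(0xDEAD_BEEF))  # anywhere else -> "global DRAM"
```

The nice property is that ordinary pointer-chasing code works unchanged; only the allocation decides which storage services the access.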

    The size of the chip shows the price of generality, though. FLOP density is not likely to be anywhere near Cypress, and I'd be curious to know if Larrabee's final clocks will mean even the x86 will have an advantage.

    I don't know how it will fare in gaming, or how many other problems there may be, but I have to give Nvidia credit: this design took balls.

    As far as DP is concerned, the quality of this implementation is enough to make Cypress appear as useful as its botanical namesake in HPC.

Physical and economic realities may yet intrude on all this (it doesn't exist on a store shelf), but as a topic of discussion I find this architecture much more interesting.
    The posited tool sets and initiatives are such that this is the first time I've ever thought a GPU designer took serious computation seriously.
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
I am waiting to see how many people will start to complain about the introduction of some sort of semi-coherent cache.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
You mean, 'cos coherency was too damn difficult? :lol: Do we have to wait for Einstein for full coherency?

    Jawed
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I'm curious about the bandwidth available for transfers through the L2.
    Larrabee has the ring bus, while Cypress has that read-only crossbar.

The neat part is that it is fully possible to write code that writes to non-coherent space.
No cache hints, no separate load instructions, just the allocation in shared memory.

It can be done either way. It seems to be the best of both worlds (coherence is much more relaxed than Larrabee's, though stricter than Cypress's nothing).
I wonder what kind of overheads are involved, and what protocol is used.
     
  9. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
Oh, I obviously meant 'execution' as in 'execution units'; the scheduling hardware is still very much there and busy. If you do need a large amount of both FP (especially @DP) and cheap INT work, then the 'total' overhead is much larger than it 'needs' to be, but not too awful (I'll admit to not fully knowing what the branching hardware can do on its own, though, if much of anything). This is not an unusual case, although as I said it is still (less) relevant to the many cases where you've got more MULs/ADDs than MADDs.

The decision to run cheaper operations faster must obviously be weighed against the cost of higher instruction issue and, critically, virtual RF ports. Given the many key architectural details I hadn't been expecting (exceptions, SP denorms, and the list goes on and on), it's very clear even to me that this approach makes good sense (especially given the usual workloads). In other architectures (one example idea: a 3-way VLIW that shares 6 RF ports), making cheaper operations faster would obviously require less extra overhead and would make more sense. You just can't have your cake and eat it too.
     
  10. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    You need to give the Nvidia engineers a little more credit :)
     
  11. compres

    Regular

    Joined:
    Jun 16, 2003
    Messages:
    553
    Likes Received:
    3
    Location:
    Germany
So it will be something like 10 times Jaguar or Roadrunner, I guess.

How many GPUs roughly would that be? 10 PFLOPS with maybe around 5,000 nodes?

Edit: I was discussing this a couple of months ago, because the NEC supercomputers seemed low in FLOPS when compared to GPUs, and the issue of available bandwidth between nodes (in the GPU case, whereas the NEC machines have plenty) was big for some applications. I hope they solve this problem in some interesting way with this new super, unless they specialize in very parallel workloads only.
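The "10 PFLOPS, maybe ~5,000 nodes" guess can be sanity-checked with back-of-envelope arithmetic. The per-GPU figure below is an assumption, not a published Fermi spec (512 ALUs, ~1.3 GHz shader clock, FMA counted as 2 flops, halved for double precision):

```python
# Back-of-envelope: how many GPUs for a 10 PFLOPS (double precision) machine?
# All hardware numbers here are assumptions for illustration.

alus = 512
clock_hz = 1.3e9                              # assumed shader clock
dp_flops_per_gpu = alus * 2 * clock_hz / 2    # half-rate DP ~ 0.67 TFLOPS

target = 10e15                                # 10 PFLOPS
gpus_needed = target / dp_flops_per_gpu

print(f"{dp_flops_per_gpu / 1e12:.2f} DP TFLOPS per GPU")
print(f"~{gpus_needed:.0f} GPUs for 10 PFLOPS")
```

Under these assumptions that's roughly 15,000 GPUs, which lines up with ~5,000 nodes only if each node carries about three of them.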
     
  12. Fusion

    Newcomer

    Joined:
    Mar 28, 2009
    Messages:
    29
    Likes Received:
    0
  13. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Are the transistors really separated like that though? I would have thought that the execution units weren't quite as discrete as shown in the block diagram. int math and mantissa fp math share a lot in common, no?
     
  14. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Wait, are you implying there's something I'm missing about the architecture? :) I assume you can't say, but if not and you're just saying I should be more enthusiastic, then don't get me wrong! This is a very very impressive solution for HPC, and from that point of view it's also a very exciting architecture with lots of nice things. The dual-scheduler approach isn't what I was expecting but it's definitely elegant. All this doesn't mean it's the best architecture for all possible purposes (nothing could ever be) and I was just pointing out one potential case where its weaknesses might be especially pronounced *if* I understood the architecture correctly. Here's hoping I didn't...

    dnavas: I don't know if they're separate like that, but one GT200 diagram at the Tesla Editor Day clearly indicated separate INT units and then an engineer told me outright that was only marketing when I asked. Maybe it's the same this time around, or maybe it isn't. Heck, maybe that's what Bob is implying here! (i.e. there are cases where the units can actually both be used at the same time)
     
  15. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,610
    Likes Received:
    825
The L2 is coherent with itself; there is a single L2 slice for each memory bus ... there can only ever be one copy of a line.

That's not a cache coherency scheme, that is simply caching.
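The point that one L2 slice per memory partition leaves nothing to keep coherent can be illustrated with a simple address-interleaving sketch. The line size and slice count are assumptions for illustration, not disclosed Fermi parameters:

```python
# Each cache line maps to exactly one L2 slice, so no second copy can
# ever exist -- caching, not cache coherency. Constants are assumed.

LINE_BYTES = 128   # assumed cache-line size
NUM_SLICES = 6     # assumed: one L2 slice per memory partition

def slice_for(addr):
    """Deterministic address-to-slice mapping: every SM's access to a
    given line lands on the same, single slice."""
    return (addr // LINE_BYTES) % NUM_SLICES

# Two accesses to the same line, from any two SMs, hit the same slice:
print(slice_for(0x1000))      # -> 2 with these constants
print(slice_for(0x1000 + 4))  # same line, same slice -> 2
```

Compare a CPU-style private-cache hierarchy, where the same line can live in several L1s at once and a protocol has to reconcile the copies.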
     
  16. R300King!

    Newcomer

    Joined:
    Aug 4, 2002
    Messages:
    231
    Likes Received:
    5
    Looks like a ring bus!
    *runs* :D
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    In graphics I expect the 32-bit integer units, with all those bit manipulation capabilities, will be doing texturing while the floating point units are doing shader arithmetic - unless of course you have some integer shader math to do, in which case that'll get its turn.

    Jawed
     
  18. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    I somehow have to believe that, if the architecture under very special circumstances might be able to perform simultaneous integer and floating point operations, marketing would find a way to say "3 trillion ops" rather than "1.5 trillion flops". Not that they've [missing] ever done [mul] something like that before....

    Just saying :)

    Is that really 256 TUs? 16 bilerps AND 16 fetches per? That seems somewhat insane.

    -Dave
     
  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
Hmm, since the units are now FMA, there wouldn't be a missing MUL or an underutilized MUL anymore. Right?
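A quick way to see this point in numbers: a GT200-style peak counted a co-issued MUL on top of each MAD, which real code rarely sustained, while an FMA pipe counts a flat 2 flops per clock with nothing left to strand. The unit count and clock below are placeholders, not official figures:

```python
# Peak-flops accounting: MAD + co-issued MUL vs. a single FMA pipe.
# Unit count and clock are placeholder assumptions.

alus = 512
clock_hz = 1.3e9   # placeholder shader clock

# GT200-style marketing peak: MAD (2 flops) + the "missing" MUL (1 flop)
mad_plus_mul_peak = alus * 3 * clock_hz
# Fermi-style: one FMA per ALU per clock, always 2 flops when issued
fma_peak = alus * 2 * clock_hz

print(f"MAD+MUL peak: {mad_plus_mul_peak / 1e12:.2f} TFLOPS (co-issue dependent)")
print(f"FMA peak:     {fma_peak / 1e12:.2f} TFLOPS (no separate MUL to strand)")
```

The headline number drops, but the achievable fraction of it no longer depends on finding an independent MUL to pair with every MAD.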
     
  20. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
Does D3D11 compliance require hardware support? Sorry for the dumb question, Rys. :razz:
     
