NVIDIA Fermi: Architecture discussion

Anyone seen any TMUs? :p

Jawed

I've already asked Rys where the 16 pixels/clock of address & setup per SM come from, but I'm still waiting for his answer. I don't think the 16 load/store units have anything to do with it.
 
http://www.nvidia.com/object/pr_oakridge_093009.html
http://www.nvidia.com/object/io_1254288141829.html

SANTA CLARA, Calif. —Sep. 30, 2009—Oak Ridge National Laboratory (ORNL) announced plans today for a new supercomputer that will use NVIDIA®’s next generation CUDA™ GPU architecture, codenamed “Fermi”. Used to pursue research in areas such as energy and climate change, ORNL’s supercomputer is expected to be 10-times more powerful than today’s fastest supercomputer.

Jeff Nichols, ORNL associate lab director for Computing and Computational Sciences, joined NVIDIA co-founder and CEO Jen-Hsun Huang on stage during his keynote at NVIDIA’s GPU Technology Conference. He told the audience of 1,400 researchers and developers that “Fermi” would enable substantial scientific breakthroughs that would be impossible without the new technology.

“This would be the first co-processing architecture that Oak Ridge has deployed for open science, and we are extremely excited about the opportunities it creates to solve huge scientific challenges,” Nichols said. “With the help of NVIDIA technology, Oak Ridge proposes to create a computing platform that will deliver exascale computing within ten years.”

ORNL also announced it will be creating the Hybrid Multicore Consortium. The goals of this consortium are to work with the developers of major scientific codes to prepare those applications to run on the next generation of supercomputers built using GPUs.

“The first two generations of the CUDA GPU architecture enabled NVIDIA to make real in-roads into the scientific computing space, delivering dramatic performance increases across a broad spectrum of applications,” said Bill Dally, chief scientist at NVIDIA. “The ‘Fermi’ architecture is a true engine of science and with the support of national research facilities such as ORNL, the possibilities are endless.”
Groovy.

Jawed
 
The design put forth really goes after some low-hanging fruit that earlier Nvidia GPUs (and others) left unpicked:
the multiple kernels, the closed/semi-closed write/read loop, 1/2-rate DP throughput.
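
To make the "multiple kernels" point concrete, here's a minimal sketch of what it could look like from the CUDA side, assuming it gets exposed through the existing stream API; the kernels and sizes are invented, and whether the two launches actually overlap is entirely up to the hardware.

[code]
// Two independent kernels in two streams; on earlier parts these simply
// serialize, and the point of the feature is that they no longer have to.
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void kernelB(float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = sqrtf(y[i]);
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    kernelA<<<n / 256, 256, 0, s0>>>(x, n);   // independent work...
    kernelB<<<n / 256, 256, 0, s1>>>(y, n);   // ...in an independent stream

    cudaThreadSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
[/code]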

The other stuff is downright crazy to see: indirection, exceptions, IEEE compliance, ECC, simplified addressing.

The mapping of separate memory spaces to lie within the global address space is an elegant way to have the benefits of hardware peculiarity in a more specialized instance without having it impinge on the general computation case.

I had sort of thought of a design using special page table bits that would allow hardware to route accesses to special on-chip storage if enabled, and to ignore it if not.
This isn't quite the same, but the idea of using the target memory location to delineate the special things you want done with it is a rather nice touch.
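
A toy sketch of the idea, assuming the unified addressing shows up in CUDA roughly the way the description implies (the helper and buffer names are invented, and it assumes a 256-thread block with at least 256 floats behind gmem): the same pointer-walking code works whether the address lands in the shared-memory window or in global memory, because the space is inferred from the address rather than baked into the instruction.

[code]
// One generic load path; the hardware routes to the right storage based on
// where the address falls in the unified map.
__device__ float sum4(const float *p)      // p may point to shared OR global
{
    return p[0] + p[1] + p[2] + p[3];
}

__global__ void demo(const float *gmem, float *out)
{
    __shared__ float smem[256];
    int i = threadIdx.x;

    smem[i] = gmem[i];                     // stage a tile in on-chip storage
    __syncthreads();

    // Same helper, two different memory spaces behind the pointer.
    float a = sum4(&smem[(i / 4) * 4]);    // shared-memory address
    float b = sum4(&gmem[(i / 4) * 4]);    // global-memory address
    out[i] = a + b;
}
[/code]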

The size of the chip shows the price of generality, though. FLOP density is not likely to be anywhere near Cypress, and I'd be curious to know if Larrabee's final clocks will mean even the x86 will have an advantage.

I don't know how it will fare in gaming, or how many other problems there may be, but I have to give Nvidia credit: this design took balls.

As far as DP is concerned, the quality of this implementation is enough to make Cypress appear as useful as its botanical namesake in HPC.

There are physical and economic realities that may intrude on this (it doesn't exist on a store shelf yet), but as a topic of discussion I find this architecture much more interesting.
The posited tool sets and initiatives are such that this is the first time I've ever thought a GPU designer took serious computation seriously.
 
I am waiting to see how many ppl will start to complain about the introduction of some sort of a semi-coherent cache.
 
You mean, cos coherency was too damn difficult :LOL: We have to wait for Einstein for full coherency?

Jawed
 
I'm curious about the bandwidth available for transfers through the L2.
Larrabee has the ring bus, while Cypress has that read-only crossbar.

The neat part is that it is fully possible to write code that can write to non-coherent space.
No cache hints, no separate load instructions, just the allocation for shared memory.

It can be done either way. It seems to be the best of both worlds (the coherence is much more relaxed than Larrabee's, though stricter than Cypress' nothing).
I wonder what kind of overheads are involved, and what protocol is used.
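
For what it's worth, this is the kind of pattern I have in mind when I call it "relaxed but usable": purely a guess at the programming model rather than the protocol, with invented names, and it assumes both blocks are resident on the chip at the same time (a real caveat for spinning).

[code]
// One block publishes a value through the L2/global space, another consumes
// it; __threadfence() orders the data write before the flag write.
__device__ volatile int   flag  = 0;
__device__ volatile float value = 0.0f;

__global__ void producer_consumer(float *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        value = 42.0f;        // write lands in the shared L2 / DRAM
        __threadfence();      // make it visible device-wide first
        flag = 1;             // then publish
    }
    if (blockIdx.x == 1 && threadIdx.x == 0) {
        while (flag == 0)     // spin until published
            ;
        out[0] = value;       // observes 42.0f thanks to the fence ordering
    }
}
[/code]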
 
Oh, I obviously meant 'execution' as in 'execution units'; the scheduling hardware is still very much there and busy. If you do need a large amount of both FP (especially @DP) and cheap INT stuff, then the 'total' overhead is much larger than it 'needs' to be, but not too awful (I'll admit to not fully knowing what the branching hardware can do on its own, though, if much of anything). This is not a usual case, although as I said it is still (if less) relevant to the many cases where you've got more MULs/ADDs than MADDs.

The decision to support cheaper operations at a faster rate must obviously be based on the cost of higher instruction issue and, critically, virtual RF ports. Given many of the key architectural details I hadn't been expecting (exceptions, SP denorms, and the list goes on and on), it's very clear even to me that this approach makes good sense (especially given the usual workloads). In other architectures (one example idea: a 3-way VLIW that shares 6 RF ports), making cheaper operations faster would obviously require less extra overhead and would make more sense. You just can't have your cake and eat it too.
 

So it will be something like 10 times Jaguar or Roadrunner, I guess.

Roughly how many GPUs would that be? 10 PFLOPS, maybe around 5000 nodes?

Edit: I was discussing this a couple of months ago, because the NEC supercomputers seemed low in flops compared to GPUs, and the issue of available BW between nodes (in the GPU case; the NEC machines have plenty) was a big one for some applications. I hope they solve this problem in some interesting way with this new super, unless it will specialize in very parallel workloads only.
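
Back-of-envelope only, since none of the inputs are disclosed: 512 cores, an assumed ~1.5 GHz hot clock, 2 flops per core per clock for an FMA, and the promised 1/2-rate DP. It just makes the "10 PFLOPS, how many boards?" guess explicit so the numbers can be argued about.

[code]
#include <cstdio>

int main()
{
    const double target_pflops   = 10.0;   // the "10x today's fastest" figure
    const double cores_per_gpu   = 512;    // Fermi's stated core count
    const double assumed_clk_ghz = 1.5;    // assumed hot clock - not confirmed
    const double flops_per_core  = 2;      // one FMA per core per clock

    double sp_tflops = cores_per_gpu * assumed_clk_ghz * flops_per_core / 1e3;
    double dp_tflops = sp_tflops / 2;      // the promised 1/2-rate DP

    printf("assumed per-GPU peak: %.2f TFLOPS SP, %.2f TFLOPS DP\n",
           sp_tflops, dp_tflops);
    printf("GPUs for %.0f PFLOPS SP: ~%.0f\n",
           target_pflops, target_pflops * 1e3 / sp_tflops);
    printf("GPUs for %.0f PFLOPS DP: ~%.0f\n",
           target_pflops, target_pflops * 1e3 / dp_tflops);
    return 0;
}
[/code]

With those guesses it lands somewhere in the 6,500 to 13,000 board range for 10 PFLOPS peak, so the "~5000 nodes" figure only works out with more than one GPU per node, higher clocks, or the host CPUs counted in.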
 
 
Oh, I obviously meant 'execution' as in 'execution units';

Are the transistors really separated like that, though? I would have thought that the execution units weren't quite as discrete as shown in the block diagram. int math and mantissa fp math have a lot in common, no?
 
You need to give the Nvidia engineers a little more credit :)
Wait, are you implying there's something I'm missing about the architecture? :) I assume you can't say, but if not and you're just saying I should be more enthusiastic, then don't get me wrong! This is a very very impressive solution for HPC, and from that point of view it's also a very exciting architecture with lots of nice things. The dual-scheduler approach isn't what I was expecting but it's definitely elegant. All this doesn't mean it's the best architecture for all possible purposes (nothing could ever be) and I was just pointing out one potential case where its weaknesses might be especially pronounced *if* I understood the architecture correctly. Here's hoping I didn't...

dnavas: I don't know if they're separate like that, but one GT200 diagram at the Tesla Editor Day clearly indicated separate INT units and then an engineer told me outright that was only marketing when I asked. Maybe it's the same this time around, or maybe it isn't. Heck, maybe that's what Bob is implying here! (i.e. there are cases where the units can actually both be used at the same time)
 
I am waiting to see how many ppl will start to complain about the introduction of some sort of a semi-coherent cache.
L2 is coherent with itself, there is a single L2 for each memory bus ... there can only ever be one copy.

That's not a cache coherency scheme, that is simply caching.
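
A hypothetical sketch of why that needs no protocol (the slice count, line size and hash here are made up; the real mapping is undisclosed): if every line address maps to exactly one L2 slice sitting in front of one memory controller, a line only ever has one home, so every SM that touches it talks to the same copy.

[code]
#include <cstdint>
#include <cstdio>

const int NUM_SLICES = 6;     // e.g. one slice per memory channel (assumed)
const int LINE_BYTES = 128;   // assumed cache-line size

// Simple interleave on the line address; whatever the real hash is, the key
// property is that it is a pure function of the address.
int l2_slice_for(uint64_t addr)
{
    return (int)((addr / LINE_BYTES) % NUM_SLICES);
}

int main()
{
    uint64_t addr = 0x12345680;
    printf("address 0x%llx always lives in slice %d\n",
           (unsigned long long)addr, l2_slice_for(addr));
    return 0;
}
[/code]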
 
Heck, maybe that's what Bob is implying here! (i.e. there are cases where the units can actually both be used at the same time)
In graphics I expect the 32-bit integer units, with all those bit manipulation capabilities, will be doing texturing while the floating point units are doing shader arithmetic - unless of course you have some integer shader math to do, in which case that'll get its turn.
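
Purely to illustrate the kind of instruction mix I mean (the kernel is invented, and there's no claim the hardware actually co-issues these): integer address and coordinate work sitting next to floating-point shading arithmetic in the same warp is exactly what separate INT and FP pipes could, in principle, overlap.

[code]
// Integer side computes texel addresses (the "texturing"-style work);
// FP side does the filtering / shading arithmetic.
// Assumes n == tex_w * tex_h and texels holds n floats.
__global__ void shade(const float *texels, int tex_w, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // INT: form the addresses of a clamped 2-texel footprint.
    int u  = i % tex_w;
    int v  = i / tex_w;
    int u1 = (u + 1 < tex_w) ? u + 1 : tex_w - 1;
    int a0 = v * tex_w + u;
    int a1 = v * tex_w + u1;

    // FP: average the texels and do a bit of shading math.
    float c = 0.5f * (texels[a0] + texels[a1]);
    out[i] = c * c + 0.25f;
}
[/code]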

Jawed
 
dnavas: I don't know if they're separate like that, but one GT200 diagram at the Tesla Editor Day clearly indicated separate INT units and then an engineer told me outright that was only marketing when I asked. Maybe it's the same this time around, or maybe it isn't. Heck, maybe that's what Bob is implying here! (i.e. there are cases where the units can actually both be used at the same time)

I somehow have to believe that, if the architecture under very special circumstances might be able to perform simultaneous integer and floating point operations, marketing would find a way to say "3 trillion ops" rather than "1.5 trillion flops". Not that they've [missing] ever done [mul] something like that before....

Just saying :)

Is that really 256 TUs? 16 bilerps AND 16 fetches per? That seems somewhat insane.

-Dave
 