NVIDIA Fermi: Architecture discussion

Weird. It's almost like they're trying to convince, I dunno, some random manufacturer that their architecture is worth buying into. It doesn't seem at all to me that the gaming market was their target. And a number of gamers on here seem to have expressed a similar dissatisfaction.

I'm not sure how much more explicit Nvidia could have been. Their message is openly targeted at a different audience. It was as far from a graphics card launch as you could get. Now that could simply be because they had nothing to spoil with.

But really, some people are way too narrow-minded. To them, launching a product means benchmarks + ooh looky, more frames. That's pretty much all we got with RV870, and that's perfectly fine of course. But what they're trying to do here is get people to buy into a new way of doing things, and to get the right people to take a look at GPUs as an option for HPC. You don't do that by throwing up framerate graphs.

If gamers are disappointed for now, so be it. AMD's new stuff has been out for less than a week and Nvidia's imminent demise is being heralded from rooftops everywhere. I would like a new graphics card too but they aren't talking to me right now. As always, people should spend their money based on the products currently available on the market. HD 58xx is a great product that should satisfy the needs of anyone looking for a new card right now.
 
You only have to look at the enthusiast sites to see that for all the people dismissing Fermi, there are just as many who want to sell their 5870s in favour of GF100, even though there's nothing to buy yet, and who knows what the situation will be in another six months.

Really? I'm seeing the exact opposite.
 
And if those aspirations are real (i.e. Nvidia is planning to launch their new GPU in Q1 2010, with the architecture they describe), I don't see what the problem is.

Nvidia are being honest about their plans and people can respond as they wish.

Aspirations are not necessarily what can be delivered. We've seen that plenty of times before, but rarely do companies blurt out all this info so far in advance, for fear that they can't reach their aspirations six months down the line and will get crucified for it. It even cannibalises their own sales, but it may be more important to cannibalise your opponent's sales if he's taking your sales anyway.

Maybe I'm making a mistake thinking about it in terms of the graphics card market, when it seems Nvidia are going the GPGPU way instead, and trying to make their own market.
 
You don't think so? Frankly, by all paper specs it should be faster. It should have more TMUs (128?), and while it still has a peak arithmetic deficit compared to AMD's part, the gap should be a bit smaller (if Nvidia reaches clock speeds similar to GT200's), plus the improvements should in theory help it achieve higher ALU utilization. It also has a ~50% bandwidth advantage. Granted, there are some open questions (as far as I can tell, the total special-function peak rate hardly moved compared to GT200 - 64 vs 60 units, for instance - but I'd guess it's still mostly enough), but if this thing can't beat an HD5870 in games, Nvidia shouldn't have bothered putting TMUs/setup etc. on there at all and should just sell it as a Tesla card only... About the only reason I can think of why the HD5870 would beat it in games is that it could be severely setup/rasterization limited; if that didn't improve (especially given the seemingly compute-centric nature), then the HD5870 might beat it in some situations simply because it runs at a higher (core) clock and hence achieves higher triangle throughput.
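
For what it's worth, the ~50% bandwidth figure is easy to sanity-check. A back-of-envelope comparison, assuming GF100 keeps its 384-bit GDDR5 bus at roughly the HD5870's per-pin data rate (the actual memory clocks are unannounced, so this is purely illustrative):

Code:
// Back-of-envelope bandwidth comparison; all data rates are assumptions.
#include <cstdio>

int main() {
    // HD5870: 256-bit bus, GDDR5 at ~4.8 Gbps per pin.
    double hd5870 = 256 / 8.0 * 4.8;   // ~153.6 GB/s
    // Hypothetical GF100: 384-bit bus at the same per-pin rate.
    double gf100  = 384 / 8.0 * 4.8;   // ~230.4 GB/s
    printf("HD5870 %.1f GB/s, GF100 (assumed) %.1f GB/s, ratio %.2fx\n",
           hd5870, gf100, gf100 / hd5870);   // ratio ~1.50x
    return 0;
}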

128 TMUs, 48 ROPs.

The 5870 won't beat it, but a process-mature RV870 could see higher-clocked die candidates appear in OC models, basically nullifying its performance leadership in games, at a lower price.
 
Aspirations are not necessarily what can be delivered. We've seen that plenty of times before, but rarely do companies blurt out all this info so far in advance, for fear that they can't reach their aspirations six months down the line and will get crucified for it. It even cannibalises their own sales, but it may be more important to cannibalise your opponent's sales if he's taking your sales anyway.

Maybe I'm making a mistake thinking about it in terms of the graphics card market, when it seems Nvidia are going the GPGPU way instead, and trying to make their own market.


How do you know it's so far away?

And this card seems like it will do just fine in graphics; I don't see how anyone can think it won't, at least not yet.
 
128 TMUs, 48 ROPs.

The 5870 won't beat it, but a process-mature RV870 could see higher-clocked die candidates appear in OC models, basically nullifying its performance leadership in games, at a lower price.


We don't know what the performance of this card is, let alone what clocks AMD will be able to reach on an OC'd HD5870...
 
If the L2 cache is divided per memory controller such that each section covers the memory space covered by its linked controller, it does open up questions about contention.
Even if no real sharing occurs, happening to write the same portion of the address space could lead to many writes to the same L2 section.
It might inject some additional latency, and the program will need to try to keep accesses spread out to prevent internal bandwidth from going down to 1/6 peak in certain cases.
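
A minimal sketch of the concern, with a made-up address-to-slice mapping (the real GF100 hash, stripe size and even the per-controller split are unconfirmed):

Code:
// Hypothetical mapping: one L2 slice per memory controller, addresses
// striped across the six slices at 128-byte granularity.
const int kNumSlices   = 6;
const int kStripeBytes = 128;

int SliceForAddress(unsigned long long addr) {
    return (int)((addr / kStripeBytes) % kNumSlices);
}

// Under this mapping, a kernel whose accesses happen to stride by
// kStripeBytes * kNumSlices (768 bytes here) lands every request on the
// same slice: that slice serializes the traffic and effective L2 bandwidth
// drops toward 1/6 of peak - exactly the case the program would need to
// avoid by keeping its accesses spread out.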

Visually, it's always struck me how different the caches appear in die shots between CPUs and GPUs.
CPU caches are very regular and grid-like. On-board SRAM on GPUs always looks so chaotic.
Is it a difference in how they take pictures, a difference in the amount of porting for those caches?
 
How do you know it's so far away?

You think it's going to be out by March, when they didn't even have any working engineering samples to show? A couple of months for a respin, and if that works and gets yields to sensible levels, another few for stock means they'll be lucky to actually launch by Easter, unless you're talking about very limited press editions for the beginning of the year.

And this card seems like it will do just fine in graphics; I don't see how anyone can think it won't, at least not yet.

There's no reason to think it will until we see that Nvidia can build it and make it a viable product.
 
Really? I'm seeing the exact opposite.

I'm seeing tech sites saying that Nvidia is ceding the gaming business to AMD as well. Personally I think they are trying to provoke a response so they can get a scoop.
Aspirations are not necessarily what can be delivered. We've seen that plenty of times before, but rarely do companies blurt out all this info so far in advance, for fear that they can't reach their aspirations six months down the line and will get crucified for it. It even cannibalises their own sales, but it may be more important to cannibalise your opponent's sales if he's taking your sales anyway.

Maybe I'm making a mistake thinking about it in terms of the graphics card market, when it seems Nvidia are going the GPGPU way instead, and trying to make their own market.
Yeah, it seems I heard something about some Intel chip a while ago.
 
Visually, it's always struck me how different the caches appear in die shots between CPUs and GPUs.
CPU caches are very regular and grid-like. On-board SRAM on GPUs always looks so chaotic.
Is it a difference in how they take pictures, a difference in the amount of porting for those caches?
A rather sparse set of SRAM cells and arrays all over the place. Not that 768K is big enough to stand out in a sea of 3 billion transistors. ;)
 
On-board SRAM on GPUs always looks so chaotic.
Is it a difference in how they take pictures, a difference in the amount of porting for those caches?

A 64k cache on 40nm is pretty small - ~3E6 transistors, which is about 0.1% of the area of this die. Maybe the areas just aren't big enough?
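
The back-of-envelope math behind that, assuming standard 6T SRAM cells and ignoring tags, decoders and sense amps:

Code:
// Rough SRAM transistor estimates; 6T cells, array overhead ignored.
#include <cstdio>

int main() {
    double die   = 3.0e9;                  // ~3B transistor die
    double t64k  = 64.0  * 1024 * 8 * 6;   // ~3.1M transistors
    double t768k = 768.0 * 1024 * 8 * 6;   // ~37.7M transistors
    printf("64 KB:  %.1fM transistors, %.2f%% of die\n",
           t64k / 1e6,  100 * t64k / die);
    printf("768 KB: %.1fM transistors, %.2f%% of die\n",
           t768k / 1e6, 100 * t768k / die);
    return 0;
}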
 
In layman's terms:

R600 - 320SP - 720 million - 2.25 MT per SP (Added D3D10)
RV670 - 320SP - 666 million - 2.08 MT per SP (Added D3D10.1 & lowered memory bus width)
RV770 - 800SP - 956 million - 1.19 MT per SP
Cypress - 1600SP - 2.15 billion - 1.34 MT per SP (Added D3D11)

G80 - 128SP - 686 million - 5.36 MT per SP (Added D3D10)
G200 - 240SP - 1.4 billion - 5.83 MT per SP (Added CUDA)
Fermi - 512SP - 3 billion - 5.85 MT per SP (Added more CUDA + D3D11 & lowered memory bus width)
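
(The MT-per-SP figures are just transistor count divided by SP count; a trivial check using the die figures quoted above:)

Code:
// Reproduce the MT-per-SP column (transistor counts in millions).
#include <cstdio>

int main() {
    struct Chip { const char* name; double mtrans; int sps; };
    Chip chips[] = {
        {"R600", 720, 320}, {"RV670", 666, 320}, {"RV770", 956, 800},
        {"Cypress", 2150, 1600}, {"G80", 686, 128}, {"G200", 1400, 240},
        {"Fermi", 3000, 512},
    };
    for (int i = 0; i < 7; ++i)
        printf("%-8s %.2f MT per SP\n", chips[i].name,
               chips[i].mtrans / chips[i].sps);
    return 0;
}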

Since we don't have any performance numbers, purely from this perspective it doesn't look like Fermi is Nvidia's RV770, but their jump still looks more efficient.
I know I'm quoting myself, but is anyone else surprised that GF100 is still huge? Talking to Degustar & Ailuros over the past few weeks, I was leaning towards a leaner (G92-ish) GPU from Nvidia, but again we got a huge chip.

Again, the disappointment also comes from the lack of a launch date, i.e. it is still vaporware, with almost all sites claiming 2010. Huge and late is what I have a problem with.

Also, Fuad is certain that there is a dual Fermi in the product stack. I don't see anyone else mentioning this, so I guess he's getting it straight from NV PR again.
 
You work with what you've got. The 58xx launch is now, the Nvidia spoiler has to be now. Given what little Nvidia have got to show beyond slides, simulated demos and mocked-up cards, I think they did a stellar job of stealing ATI's thunder as much as they could.

You only have to look at the enthusiast sites to see that for all the people dismissing Fermi, there are just as many who want to sell their 5870s in favour of GF100, even though there's nothing to buy yet, and who knows what the situation will be in another six months.

Nvidia is emphasizing things *other* than gaming with Fermi. They really believe it is a revolution to make GPUs general-purpose parallel computing processors that also have amazing graphics, and not just ‘graphics chips’ anymore.

Jensen believes that Fermi is the foundation for this new emerging industry. He could be right. I am not so sure they really know the gaming potential of their new GPU.
 
They have value ... but per-cacheline MOESI is an extreme: in cost, in the amount of effort necessary to scale it, and in the fragility of scaled-up implementations.
Well sure, but now it's just a standard cost vs. value discussion... clearly one solution will do better in some cases and the other in others. We'll have to start discussing specific cases to make any sort of global conclusion.

I used to be staunchly anti-cache-coherence myself for the same reasons, but through experience I've learned that it's a lot fuzzier than either side of the argument would like to think.

Anyways, it will be interesting to see how much utility computing applications (including those doing graphics, of course) can get out of GF100's L2$. Hopefully it will give some of the advantages of a more predictable "rounding off" of performance, compared to typical GPUs, which are an absolute minefield of performance disasters.
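
As one example of the kind of compute pattern a read/write L2 might smooth over (the cache behaviour here is pure speculation on my part; the CUDA itself is ordinary):

Code:
// Global-memory histogram via atomics. Skewed input (many threads hitting
// the same bin) is one of those classic performance minefields; a unified
// read/write L2 that absorbs the atomic traffic could, in theory, round off
// the worst cases. Whether GF100's L2 actually behaves that way is a guess.
__global__ void histogram(const unsigned int* data, int n,
                          unsigned int* bins, int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&bins[data[i] % num_bins], 1u);  // contended when skewed
}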
 
New generations and major strategic initiatives are fun. You get a lot of people making wild predictions that you can later use against them when they turn out hilariously wrong (I'm especially thinking of the articles linked in this thread and not so much everyone's posts, don't worry). There are so many of them that I frankly can't be bothered anymore. Hey, why is everyone looking at me now? Stop it! :p
(On this note, I probably should get back to work on my Exodus project and retrain my CUDA a bit for it - I knew I was going to spend more time on the forums than I wanted to because of this; I'm so predictable!)
 
Slower, why?
We don't know how either chip performs yet...
My interpretation is that Fermi is TMU- and ROP-less. NVidia's traded them for lots of INT ALUs (512, an insane number by any standard, let's make no mistake). I don't know how many 32-bit ALU cycles on GF100 a bog standard 8-bit texture result would take through LOD/bias/addressing/decompression/filtering, so it's hard to say whether GF100 has ~2x GT200's texture rate, or 4x, etc.

But what I can say is that in dumping TMUs NVidia's effectively spent more die space on texturing (non-dedicated units) and seemingly justified this by double-precision acceleration and compute-type INT operations (neither of which is graphics - well, there'll be increasing amounts of compute in graphics as time goes by). Now you could argue that the 80 TUs in RV870 show the way, and that NVidia doesn't need to increase peak texturing rate. Fair enough. Intel, by comparison, dumped a relatively tiny unit (the rasteriser), so the effective overhead on die is small. Rasterisation rate in Larrabee, e.g. 16 colour/64 Z per clock, is hardly taxing.

So in spending so much die space on integer ops, NVidia's at a relative disadvantage in comparison with the capability of Larrabee.

Secondly, each core can only support a single kernel. This makes for much more coarsely-grained task-parallelism than seen in Larrabee. NVidia may have compensated by more efficient branching - but I'm doubtful as there's no mention so far of anything like DWF. The tweak in predicate handling is only catching up with ATI (which is stall-less) - while Intel has branch prediction too.

Cache. Well, clearly there ain't enough of it if there are no ROPs. I'd guess that NVidia's still doing colour/Z compression type stuff (and the atomics are effectively providing a portion of fixed-function ROP capability, too), so there's some kind of back-end processing for render targets, just not full ROPs.

Finally, there's that scheduler from hell, chewing through die space like there's no tomorrow.

That's my gut feel (and I'm only talking about D3D graphics performance). Of course if it does have TMUs and ROPs, then I retract a good bunch of my comments. ROPs, theoretically, should have disappeared first (compute-light) - but it's pretty much impossible to discern anything about them from the die shot. Separately, I just can't see enough stuff per core for TMUs - and that's what I'm hinging my opinion on.

But from what we do know, Intel aims at GTX285 performance, and so far it looks like Fermi is going to be considerably faster than GTX285, hence probably faster than Larrabee.
With full C++ support, I don't really think there's much that x86 can do that Fermi can't.
Broadly, anything Larrabee can do, GF100 can do too in terms of programmability. What gives me pause is that on Larrabee multiple kernels per core can work in producer/consumer fashion. In GF100 producer/consumer requires either transmitting data amongst cores or using branching within a single kernel to simulate multiple kernels. This latter technique is not efficient. (Hardware DWF would undo this argument.)
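
A minimal sketch of that branch-within-one-kernel workaround (the partitioning and names are mine, purely to illustrate the point):

Code:
// One kernel emulating producer and consumer stages by branching on block
// index. Illustrative only: if not all blocks are resident at once, the
// spinning consumers can starve the producers and deadlock - one more way
// in which this technique is fragile compared to genuinely separate
// kernels per core.
__global__ void producer_consumer(float* queue, int* ready, int n) {
    int half = gridDim.x / 2;
    if (blockIdx.x < half) {
        // Producer half: generate an element and flag it ready.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            queue[i] = i * 0.5f;          // stand-in for real work
            __threadfence();              // make the write visible device-wide
            atomicExch(&ready[i], 1);
        }
    } else {
        // Consumer half: spin until the element is ready, then use it.
        int i = (blockIdx.x - half) * blockDim.x + threadIdx.x;
        if (i < n) {
            while (atomicAdd(&ready[i], 0) == 0) { }   // busy-wait
            queue[i] *= 2.0f;             // stand-in for consumption
        }
    }
}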

I think that the more finely-grained parallelism in Larrabee (not to mention that the SIMDs are actually narrower in logical terms) will allow Intel to tune data-flows more aggressively. The large cache will also make things more graceful under strain.

Against this I see three key advantages for NVidia: it's graphics, stupid; CUDA has been a brilliant learning curve in applied throughput computing (enabling a deep understanding for the construction of a new ISA); certain corners of the architecture (memory controllers, shared memory) are home turf.

Anyway, I'm dead excited. This is more radicalism than I was hoping for from NVidia. Just a bit concerned that the compute density's not so hot.

Jawed
 