NVIDIA Fermi: Architecture discussion

That's certainly something only NVIDIA can answer, but the tape-out for the GeForce chip should be the one we've been hearing about, the one that's slated for Q1 2010.
The "bigger" version, the one for Tesla with all the features designed for the HPC market, is slated for May 2010.

I highly doubt any non-GF100 chip will be on sale in Q1 if it hasn't taped out by now, even if it's a miracle A1 tape-out (nV starts at A1, not A0, right?).

So what Wavey said previously seems to be right: the memory chip itself is delaying the Tesla SKU of Fermi, and how well the chip itself is coming along is still pretty much an enigma. Make no mistake though, the probability of the high-end GeForce (wastage and all) not using the Fermi Tesla chip is nearly 0%.
 
My understanding is that Fermi uses the same die for both high-end gaming and the compute market.

Unless someone from NV (not just degustator or chrisray) clearly states that they are using separate designs for high-end gaming and compute, I will continue to believe this.

DK
 
I don't know why you felt so compelled to include me. I never once said that it wasn't. I'm not allowed to talk about Fermi.



Hi Gktar

Nvidia's sole purpose in creating Fermi was to make anonymous forum posters angry and mad. And everything they have done since then has been done to antagonize anonymous posters on the internet.

Nvidia uses fermi to hack device IDs to prevent AMD hardware from running AA. Fermi will represent the end of AA on AMD hardware.

DirectX 11 is useless because Fermi doesn't need it. PhysX will herald the end of AMD.

Fermi doesn't actually exist. There are no Fermi cards. Just GT200 cards. But those are end of life, dontcha know?

Nvidia doesn't actually have a tessellator. In fact, Fermi will do tessellation on spare CPU cores. Get extra performance by going quad!

Nvidia is abandoning the gaming market.



*edit* Just for fun, I decided that if I'm gonna be quoted out of context, I may as well make it more interesting.
 
DegustatoR should be honest and say he's an nVidia PR boy at this point. He is always pushing for green no matter how bad they seem to be.


You can compare FLOPS from generation to generation on the same company's cards and get a pretty good idea of what the performance might be, but that doesn't work too well with Fermi.

The RV870 has two times the shader capability, but not two times the bandwidth (only 25% more), and the fillrates changed a good deal too. So yeah, it doesn't translate to two times the performance. But you still can't compare this to what Fermi is; Fermi is a different architecture than the GT200, and that is what he is getting at. Simple enough. :smile:
 
TFlops means almost nothing.
Higher FLOPS capacity isn't really going to make a card faster than another just by itself, and NVIDIA already showed that in the past. Their GTX 280 had half the single-precision FLOPS capacity of the HD 4870, yet the GTX 280 was faster. ATI itself also proved that when it launched the HD 5870, which is not much more than a single-chip HD 4870 X2 with DX11 support. Despite more than doubling the peak FLOPS capacity, the HD 5870 is only around 50-60% faster than the HD 4870.
It's pretty important for FLOPS-intensive work (e.g. matrix multiply), and NVidia is clearly trying to focus on that sort of thing with Fermi. Things like GPU folding have only been faster on NVidia hardware because ATI had some limited features (e.g. write-private, read-public shared memory) and bad language support. Now they have OpenCL and better hardware, and as we saw in that "paper dragon" PDF by ATI, RV870 outdoes Fermi in some GPGPU features/specs.
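To put rough numbers on that, here's a quick back-of-the-envelope sketch of the peak single-precision figures being compared. The ALU counts and shader clocks are the commonly quoted specs and I'm counting one MAD (2 flops) per ALU per clock, so treat these as approximations rather than official ratings:

```python
# Rough peak single-precision throughput: ALUs x shader clock x flops per clock.
# Unit counts and clocks below are the commonly quoted specs (assumptions).
def peak_gflops(alus, shader_mhz, flops_per_alu=2):  # 2 = one MAD per clock
    return alus * shader_mhz * flops_per_alu / 1000.0

cards = {
    "GTX 280": (240, 1296),   # 240 scalar ALUs @ 1296 MHz shader clock
    "HD 4870": (800, 750),    # 800 ALUs (160 VLIW5 units) @ 750 MHz
    "HD 5870": (1600, 850),   # 1600 ALUs (320 VLIW5 units) @ 850 MHz
}

for name, (alus, mhz) in cards.items():
    print(f"{name}: ~{peak_gflops(alus, mhz):.0f} GFLOPS")
# -> GTX 280 ~622, HD 4870 ~1200, HD 5870 ~2720: the GTX 280 sits at roughly
#    half the HD 4870's peak, yet it was the faster card in games.
```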

I do think the top end Fermi will be faster in games simply because it will have 128 tex units, 50% more BW, and may have faster triangle setup if they find a way to parallelize that in CUDA code. But at any given BOM, RV870 will trounce Fermi in gaming and probably GPGPU, too.

Of course market forces are going to make you pay for performance either way, so just like NVidia was selling 448-bit boards with >400 mm2 chips for under $150 while ATI used a 128-bit 5770 at the same price point, so too will Fermi be at least somewhat compelling for the end user.
 
I would blame ATI and NV, except for the minor point that ATI yields were fine, then they tanked for no particular reason. I am not sure about NV yields; I will ask next time I get the chance. It isn't a design problem at that point. :) Do you have a good explanation as to how yields that were going along fine and then suddenly tanked could be an ATI or NV issue?

-Charlie

I can't say, and I don't know, what is going on exactly. Something is definitely not adding up in the whole story. On the other hand, don't you think things are too delicate for TSMC at this point to speak up if they had their own side of the story?

I'd be more convinced that TSMC is mostly to blame when NV starts to ramp up production next year and yields there also start to fluctuate wildly per wafer.

No, from the gaming point of view, it is a lot of transistors that are not driving frame rates up. It is a huge loss from an areal efficiency point of view.

If you NEED it, you need it. If you don't, it is a millstone. I would assume that more than 99.9% of users don't NEED it.

In theory, yes. If the price/performance ratio is in the right ballpark and power consumption isn't too high, then it's more a cost equation for NV itself than something that would trouble the consumer. Assuming, of course, that all the presuppositions (price, performance, power consumption) are at competitive levels.
 
TFlops means almost nothing.

GT200 had a lot of other advantages compared to the RV770, so back then the FLOPS rating was "meaningless":

GTX 285 / HD 4870

Pixel: 20736 <-> 12000 MPix/s; 1.73:1
Texel: 51840 <-> 30000 MTex/s; 1.73:1

Therefore, back then the FLOPS-rating advantage of the RV770 was not enough to overcome all the other disadvantages.

With Fermi the situation is most likely different. If the base-MHz/shader-MHz ratio is roughly the same as with the GTX 285, then the base frequency of the Fermi GPUs is ~550 MHz (this number also comes from Ailuros):

Fermi / HD 5870

Pixel: 26400 <-> 27200 MPix/s; 0.97:1 [according to Ailuros, Fermi has 48 ROPs]
Texel: 70400 <-> 68000 MTex/s; 1.04:1 [based on the rumour that Fermi has 128 TMUs]

=> Fermi will most likely have no other advantage compared to the RV870 except bandwidth. Therefore the difference in FLOPS rating can tip the scales in RV870's favour.
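For anyone who wants to check where those figures come from, it's just units times core clock. Here's a quick sketch; the Fermi entries rest on the rumoured 48 ROPs / 128 TMUs and the ~550 MHz base-clock guess above, so they're assumptions, not confirmed specs:

```python
# Theoretical fillrates: ROPs x core MHz = MPixels/s, TMUs x core MHz = MTexels/s.
# The Fermi line uses the rumoured 48 ROPs / 128 TMUs at ~550 MHz (assumption).
cards = {
    "GTX 285":          {"rops": 32, "tmus": 80,  "mhz": 648},
    "HD 4870":          {"rops": 16, "tmus": 40,  "mhz": 750},
    "Fermi (rumoured)": {"rops": 48, "tmus": 128, "mhz": 550},
    "HD 5870":          {"rops": 32, "tmus": 80,  "mhz": 850},
}

for name, c in cards.items():
    pixel = c["rops"] * c["mhz"]   # MPixels/s
    texel = c["tmus"] * c["mhz"]   # MTexels/s
    print(f"{name}: pixel {pixel} MPix/s, texel {texel} MTex/s")
# -> reproduces the 20736/51840, 12000/30000, 26400/70400 and 27200/68000
#    figures quoted in the comparison above.
```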
 
The tessellator is an absolutely tiny transistor budget compared to the kind of stuff Nvidia are claiming for Fermi, and at least the tessellator gets used for DX11.

The bigger issue with fixed-function hardware is that it can lead to pipeline balance going out of whack, since such units have to be sized for the worst case. So either they are a bottleneck or they are wasting area and power. Even if the tessellator is very small in area per se, that doesn't mean it will be well balanced with respect to the rest of the pipeline. Think back to the unification of vertex and pixel shaders: reuse of the same resources led to higher utilization.

That massive Nvidia transistor investment in things like double precision and ECC doesn't benefit the gaming community. If you're only interested in HPC, those features are great; if you're interested in gaming, they are a big fat albatross around your neck.

The way the ALUs in Fermi are set up, DP isn't a big investment of area, as Jawed has already pointed out. I don't know about ECC. To be honest, I don't see the point of full-speed denormal handling in hardware. Denormals popping up in a calculation mean a serious error/uninitialized value somewhere in your code. Though it may have been done just to avoid having to pay a SIMD divergence penalty for handling them in software.
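As a minimal illustration of why denormals are usually a red flag rather than something worth fast hardware: they only appear once a result has underflowed below the smallest normal float. The starting value below is just for demonstration:

```python
import numpy as np

# Repeatedly halving a small single-precision value: once it drops below the
# smallest normal float32 (np.finfo(np.float32).tiny, ~1.18e-38) the results
# become denormal/subnormal before finally flushing to zero.
smallest_normal = np.finfo(np.float32).tiny
x = np.float32(1e-37)
while x > 0:
    kind = "denormal" if x < smallest_normal else "normal"
    print(f"{x:.6e}  ({kind})")
    x = np.float32(x / 2)
# Values this small showing up mid-calculation usually point to underflow or
# an uninitialized input rather than meaningful data.
```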

Unless, of course, Nvidia is going to spend time and money designing a whole new chip that's effectively a cut-down Fermi design with massive fundamental changes.
That's not going to happen. Period.
 
Fermi / HD 5870

Pixel: 26400 <-> 27200 MPix/s; 0.97:1 [according to Ailuros, Fermi has 48 ROPs]
Texel: 70400 <-> 68000 MTex/s; 1.04:1 [based on the rumour that Fermi has 128 TMUs]

=> Fermi will most likely have no other advantage compared to the RV870 except bandwidth. Therefore the difference in FLOPS rating can tip the scales in RV870's favour.

I might have missed something, but shouldn't roughly 1.5x the bandwidth in Fermi's favour lead to a 1.5x increase in texel filtering/fillrate in Fermi's favour? :???:
 
=> Fermi will most likely have no other advantage compared to the RV870 except bandwidth. Therefore the difference in FLOPS rating can tip the scales in RV870's favour.

I think the larger caches and better handling of dynamic branching will help Fermi with those shaders, at the very least.
 
GT200 had a lot of other advantages compared to the RV770, so back then the FLOPS rating was "meaningless":

GTX 285 / HD 4870

Pixel: 20736 <-> 12000 MPix/s; 1.73:1
Texel: 51840 <-> 30000 MTex/s; 1.73:1

Therefore, back then the FLOPS-rating advantage of the RV770 was not enough to overcome all the other disadvantages.

With Fermi the situation is most likely different. If the base-MHz/shader-MHz ratio is roughly the same as with the GTX 285, then the base frequency of the Fermi GPUs is ~550 MHz (this number also comes from Ailuros):

Fermi / HD 5870

Pixel: 26400 <-> 27200 MPix/s; 0.97:1 [according to Ailuros, Fermi has 48 ROPs]
Texel: 70400 <-> 68000 MTex/s; 1.04:1 [based on the rumour that Fermi has 128 TMUs]

=> Fermi will most likely have no other advantage compared to the RV870 except bandwidth. Therefore the difference in FLOPS rating can tip the scales in RV870's favour.


The ROP (pixel fillrate) advantage of the GTX 285 didn't show up in benchmarks. *edit* Well, it did as long as AA wasn't active, I guess.

With 16x AA (texel fillrates) the HD 48xx didn't take much of a performance hit, even though it had a considerable disadvantage compared to the GTX 285.
 
The ROP (pixel fillrate) advantage of the GTX 285 didn't show up in benchmarks. *edit* Well, it did as long as AA wasn't active, I guess.

If you want a really good example of this, take a look at the GTX 280 versus the GTX 295 (single GPU) and you'll see very minor performance benefits. The extra bandwidth/ROPs on the GT200 had diminishing returns past a certain point, while the TMUs/shaders showed much better scaling.

Chris
 
I talked to a bunch of semi guys, and all said they had never heard of that type of error, or that it was an early bring-up error, not a mid-process error. All were rather astounded that it didn't get caught immediately; metrology should have nailed it at the first check, or two or three in at the worst. No one had an explanation for how this could have happened.
TSMC's explanation is simple and reasonable, just read between the lines of their latest financial CC: because of high customer (i.e. AMD/NVIDIA) demand, they tried ramping the process too fast and therefore skipped *some* of the metrology to save time. This is classic upper-management pressure on engineers, telling them they need to hit a goal even though it's not realistic, and it coming back to bite them big time. No overly complicated theories needed.

If only AMD was allocated to those not-properly-tested chambers/tools (which I massively doubt), then that'd get a fair bit more suspicious, but even then it'd seem ridiculous to me because TSMC incurred large losses because of this problem. There's no way they did this voluntarily.

No, from the gaming point of view, it is a lot of transistors that are not driving frame rates up. It is a huge loss from an areal efficiency point of view.

If you NEED it, you need it. If you don't, it is a millstone. I would assume that more than 99.9% of users don't NEED it.
You are making a massive conceptual mistake here. What matters is not the percentage of users who need some functionality; it is the percentage of gross profit that derives from it. What you need to compare is the total gross profit you'd get from a gaming-only chip (via higher gross margins) versus the total gross profit from a gaming+HPC chip (via lower gaming gross margins but extra HPC gross profits).

Based on very realistic Fermi HPC revenue predictions, I think from that (correct) point of view, GF100's area efficiency is noticeably *higher* than if it were a gaming-only chip. On the other hand, its derivatives would be noticeably less area efficient if they couldn't remove the functionality, but they've indicated they could at least remove half-DP, which is probably the most important single element. GF100 would still be very slightly less power efficient for gaming, but that doesn't look like a big deal to me.
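Just to illustrate the arithmetic, here's a toy comparison; every number in it (volumes, prices, per-unit costs) is purely hypothetical, and only the shape of the comparison matters:

```python
# A toy illustration of the gross-profit argument above, with entirely
# hypothetical numbers (units, ASPs, and costs are made up for illustration).
def gross_profit(units, asp, unit_cost):
    return units * (asp - unit_cost)

# Gaming-only chip: smaller die, cheaper to make, gaming revenue only.
gaming_only = gross_profit(units=1_000_000, asp=300, unit_cost=120)

# Gaming + HPC chip: bigger die raises cost per unit, but adds a small
# number of very high-margin HPC/Tesla sales on top of the gaming volume.
gaming_part = gross_profit(units=1_000_000, asp=300, unit_cost=150)
hpc_part    = gross_profit(units=50_000,    asp=2_000, unit_cost=200)

print(gaming_only, gaming_part + hpc_part)
# The per-unit gaming margin is lower, yet total gross profit can end up
# higher once the HPC revenue is added, which is the point being made.
```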
 