NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Silus

    Banned

    Joined:
    Nov 17, 2009
    Messages:
    375
    Likes Received:
    0
    Location:
    Portugal
    That certainly isn't what I took from Bill Dally's words. Can ECC simply be switched off without removing the transistors behind it?
    DP capability can obviously be trimmed down by removing some of the Stream Processors, but why would Bill Dally mention this if the full-fledged Fermi chip was indeed powering the high-end GeForce too?

    Still, I don't disregard that possibility, so it's wait and see I guess.
     
  2. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,135
    Likes Received:
    573
    So? It can't get full throughput through the fp32 and int32 pipelines when not using DP ... you could just as easily say they did it like that because they needed the extra multiplier hardware for DP anyway. Seeing how little of the die is taken up by ALUs, though, I doubt it matters in the grand scheme of things. They'd have to get the area ratio of ALUs way up first.
    IMO, if patents don't get in the way, ATI will simply follow them next gen. ECC on caches is only a ~10% area overhead, and per-block ECC codes stored in DRAM itself (which I'm pretty sure is what they are doing now) are pretty much gratis except for the reduced bandwidth.
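
    For a rough back-of-the-envelope check, here is a minimal sketch; the word, block and code sizes are my assumptions for illustration, not confirmed Fermi figures:

```python
# Rough ECC overhead estimates -- illustrative only; the word, block and
# code sizes below are assumptions, not confirmed Fermi figures.

def secded_check_bits(data_bits):
    """Check bits for single-error-correct/double-error-detect (SECDED):
    smallest r with 2**r >= data_bits + r + 1, plus one extra parity bit."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

# On-chip SRAM protected per 64-bit word -> the familiar (72,64) code.
data_bits = 64
check = secded_check_bits(data_bits)                     # 8 check bits
print(f"SRAM ECC: {check} check bits per {data_bits} data bits "
      f"-> ~{check / data_bits:.1%} storage overhead")   # ~12.5%

# In-band DRAM ECC: codes stored in ordinary memory next to the data,
# e.g. 8 bytes of code per 64-byte block (assumed ratio).
block_bytes, code_bytes = 64, 8
frac = code_bytes / (block_bytes + code_bytes)
print(f"DRAM ECC: ~{frac:.1%} of capacity/bandwidth goes to codes")  # ~11%
```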
     
  3. Silus

    Banned

    Joined:
    Nov 17, 2009
    Messages:
    375
    Likes Received:
    0
    Location:
    Portugal
    Here, some of the points being discussed above are mentioned:

    http://www.brightsideofnews.com/new...mi-is-less-powerful-than-geforce-gtx-285.aspx

    In update #2, Theo seems to have talked with Mr. Andy Keane, General Manager of Tesla Business, and Mr. Andrew Humber, Senior PR Manager for Tesla products, and these points came up:


    • The memory vendor is providing a specific ECC version of GDDR5 memory: ECC GDDR5 SDRAM
    • ECC is enabled from both the GPU side and the memory side; there are significant performance penalties, hence the GFLOPS number is significantly lower than on Quadro / GeForce cards.
    • ECC will be disabled on GeForce cards and most likely on Quadro cards
    • The capacitors used are of the highest quality
    • Power regulation is completely different and optimized for use in rack systems - you can use either a single 8-pin or dual 6-pin connectors
    • Multiple fault protection
    • DVI was brought in on demand from customers to reduce costs
    • Larger thermal exhaust than Quadro/GeForce to reduce the thermal load
    • Tesla cGPUs differ from GeForce in having activated transistors that significantly increase sustained performance, rather than burst-mode performance.
    The third and last points certainly give the same hint that Bill Dally gave before: that GPUs used for Tesla and GeForce will differ in terms of features enabled. So the question is if this "enabling" can be done without removing actual transistors.
     
  4. compres

    Regular

    Joined:
    Jun 16, 2003
    Messages:
    553
    Likes Received:
    3
    Location:
    Germany
    Why do we have to wait so long to see this? :(
     
  5. Silus

    Banned

    Joined:
    Nov 17, 2009
    Messages:
    375
    Likes Received:
    0
    Location:
    Portugal
    Er...In my last post, I obviously meant "So the question is if "disabling" can be done without removing actual transistors."

    Seems I can't edit yet. Must be because I'm new here :)
     
  6. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,505
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Probably the INT32 ALUs will be throttled/disabled by -- let's say -- a factor of four, for the GF/Quadro SKUs?
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,875
    Likes Received:
    767
    Location:
    London
    No - typically for NVidia the register file bandwidth just isn't there.

    I partly agree - it's a question of the balance required for INT multiplication. INT multiplication seems to have been a troublesome bottleneck in earlier GPUs. If INT multiplication stayed in the special function unit, then it would be "too slow". NVidia's only choice then is to put it into the main pipe.

    So, in the end, NVidia's gained INT and DP capability through the addition of the INT32 unit and probably a super-wide adder for subnormals. The latter is, arguably, the only bit that's DP-specific. It seems to me about as costly as the DP-specific overhead in RV870.

    Because compute is part of graphics now, I think increased INT capability is justifiable for graphics, particularly as bytewise addressing is part of DirectCompute - 24-bit arithmetic isn't enough to address the largest resources that D3D11 supports, bytewise.
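
    A quick sanity check on that claim; the ~2 GiB maximum resource size is my recollection of the D3D11 cap, so treat it as an assumption:

```python
# Why 24-bit byte addressing falls short of the largest D3D11 resources.
# The ~2 GiB resource-size cap below is an assumption/recollection, not a
# figure taken from this thread.

reach_24 = 2 ** 24                          # bytes reachable with 24-bit addresses
print(f"24-bit byte addressing reaches {reach_24 // 2**20} MiB")   # 16 MiB

max_resource = 2 * 2 ** 30                  # assumed ~2 GiB D3D11 resource limit
bits_needed = (max_resource - 1).bit_length()   # width needed for the top byte address
print(f"A {max_resource // 2**30} GiB resource needs {bits_needed}-bit byte addresses")  # 31
```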

    The irony being that this architecture is meant to scale over the next 3-4 years. And the one thing that will definitely go up is the proportion of die taken by ALUs, since memory scaling hasn't got much breathing room. I can imagine a 512-bit variant, but not more.

    I expect a right marketing battle over ECC will ensue. I strongly believe it's a white elephant as there is still no public evidence that soft errors aren't the result of faulty hardware in GPU based systems.

    Jawed
     
  8. SlmDnk

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    588
    Likes Received:
    206
  9. Richard

    Richard Mord's imaginary friend
    Veteran

    Joined:
    Jan 22, 2004
    Messages:
    3,508
    Likes Received:
    40
    Location:
    PT, EU
    This shows two things:

    a) It will run DX11 benchmarks.
    b) It's big.

    So unless nVidia thought the community had any doubts about the above, why release this while conveniently leaving out the fps? Seems a bit... desperate is too strong a word, but definitely awkward. I'd expect a similar stunt from PowerVR/SiS back from the dead, or even LRB, which do have something to prove.

    Does nVidia believe it has something to prove?
     
  10. Sxotty

    Legend Veteran

    Joined:
    Dec 11, 2002
    Messages:
    5,087
    Likes Received:
    448
    Location:
    PA USA
    While that is true, there are still people who say it isn't working, doesn't exist, or cannot run code yet. Thus it addresses part a) as mentioned in your post. An elephant is big, but cannot run DX11 benchmarks. :)
     
  11. w0mbat

    Newcomer

    Joined:
    Nov 18, 2006
    Messages:
    234
    Likes Received:
    5
    Well, since we know how NV copes with this stuff (e.g. the fake Fermi board), I wouldn't take this as proof. Maybe they've got an HD 5870 under the table =D
     
  12. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London

    Another possibility:

    NVIDIA may have two distinct ASICs, with and without DP support (Fermi and GeForce).
     
  13. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,505
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Wow!
    Using DP arithmetic to compare GT200 vs. Fermi -- that's more a case of showing how much GT200 lacks in doubles throughput.
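
    For scale, a minimal peak-DP sketch; the unit counts and clocks are my guesses/recollections, not official numbers:

```python
# Back-of-the-envelope peak double-precision throughput.
# Unit counts and clocks are assumptions/recollections, not official specs.

def peak_dp_gflops(dp_fma_units, shader_clock_ghz):
    """Peak DP GFLOPS, assuming one fused multiply-add (2 flops) per unit per clock."""
    return dp_fma_units * 2 * shader_clock_ghz

gt200 = peak_dp_gflops(30, 1.3)        # assumed 1 DP unit per SM, 30 SMs, ~1.3 GHz -> ~78
fermi = peak_dp_gflops(512 // 2, 1.5)  # assumed DP at half the SP rate on 512 cores, ~1.5 GHz -> ~768
print(f"GT200 ~{gt200:.0f} vs. Fermi ~{fermi:.0f} DP GFLOPS, ~{fermi / gt200:.0f}x")
```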
     
  14. Tchock

    Regular

    Joined:
    Mar 4, 2008
    Messages:
    849
    Likes Received:
    2
    Location:
    PVG
    It's as simple as laser cutting or eFuse blowing. They did it for Quadro vs. GeForce; this should be similar.

    ECC should be possible to disable even on the Tesla, otherwise you wouldn't need to advertise separate ECC-on/off available-memory figures.

    Was talking to Farhan that day and he gives nV the benefit of the doubt that the DP FLOPS won't magically vanish. I'm more skeptical. :lol:

    If nVidia had something to show the gamer community other than the same GF100, why is it not being shown? Knowing them, it would have come first. Instead they're showing corner-case advantages vs. GT200. I know it's partly Sun Tzu, playing the cards you play best and leaving the rest to the imagination, but ATI plays every card they have in hand. One exudes confidence; the other doesn't.


    P.S.: This is kinda like Barcelona vs. Harpertown the more I look at it. At least GPU product cycles are faster. Hmm.
     
  15. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London


    The customer who bought the Tesla C10XX :shock:


    :razz:
     
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,580
    Likes Received:
    622
    Location:
    New York
    What's wrong with that, given the target market? There are people using Tesla's weak DP throughput today who would be interested in the comparison.
     
  17. Silus

    Banned

    Joined:
    Nov 17, 2009
    Messages:
    375
    Likes Received:
    0
    Location:
    Portugal
    Don't really see it as a problem. You also saw almost nothing about G80 until about 2-3 weeks before the actual launch, and look how that turned out. They obviously have A2 chips, but don't want to show the gaming bits yet, since A3 will be the chip that will...er...ship :)

    Showing HPC-specific tasks running on Fermi makes sense, since that's the new market that Fermi is trying to get into in full force, and it's highly profitable.

    All the fuss (most of it fueled by "articles" written by you-know-who) that NVIDIA was leaving the high-end market - because of supply constraints that affected AMD too, though NVIDIA was of course the only one in trouble in those "articles" - and that NVIDIA showing only Fermi's "HPC bits" was further indication of this, was absurd. Fermi was designed with much more than gaming in mind, but it's definitely a gaming chip as well.
     
  18. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    17,641
    Likes Received:
    2,105
    Location:
    Winfield, IN USA
    Y'all really believe that's a working card just because nVidia says it is? :shock:

    Oh man! Unless they show the card, monitor, and the wire connecting them clearly in the shot I'm gonna be extremely skeptical; and even if they do that I'll check carefully to make sure it's not a photoshopped screencap placed on a set shot.
     
  19. Groo The Wanderer

    Regular

    Joined:
    Jan 23, 2007
    Messages:
    334
    Likes Received:
    2
    Umm, the people I talk to say they test about once every hour or four (it depends on wafer rates on the tool more than anything else). Now even if TSMC cut metrology to 1/10th of what it was, they should have caught it in a day at most.

    On top of that, you can't ramp a process without massive metrology input and feedback. If you are trying to up yields from crap to less crap, you NEED that feedback. Even if management tells the engineers to save a very small bit of time by skipping that step, all they will do is make sure process improvement goes from science to guesswork.

    To miss it for multiple months is not plausible. To not test is not plausible. To lessen tests to a degree that this would go undetected is also not plausible. If you have a good explanation for how you ramp a process and new equipment without feedback until the chip is done, let me know, we can make a lot of money on it.

    [conspiracy hat on] One scenario could be that someone will lose less money by paying TSMC to spike yields on the whole process than they would by their competitor eating them alive in the market.[conspiracy hat off] I am not saying this is happening, nor am I saying it is only affecting ATI, I am just saying that something is really really wrong. The explanations don't add up, or even come close.

    Now if they had said, "we are ramping new lines, and during that, XYZ", that would explain why output is not going up, but not why it went DOWN from what it was. Please note I am not talking about yield as a percentage of die candidates, but the overall number of dies coming off the line. That should not go down at all, ever, or at least not by a lot. It did.

    I was just referring to the graphics portion. I agree with what you say on the overall picture, for now. It will be a different game in ~6 months though, but I can't say why yet.

    Let's see how they do that. It is going to be funny to watch them spin that one. "It is _THE_ most important thing since the invention of knee pads," said one NV spinner when asked about Fermi, "but it is only important in chips measuring over 500mm^2 because of technical reasons that are 'beyond our scientific understanding'*". Spin till ya puke.

    -Charlie

    * They actually used that on me when they were trying to convince me that the bad bumps were not catchable at an earlier stage. Really. The other five process/packaging people I talked to all gave me an answer that was within the understanding of then-current science, and all five had the same answer too.
     
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Can anybody give a quarter-sensible reason how putting a DVI port on a Tesla will help reduce costs for supercomputers, when these babies cost ~$3K a pop? And it's not like they undercut Quadros on price either.
     