NVIDIA Fermi: Architecture discussion

Nope, that had nothing to do with processor design; it was due to poor cooling design.

No, you are wrong, it is the same bump cracking problem as NV has, and could have been cured by a change in underfill, but MS seems to have chosen not to do that, and waited for a full shrink. It is not heat, it is heat cycling once again, but since NV still claims that this is beyond the science of mortal man, I can understand if you hadn't heard that.

ATI designed the ASIC/GPU, and offered to do the layout/physical side of things. MS declined and did it in house. I know several of the guys behind the 360 chips, and they are smart and capable, but I think they had little built-up experience with large and hot chips like this, especially on cutting edge processes. I have been told all the details of the failure in painful and exacting detail, and all I have to say is that it is thermal-stress-related bump cracking, not overheating/temperature-related failures.

MS did it internally, and taught the industry a lesson. Some companies learned that lesson, others did not.

-Charlie
 
Before you intervened to say that those alleged rule violations were about the 65 to 55nm shrink, there was a very small possibility that things were not done as they should have been for 40nm. After all, it's a new process, and a lot of new things have to be designed with a lot of uncertainty at the beginning.

But the fact that it was about 55nm closed the door on that. And even more so the fact that it was a shrink of a chip that actually went into production. That's because a shrink is laid out in a 65nm flow and shrunk after the fact.

I am aware of this, and that the rules dictate a few extra limitations if you are planning a shrink vs. not. Let me turn the question around on you then: if it wasn't rule-skirting, can you tell me why the G200 took three spins to go from 65 -> 55? Neither process was new; both had volume products and several large chips out on them at the time, and were really known quantities.

Why three spins on what should have been a button press?

Can we at least give them the benefit of the doubt that they're not bat shit insane?

Nope, sorry.

That *if* they decide to change process rules, it's at least because they hope to get some benefit from it? Such as:
- accelerate schedule: NO. On the contrary. Changing process rules requires you to redesign your standard cells or your memory generators. That would take many, many man-years.
- increase speed/performance: NO. You don't need to violate process rules to trade off increased speed or power consumption at the cost of yield. It's a simple matter of asking the fab to skew the process. (If you want to know: you ask the fab to increase or decrease the N or P doping level.) No need to violate anything. In fact, this is exactly what a fab does when you order corner lots for prototype characterization.
- reduce area: NO. The potential additional area gain over a shrink is too small. And you run into the same objection as with schedule, because this too would require redesigning your full library.
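The corner-lot point above can be sketched numerically. This is a toy Python model with invented multipliers (SS/TT/FF are the standard slow/typical/fast corner names; none of the numbers are real process data):

```python
# Toy model of process-corner skew: the fab shifts doping to trade speed
# against leakage/yield without touching a single design rule.
# All multipliers below are invented for illustration.
CORNERS = {
    "SS": {"delay_x": 1.15, "leakage_x": 0.6},  # slow-slow: safer, less leaky
    "TT": {"delay_x": 1.00, "leakage_x": 1.0},  # typical-typical
    "FF": {"delay_x": 0.85, "leakage_x": 2.5},  # fast-fast: quicker, leakier
}

def corner_fmax(nominal_fmax_mhz: float, corner: str) -> float:
    """Achievable clock for the same netlist at a given corner (toy model)."""
    return nominal_fmax_mhz / CORNERS[corner]["delay_x"]

# The same design, skewed fast, clocks higher -- no rule violation needed.
for c in ("SS", "TT", "FF"):
    print(c, round(corner_fmax(600.0, c)), "MHz")
```

The point being: speed/yield trade-offs of this kind are a standard fab service, which is why skewing, not rule violation, is the rational way to chase frequency.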

So, please do give me a plausible reason why an irrational egomaniac executive would order his minions to violate process rules for a shrink because I can't come up with any.

An irrational egomaniac executive would ask his yes-men to cut corners in established procedures to accelerate schedules. (e.g. tape-out when certain blocks are known not to be completely verified, reduce hold margins in the fast timing corner, waive inconvenient noise numbers, etc.) But violate process rules for no benefit? No way.

It's really mind boggling that you're making up these stories without bothering to check if they actually make technical sense.

How about a simple one, the chip would not have worked on the given process, and they forced the issue. I have heard (anecdotally, so it could very well be wrong) that NV shopped G200 around because TSMC initially refused to make it on size grounds. It was only on the third try that it was accepted. If true, that is one reason.

-Charlie
 
It always seemed very likely to me that what Charlie's source meant wasn't design rule violations, but simply not following as many DFM suggestions as TSMC thought would be desirable. It's easy to see how this kind of thing can get lost in rumour mill intermediaries (or even more simply in translation...) - given what TSMC claims about 55nm, I cannot imagine that the design rules are different at all for digital logic, but I could imagine that they added some DFM rules and NVIDIA decided not to bother. If this is correct though, it would hardly be as big a deal as what Charlie makes out of it.

As I said to Silent Guy above, if not a DFM rule violation, why did G200b take three spins on a well characterized process?


There have been tidbits implying analogue & I/O are actually the most problematic parts on 40nm because of how greater variability affects them (one example: http://danielnenni.com/2009/08/05/tsmc-40nm-yield-explained/). Given that AMD had more impressive PHY results on 65/55nm than NVIDIA (RV770 PHY being barely bigger than NV's GDDR3-only PHYs for example), we could very naively assume that maybe NVIDIA's team isn't as talented (I obviously mean no disrespect) and that they're responsible for the GT215 delays (their only GDDR5 chip pre-Fermi) for example.

I heard the GDDR5 controller was broken.

-Charlie
 
G92 supports OpenGL 2.1, GT200 supports 3.0, GT21x supports 3.2

GT21x is NOT G92 based.

The only reason the G92 marketing stuff says OpenGL 2.1 is because that's what was available when the card came out. Every time NVIDIA releases a driver update, Taiwanese manufacturers don't go around updating web pages for older cards.

The same thing happened to NV40. When it came out, the OpenGL version of the day was 1.4-1.6, and that's what the websites said, even though the cards were later bumped up in spec to OpenGL 2.1.



http://developer.nvidia.com/object/opengl_3_driver.html

"OpenGL 3.2 Driver Release Notes

You will need one of the following graphics cards to get access to the OpenGL 3.2 and GLSL 1.50 functionality:

Desktop

* Quadro FX 370, 570, 1700, 3700, 4600, 4700x2, 4800, 5600, 5800, Quadro VX200, Quadro CX
* GeForce 8000 series or higher; Geforce G100, GT120, 130, 220, GTS 150, GTS 250, GeForce GTX 260 and higher, any ION based products. "
 
You are right, TSMC and NV would not be that stupid, only one of them would. Especially if management was being told something else.

-Charlie

No, I am painting a picture of smart engineers and myopic management convinced of their own invulnerability. Bad and/or retaliatory management gets told what they want to hear.

So, if I am so wrong, why does ATI have simply fabulous yields on their parts? Should you answer something like "They are not, and it is TSMC's fault", can I ask how anyone with a shred of sanity would launch a chip that is almost 2x as large on the same process?

-Charlie


Management just doesn't override procedures for the benefit of their clients without justifiable reasons to do so. It's that simple. And pressure from the client isn't reason enough.

ATI still has yield issues, although they are getting better. And where do you get 2 times the size? You know Fermi's top-end chip isn't 2 times the size of AMD's top end, right?
 
I am aware of this, and that the rules dictate a few extra limitations if you are planning a shrink vs. not. Let me turn the question around on you then: if it wasn't rule-skirting, can you tell me why the G200 took three spins to go from 65 -> 55? Neither process was new; both had volume products and several large chips out on them at the time, and were really known quantities.

Why three spins on what should have been a button press?



Nope, sorry.



How about a simple one, the chip would not have worked on the given process, and they forced the issue. I have heard (anecdotally, so it could very well be wrong) that NV shopped G200 around because TSMC initially refused to make it on size grounds. It was only on the third try that it was accepted. If true, that is one reason.

-Charlie


The A2 respin came back around a month after A1. Do you think they were worrying about design rules? If that was the true problem, they would be screwed, and it would have been the first thing they looked into for the A2 spin! Come on, even you stated what the timing of the respins was, so either you believe everything you spout, which has no correlation with anything else, or it's just crazy talk.
 
Management just doesn't override procedures for the benefit of their clients without justifiable reasons to do so. It's that simple. And pressure from the client isn't reason enough.

Yeah the theory of management being overly arrogant doesn't really fly. It's not like Jen Hsun (or any CEO) would completely ignore what his staff tells him based on a whim. It's more likely that the arrogance is coming from lower down in the chain. Though it doesn't really matter either way does it? :)

There is no way you need 42% of the company just to update the web site.

:LOL: Nice, good to see it's not all dark and gloomy with you.
 
If anyone brings up pixel counting to measure the board, I swear something bad is going to happen.

One could use the SLI-bridge as a unit of measure to calculate the length of the cards in that second system pic. :LOL:
 
Why three spins on what should have been a button press?
Shrinks are a button press from the digital point of view, and easier in other respects too, but there are still risks because you have to go through a full place-and-route cycle again. Whatever was marginal in the original chip can become a real issue after a shrink. So pick your poison: hold fixes, ESD, noise, PLLs, DLLs, speed-path fixes, IO pads. Any of those is way more likely than your wacky theories.
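The hold-fix item is the classic case: gates get faster after a shrink, so a short path that barely met hold timing on the original process can fail on the shrunk one. A toy slack calculation illustrates the mechanism (all delay numbers are invented; this is not how a real STA tool works):

```python
# Toy hold-time check: data at a register must stay stable for `hold_ns` after
# the clock edge; slack = shortest-path delay - (clock skew + hold time).
# A negative slack means the new data races through and corrupts the capture.
# All numbers below are invented for illustration.

def hold_slack(short_path_ns: float, skew_ns: float, hold_ns: float) -> float:
    return short_path_ns - (skew_ns + hold_ns)

# Original chip: the shortest path arrives 0.30 ns after the clock edge.
print(round(hold_slack(0.30, 0.05, 0.20), 3))  # 0.05 ns: marginal but passing

# After a shrink the same gates are ~15% faster; the margin nearly vanishes.
print(round(hold_slack(0.30 * 0.85, 0.05, 0.20), 3))  # 0.005 ns: a respin risk
```

Which is why "marginal in the original chip" paths are exactly the ones that come back to bite during what looks like a button-press shrink.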

How about a simple one, the chip would not have worked on the given process, and they forced the issue.
Makes no sense: the 65nm version worked. The 55nm version has the same electrical characteristics and is smaller.

Only things that came out after that were related to power distribution, all inherent to the design of GT2/21/3.
Aaah, so now it's power distribution? Which is inherently linked to the design of the chip? Well, I'm happy to hear that at least it didn't have something to do with leakage.

...the A2 revision of the 216 wasn't supposed to be the last. They tried another spin but couldn't get the issues out, and after that decided to go ahead with the A2. 215/6/8 would've been a lot more competitive on speeds if it didn't have these "issues". And it would certainly have been here a lot quicker.
Two days ago, you first manage to cook up an authoritative sentence that makes no technical sense... about GT215.

You immediately follow up with:
(no I'm not one of the sources)
"yes, I don't know the first thing about the subject at hand, I'm just parroting some other goofball who's name I can't remember."

And now you come up with yet another story about yet another chip of which you know that there was a third spin?

I give up.
 
As I said to Silent Guy above, if not a DFM rule violation, why did G200b take three spins on a well characterized process?
silent_guy's first paragraph in his latest reply puts it much better than I ever could, but I just want to make sure this is a typo/brainfart: you do realize that DFM is, by definition, not a *rule* - right? DFM manuals are full of *suggestions*. If you don't follow any of them, chances are your design is going to be a disaster yield/leakage/etc.-wise and if you follow every single one of them in the most conservative way possible, chances are you'll leave some area/power on the table but get extremely good yields (assuming everything else works out too ofc).

If your scenario is really that TSMC told NV they really should bother following these few extra DFM points, and NV said they didn't believe it was worth the extra engineering effort and/or area, then that's a perfectly normal thing to happen and it's ridiculous to make a big deal out of it. If it was the sole reason why they needed 2 respins then maybe, but I'm still skeptical that's true (and while you do seem to have heard that it was a problem, you seem to phrase everything as if you concluded it by logical elimination - which, as silent_guy clearly explains, doesn't work here).

Of course, you probably know this and you simply typed 'DFM rule violation' by mistake - if so, just ignore what I said here. Better safe than sorry though...
 