NVIDIA GF100 & Friends speculation

You really should watch the video. Huang didn't blame 40nm.

"The parasitic, uhm, characterization from our foundries and the tools and the reality are simply not related. At all. We found that a major breakdown between the models, the tools and reality"

i.o.w. TSMC, your 40nm sucks.

The part about no communication between the departments is moot if they follow the design rules properly. Especially since they already had 40nm parts out and should know bloody well what would happen with their GF100 design.
 
"The parasitic, uhm, characterization from our foundries and the tools and the reality are simply not related. At all. We found that a major breakdown between the models, the tools and reality"

i.o.w. TSMC, your 40nm sucks.
I think by now we all know that 40nm at TSMC really sucks.

The problem is two-dimensional: TSMC screwed up 40nm, but Nvidia still had the chance to correct the problems and proceed with production normally. Instead, Nvidia screwed up the management side, so the chip bought the farm!
 
"The parasitic, uhm, characterization from our foundries and the tools and the reality are simply not related. At all. We found that a major breakdown between the models, the tools and reality"

i.o.w. TSMC, your 40nm sucks.

The part about no communication between the departments is moot if they follow the design rules properly. Especially since they already had 40nm parts out and should know bloody well what would happen with their GF100 design.
First of all, based on his description and my understanding of this kind of thing, it is not 40nm specific and could just as well have happened to a lesser extent on a 65nm design if they had used a similar approach. And errr, this has nothing to do with design rules and everything to do with characterization. People really need to stop throwing "design rules" around as if that didn't mean anything and could be used in any context whatsoever.

This has more to do with TSMC's PDK *and* third-party tools not being accurate for such a ridiculously large number of wires next to each other. And that's probably not because TSMC or the tool provider were negligent; it's because simulation is necessarily always an approximation. It's a model, and there are necessary simplifications compared to reality for performance reasons (and implementation difficulty), so in extreme cases such as theirs it can be a big problem. The responsibility when you're doing something so unusual lies on the chip designer, i.e. NVIDIA.

Jen-Hsun clearly did not deny that responsibility. The solution he highlights is entirely internal to NVIDIA. For once he actually fesses up to a screwup publicly, and you're going to complain that he blames TSMC, which he obviously doesn't? That doesn't make any sense.
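Nobody outside NVIDIA knows which approximation actually broke down, but as a rough illustration of the general point (made-up numbers, plain textbook Miller-factor treatment of coupling capacitance, nothing specific to GF100), here's how far a wire's effective load can drift from a "nominal" model once it sits in a dense bundle of simultaneously switching neighbours:

```python
# Toy illustration only -- not GF100's actual failure mode. A wire's
# *effective* load depends on what its neighbours do: coupling capacitance
# Cc is seen at 0x, 1x or 2x (Miller factor) depending on whether a
# neighbour switches with you, stays quiet, or switches against you.
C_GROUND = 0.05   # fF/um to the layers above/below (assumed)
C_COUPLE = 0.10   # fF/um to EACH side neighbour (assumed dense min-pitch bus)

def c_eff_per_um(miller_left, miller_right):
    return C_GROUND + (miller_left + miller_right) * C_COUPLE

nominal = c_eff_per_um(1, 1)   # model assumes quiet neighbours
best    = c_eff_per_um(0, 0)   # both neighbours switch the same way as the victim
worst   = c_eff_per_um(2, 2)   # both neighbours switch the opposite way
print(nominal, best, worst)                                    # 0.25 0.05 0.45
print(f"worst case is {worst / nominal:.1f}x the nominal model")  # 1.8x
```

Extraction and timing tools do model this, of course; the point is only that every model picks its simplifications, and a design leaning unusually hard on one corner of the model is exactly where those simplifications stop being harmless.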
 

Well, he did say that TSMC's characterization and tools (although he didn't refer to them by name) had no relevance at all to reality, which is probably an exaggeration.

But essentially, I agree: the feeling I got from this interview is "we screwed up, because of poor management".
 
Sure, he said they had no relevance to reality *in their case*, and he named both 'foundries' and 'tools' separately, clearly implying a third party tool vendor (e.g. Synopsys) failed to find any problem. This is not a TSMC issue, even though obviously if TSMC or the tool vendor had made a better model (probably beyond what could be expected of them for a standard solution) that could have handled even this extreme case properly, it couldn't have occurred in the first place. I think Jen-Hsun's wording clearly implies he's not placing the blame on them though.
 
But now you're making it sound like he's saying that their engineers know diddly-squat about leakage and interference. Surely he jests.
 
He clearly said the problem was the communication with the physics-centric group. It's perfectly normal for most hardware engineers at a company not to know much more about parasitics/interference than theory (which obviously isn't enough to understand specific problems on its own) and what the tools tell them. There's no reason for every engineer to second-guess the tools. But at the same time, it is important for the specialised engineers with deep process and physics knowledge to check the more complex parts to understand and fix that kind of complex issue. That's where they badly screwed up, and it doesn't seem to me like he was trying to shift the blame for it.
 
I think Jen-Hsun's wording clearly implies he's not placing the blame on them though.
Actually, both sides are right: JHH accepted blame on Nvidia's behalf, but he also didn't miss the opportunity to take a stab at TSMC and the tool provider. That didn't surprise me; what was new and refreshing was that JHH sounded almost... sincere and honest :oops: Now I've seen everything ;) Nvidia pretty much never accepts blame; remember how during the bumpgate scandal they blamed the manufacturer, the OEMs, and even the customers? I guess the Fermi troubles humbled even a guy with a massive ego like JHH.
 
I appreciate all the interesting opinions, conclusions and technical discussion, but are we sure that he's telling the truth and that these statements are unbiased (this time)?
 
Well they released on A3, not A2, so presumably it wasn't the *only* problem. Given that I can promise you the demo at GTC09 was on a real Fermi, I'd tend to believe the chip was still usable enough to debug fully before A2 and they just didn't manage to fix everything before A3 - but certainly it might have been a bigger limitation than I realise.

Of course, it's rather funny we are all debating this detail when Jen-Hsun dropped a huge architectural bombshell about GF100 that has never been publicly pointed out in that video, but everyone's just going to assume he can't count instead. How ironic! :D
 
GF100 A1 only had 256 CC :runaway: (@0:37 "Clusters of SMs, 256 Processors")
 
Which is more likely, then?
1) Jen-Hsun made a boo-boo
2) GF100 was originally designed to have only 8 SMs?

edit: 10 doesn't go evenly into 256
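For reference, a quick sanity check on those numbers, assuming the shipping GF100 organisation of 32 CUDA cores per SM (true of the chip that launched, though not necessarily of A1):

```python
# Assuming the shipping GF100 layout: 32 CUDA cores per SM, 16 SMs in the full die.
cores_per_sm = 32
full_gf100 = 16 * cores_per_sm     # 512 cores in the full chip
gtx480     = 15 * cores_per_sm     # 480 cores with one SM disabled
print(256 / cores_per_sm)          # 8.0 -> "256 processors" would mean 8 SMs
print(full_gf100, gtx480)          # 512 480
```

Taken literally, the "256 processors" line implies half the SM count of the chip that eventually shipped.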
 
It will be retroactively applied to all products. I thought it was very rare for NV to make mistakes? :)
 
Well, after NV30 they said how many mistakes they had made and what they had learnt from them... NV40 will be perfect, etc. At the time of G71 they said something about the great efficiency of their products and how ATi still had a lot to learn... At the launch of G80, they said that the previous architectures were wrong and that they had learnt from their mistakes. In 2009, Fermi was marketed as the silicon incarnation of Christ; now it's a buggy, low-efficiency chip whose successor will be up to eight times more fragrant... Is this some kind of marketing, or are they serious?
 
This may come as a surprise, but the set of target apps is a moving goalpost. What's efficient for one set of workloads isn't necessarily efficient for workloads that have yet to be created. G71 is a great chip for the DX9 workloads of its time; it would struggle to run DX11 apps, though. Similarly, G80 is great for DX10 but would suffer under the more geometry-heavy world of DX11. And that doesn't even count all the CUDA-related features and workloads.

What's more, different things scale differently across process nodes. For example, logic gates used to be expensive and wires nearly free, which pointed to one particular kind of architecture as efficient; on a newer node where logic is cheap and wires are expensive, a different architecture is needed to be efficient.
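As a back-of-the-envelope illustration of that last point (classic constant-field scaling assumptions, purely illustrative numbers, nothing measured from any real process):

```python
# Rough sketch of why wires get relatively more expensive than logic as
# processes shrink, under classic constant-field scaling assumptions:
# gate delay improves ~1/S, while the RC delay of a wire of *fixed* physical
# length grows ~S^2 (R per unit length ~S^2, C per unit length ~constant).
# S is the linear scale factor between nodes; numbers are purely illustrative.

def relative_delays(S):
    gate = 1.0 / S          # intrinsic gate delay shrinks with the node
    global_wire = S ** 2    # wire whose physical length stays the same
    local_wire = 1.0        # wire whose length also shrinks by 1/S
    return gate, global_wire, local_wire

for S in (1.0, 1.4, 2.0, 2.8):
    gate, global_wire, local_wire = relative_delays(S)
    print(f"S={S:<3}  gate={gate:.2f}  global wire={global_wire:.2f}  "
          f"local wire={local_wire:.2f}  wire/gate ratio={global_wire / gate:.1f}x")
```

The growing wire-to-gate ratio is exactly why an architecture that was comfortably wire-heavy on an older node can find its wires becoming the bottleneck on a newer one.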
 
This has more to do with TSMC's PDK *and* third-party tools not being accurate for such a ridiculously large number of wires next to each other. And that's probably not because TSMC or the tool provider were negligent; it's because simulation is necessarily always an approximation. It's a model, and there are necessary simplifications compared to reality for performance reasons (and implementation difficulty), so in extreme cases such as theirs it can be a big problem.

What tools are used to measure such parasitic couplings? Is this mostly done by computing accurate capacitive coupling between nets with a tool like Magma QuickCap or Synopsys Raphael NXT, or is it a different kind of problem?
 
He clearly said the problem was the communication with the physics-centric group. It's perfectly normal for most hardware engineers at a company not to know much more about parasitics/interference than theory (which obviously isn't enough to understand specific problems on its own) and what the tools tell them.

This really isn't true. Pretty much everyone should be aware of the CLR (capacitance/inductance/resistance) of the wiring stack, as it affects everything from architecture to layout. If they're saying they really screwed up the modelling and tool configuration and never did any back-of-the-envelope calculations (math is always good), then that is a monumental screw-up of fairly staggering proportions. Everywhere I've ever worked, huge attention is paid to wires/interconnect from the very beginning of the architecture flow. Things like half/full/etc. shielding and which wires to interleave are figured out very early on. It's been pretty fundamental for multiple process generations now.
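For what it's worth, here's the shape of the back-of-the-envelope wire arithmetic being described, with completely made-up geometry and material numbers (nothing here is from GF100 or any real 40nm metal stack); the only point is how quickly side-to-side coupling comes to dominate on a dense, minimum-pitch bus:

```python
# Made-up illustrative numbers -- NOT a real 40nm stack. Resistance from
# resistivity and cross-section, crude parallel-plate capacitances to the
# planes above/below and to the two side neighbours, then a distributed-RC
# delay estimate for an unrepeated 1 mm run.

EPS0 = 8.854e-18          # vacuum permittivity in F/um

def wire_parasitics(length_um, width_um, spacing_um, thickness_um,
                    rho_ohm_um=0.022,   # assumed effective Cu resistivity, ohm*um
                    eps_r=3.0):         # assumed low-k dielectric constant
    r = rho_ohm_um * length_um / (width_um * thickness_um)              # ohms
    c_ground = 2 * eps_r * EPS0 * (width_um * length_um) / thickness_um # up+down
    c_couple = eps_r * EPS0 * (thickness_um * length_um) / spacing_um   # per side
    return r, c_ground, c_couple

r, cg, cc = wire_parasitics(length_um=1000, width_um=0.07,
                            spacing_um=0.07, thickness_um=0.14)
c_total = cg + 2 * cc                     # one neighbour on each side
print(f"R ~ {r/1e3:.1f} kOhm, C ~ {c_total*1e15:.0f} fF for the 1 mm run")
print(f"distributed-RC delay ~ {0.38 * r * c_total * 1e12:.0f} ps (0.38*R*C)")
print(f"coupling fraction ~ {2*cc/c_total:.0%} of total C "
      f"(a worst-case Miller factor doubles what the victim wire sees)")
```

Even with numbers this crude, it's obvious that the neighbour-to-neighbour coupling, not the ground capacitance, is most of the load on a wire like this, which is exactly the sort of thing the physics-minded people are supposed to sanity-check before anyone commits to a sea of parallel min-pitch wires.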
 