Nvidia GT300 core: Speculation

Status
Not open for further replies.
Because FPGA's are in that respect similar to RAMs? Highly repetitive structures that lend themselves extremely well to repair with redundancy?

Yes, but so do GPUs. The only difference is that their highly repetitive structures contain a lot of logic and so forth.

If you have a faulty SM, you can just disable that TPC.

David
 
Is there any significance in the $270M of deferred earnings from half-price Windows 7 software that has been booked into Microsoft's future earnings? Surely if even 10% of these people do a significant upgrade along with Windows 7, it could mean a bit more than a blip in hardware sales?
 
Yes, but so do GPUs. The only difference is that their highly repetitive structures contain a lot of logic and so forth.
I think, though, that GPUs have way more non-repetitive structures than FPGAs, so protecting against failures in all areas of the chip might be more difficult.
And btw do we even have rumoured yield rates for these fpgas or die size estimations? Since the 40nm fpgas are all premium parts for now even lower-than-expected yields might be acceptable.
 
Yes, but so do GPUs. The only difference is that their highly repetitive structures contain a lot of logic and so forth.

If you have a faulty SM, you can just disable that TPC.
Oh, come on.

A high end GT200 has 10 TPC's and a lot of non-repetitive units.
A high end Stratix IV has 650000 LE's, 1360 multipliers and very few non-repetitive units.

Yet the only difference you see is what kind of logic the units contain?

The amount of repetitive units (at least two orders of magnitude difference) and the ratio Area(non-rep)/Area(rep) account for something.

You don't need a spreadsheet to see that the numbers work out completely different in a very low yield situation.
 
I didn't claim otherwise. But eventually, the parameter you care about is timing. It's all those effects rolled up into one number.
Chalnoth wrote that you need to care about SI and cap when doing a shrink. No, you don't, not explicitly: it's a given that those will be fixed as part of your standard flow.

Cap, no, that is just an input. SI sometimes can have a floorplan impact though.

Synth may be a single-threaded process, but it's not as if you're going to run 'compile' on your top level and have one process churn through the full chip.

Depends on the size of the blocks you are running synth on. Lots of trade-offs to be found but inevitably, you'll end up with one block that does take significant time.

Yes, for analog blocks and custom circuits. For vanilla standard cell RTL designs that run below, say, 1GHz? If you stay within the same process class, you don't see frequency reductions when shrinking in 99% of cases.

Well, if you aren't trying to do anything fancy, sure, it's not bad, but generally you are pushing something, be it cycle time, gates per cycle, etc. None of the GPU vendors are doing vanilla standard cell RTL designs. They are highly architected and micro-architected complex pipelines. It's not like they are doing some FPGA->ASIC conversion. Some of the things they are doing are very, very complex, like their memory controllers for instance.

You work in a different environment, where 3GHz+ clocks are normal and heavy pessimism in process characterization is not acceptable. That's not where pretty much all fabless companies live. They get to live with the libraries given to them by fabs that prefer to make their process look a bit slower up front instead of having to explain later why silicon doesn't perform at speed in all corners.

Funny, all the stories I hear from various sources are that they still end up having to explain why the silicon doesn't perform in all corners.

True, historically, the work I've done has been on much tighter cycle times relative to the process. But it isn't like the graphics companies aren't pushing cycle times either; they are just pushing them with 2-2.5x the number of gates per cycle. While that gives some advantage, in that the non-characterized/random impact of a gate here or there matters less, it does still happen. And Nvidia has in the past pushed frequency quite hard for the processes they are on, with things like near-2 GHz FMA pipes in 65nm.
 
Oh, come on.

A high end GT200 has 10 TPC's and a lot of non-repetitive units.
A high end Stratix IV has 650000 LE's, 1360 multipliers and very few non-repetitive units.

Yet the only difference you see is what kind of logic the units contain?

NVDA has a long history of taking parts with defects and basically down binning them. They can certainly do it should the need arise.

I'm fully aware of the difference in scope between an FPGA and a GPU.

Those points are really orthogonal though. What we were talking about was the root cause of delays on 40nm. My observation was that Altera was doing fine, while NV was not.

A reasonable conclusion is that the difficulties are caused by some combination of:
1. Different process tech options used
2. More complicated design of the GPU


The amount of repetitive units (at least two orders of magnitude difference) and the ratio Area(non-rep)/Area(rep) account for something.

You don't need a spreadsheet to see that the numbers work out completely different in a very low yield situation.

What areas in a GPU cannot be disabled when there's a flaw?

Guess: triangle setup engine, global control scheduler

What areas in a GPU can be disabled when there's a flaw?

Guess: TPC, ROP, TMU

I'm not saying that there's the same degree of redundancy in a GPU as in an FPGA, just that I think GPUs have quite a bit of resilience to defects when you down-bin them.

DK
 
For a start you can use an arbitrary amount of redundancy within the structure of the FPGA, whereas a GPU's wildly varying types of functional units don't all have the scale to easily offer such flexibility.

Do you know the redundancy built into Altera's designs?

Fundamentally, because FPGAs are a bounded design space that depends on configuration for functionality, as long as the configuration stream is working you can have programmatic redundancy over almost the entire design within a certain frequency margin, which tends to have a large envelope. Therefore, assuming you implement multiple redundancy in the load flow past the load controller, you should be able to achieve near-100% salvage. Things like FPGAs and SRAMs only validate a process's ability insofar as they tell you it isn't completely broken.
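As a toy illustration of that kind of configuration-time redundancy, here is a hypothetical loader that remaps logical LE's past defective physical sites. The scheme and all names are invented for illustration; this is not Altera's actual repair mechanism.

```python
# Sketch: configuration-time redundancy in an FPGA-like array.
# A logical design needs n_logical working logic elements (LEs); the
# bitstream loader skips any physical site flagged defective at test time.
# Purely illustrative, not any vendor's real repair scheme.

def map_logical_to_physical(n_logical, defective, n_physical):
    """Assign each logical LE to the next working physical site."""
    mapping = {}
    phys = 0
    for logical in range(n_logical):
        while phys in defective:
            phys += 1            # skip defective sites
        if phys >= n_physical:
            return None          # not enough working sites: die is scrap
        mapping[logical] = phys
        phys += 1
    return mapping

# 20 logical LEs, 22 physical sites (2 spares), sites 3 and 7 defective:
m = map_logical_to_physical(20, {3, 7}, 22)
```

With two spare sites the design still maps (logical LE 3 lands on physical site 4, and so on); with three defects and only two spares the loader returns `None` and the die cannot be salvaged.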
 
That's even optimistic. To 'validate' software is hard, to validate hardware is harder ;)

Actually, it is easier to validate hardware. This is for a number of reasons, not the least of which is that hardware is designed with a very limited language set. Software tends to use languages that weren't designed with validation in mind at all, and most programmers have very little idea what the code they write is actually doing. Hardware designers, on the other hand, tend to have a much closer understanding of what they are designing.

It's easy to show numerous examples: for hardware, things such as linked lists and content-addressable memories (CAMs) are fairly exotic and limited. For software, these types of structures are considered basic ingredients, and the actual structures used can be much more complex.
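A small illustration of that asymmetry: a CAM, which is a specialised and area-hungry block in hardware, is an everyday lookup-by-content in software. The TLB-style example here is invented for illustration.

```python
# In hardware, a content-addressable memory (CAM) is a specialised,
# expensive block; in software the same lookup-by-content is just a dict.
# Hypothetical TLB-like mapping: virtual page -> physical page.
tlb = {0x1000: 0xA000, 0x2000: 0xB000}

hit = tlb.get(0x2000)    # "CAM hit": content found, translation returned
miss = tlb.get(0x3000)   # "CAM miss": content absent, returns None
```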

The effective calling conventions for hardware are likewise simple and generally easy to formally verify, whereas the same cannot be said of software, with complex interactions such as inheritance and the like. It's amazing what happens when you ask a software engineer what a given variable actually is. A hardware engineer can generally tell you immediately.

And this concludes my rant on why OOP is just plain bad and wrong and the root of all evil. Even the experts don't know what it is doing half the time.
 
What areas in a GPU can be disabled when there's a flaw?
Guess: TPC, ROP, TMU
It's not that simple: a TMU is assigned to a TPC. A ROP is assigned to a memory controller. It's possible that there is a way to mux ROPs to different MC's for this reason, but in recent products, there haven't been popular products with MC's disabled anyway, so it doesn't really matter.

So your redundancy lies only in the TPCs, for which there are 10 at most. There haven't been many products where more than 1 has been disabled either.

How much area is in the TPC's? 60%? So 40% of your die is already exposed to single defects and the rest can sustain 1 at most (unless two fall within the same TPC, of course. And I'm conveniently ignoring RAMs. ;))

That's already much better than no redundancy at all and will help big time when yields are good, but it helps little if each die sees 2 or 3 or 4 defects.

The granularity of an FPGA makes it easy to get decent yields with 10 defects or more. And the final good die number is not a linear function of the number of defects either.

It's a totally different ball game.
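The non-linearity point can be sketched with a toy Monte-Carlo model. All numbers below (cluster counts, disable budgets, 4 defects per die) are illustrative assumptions for comparison, not real fab data.

```python
import math
import random
from collections import Counter

def poisson(rng, lam):
    """Poisson variate via Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def yield_mc(n_clusters, salvage, lam, trials=50_000):
    """Fraction of simulated dies the `salvage` rule can save.
    Defects per die ~ Poisson(lam); each defect lands in a uniformly
    random equal-area cluster. Toy model, illustrative only."""
    rng = random.Random(1)
    good = 0
    for _ in range(trials):
        hits = Counter(rng.randrange(n_clusters)
                       for _ in range(poisson(rng, lam)))
        if salvage(hits):
            good += 1
    return good / trials

# GPU-like: 10 TPC clusters, ship with at most 2 disabled.
gpu = yield_mc(10, lambda hits: len(hits) <= 2, 4.0)
# FPGA-like: many small clusters, one spare each -> survives any pattern
# in which no single cluster takes two defects.
fpga = yield_mc(68, lambda hits: not hits or max(hits.values()) <= 1, 4.0)
```

Under these assumptions the fine-grained part salvages the large majority of dies at 4 defects/die, while the coarse-grained part loses most of them, which is the "totally different ball game" in numbers.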
 
It's not that simple: a TMU is assigned to a TPC. A ROP is assigned to a memory controller. It's possible that there is a way to mux ROPs to different MC's for this reason, but in recent products, there haven't been popular products with MC's disabled anyway, so it doesn't really matter.
G80, GT200 and GT200b have all featured products with one 64-bit channel disabled, including the corresponding quad ROP. The initial (first released) drop-down SKU products based on such a chip also had one or two TPC's disabled as well; these are generally the volume SKU's.

So your redundancy lies only in the TPCs, for which there are 10 at most. There haven't been many products where more than 1 has been disabled either.
The 8800 GTS (G80) disabled two TPC's. GT200's initial (65nm) version of the GTX 260 also disabled two.
 
It's not that simple: a TMU is assigned to a TPC. A ROP is assigned to a memory controller. It's possible that there is a way to mux ROPs to different MC's for this reason, but in recent products, there haven't been popular products with MC's disabled anyway, so it doesn't really matter.

RE: disabled memory controller/mapping out a ROP

Are you sure? When I look at the GTX 260 and 250, it sure looks like it has one of the 64b memory channels disabled....

I don't think that NV would tape out a separate part with 448b memory interface vs. 512b interface.

If you have a bad ROP, just make a narrower memory interface. Ditto for a defect in certain parts of the memory controller.

RE: mapping out TMU.

If your TMU is dead, then map out the TPC!

So your redundancy lies only in the TPCs, for which there are 10 at most. There haven't been many products where more than 1 has been disabled either.

I think we disagree here. By my calculation, the ROPs, TMUs and SMs account for 72% of the area of the die and are all redundant.

I'm assuming the PHYs, memory controllers, thread scheduler and triangle/setup engine are not redundant.
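The 72% figure can be turned into a quick yield estimate under an assumed Poisson defect model (illustrative, not real fab data): if we ignore the finite disable budget, a die survives exactly when no defect lands in the non-redundant 28% of the area.

```python
import math

# If 72% of the die area is redundant (can be fused off) and 28% is not,
# a defect kills the die only when it lands in the exposed 28% (ignoring
# the limited number of disable slots). With lam defects per die landing
# uniformly at random, the salvageable fraction is exp(-exposed * lam).
def salvageable_fraction(lam, exposed=0.28):
    return math.exp(-exposed * lam)

# Assumed 2 defects/die:
with_redundancy = salvageable_fraction(2.0)             # exp(-0.56) ~ 0.57
no_redundancy = salvageable_fraction(2.0, exposed=1.0)  # exp(-2.0)  ~ 0.14
```

So under these assumptions redundancy roughly quadruples the number of sellable dies at that defect rate, even before counting the limits on how many units can actually be disabled.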

However, I think your next point is more significant.

The granularity of an FPGA makes it easy to get decent yields with 10 defects or more. And the final good die number is not a linear function of the number of defects either.

It's a totally different ball game.

Agreed - being able to sustain 10 defects/die is a huge advantage and does make it somewhat different.

Anyway, going back to the higher level point...NV has said they expect that 25-30% of their products will be on 40nm by the end of the year. They also said that their chipsets will all stay on 55nm.

So it does seem like they are confident the problem will be solved soon.

DK
 
Actually, it is easier to validate hardware. This is for a number of reasons, not the least of which is that hardware is designed with a very limited language set. Software tends to use languages that weren't designed with validation in mind at all, and most programmers have very little idea what the code they write is actually doing. Hardware designers, on the other hand, tend to have a much closer understanding of what they are designing.
Hardware can be very difficult to validate. Consider GPUs: they are very complex state machines. Since the current results can depend on state across many different blocks, bugs can be quite tricky to track down. Sure, a hardware designer may know what they are designing, but that doesn't mean they know everything that's going on in the chip that depends on their result, or that they are depending on for theirs.

Also, I have seen all sorts of "bizarro" HW bugs over the years. The advantage of SW bugs is that you can always release a new driver; HW bugs can be much more difficult to resolve.
 
Its amazing what happens when you ask a software engineer what a given variable actually is. A hardware engineer can generally tell you immediately.

And this concludes my rant on why OOP is just plain bad and wrong and the root of all evil. Even the experts don't know what it is doing half the time.

Thanks for the laughs. I think you're talking to the wrong experts.
 
Are you sure? When I look at the GTX 260 and 250, it sure looks like it has one of the 64b memory channels disabled....
No, actually, you're right. It's more that I'm not entirely familiar with the products out there and I was in a hurry to look it up.

But it only changes the details of the story, not the larger lines.

What's important when it comes to redundancy is how many redundant clusters you have.

In the case of a GPU, you have two: the TPC's and the MC/ROPs. In each of them, you can correct one or two defects. That's very valuable, of course, but it pales in comparison to something like an FPGA.

Let's do some freewheeling.

The Stratix IV has 650k LE's and 1360 mults.

As a first order approximation, assume that the memory blocks have 100% yield: they have their own ram repair, so it's probably not far from the truth. (I've never seen a die shot of an FPGA, but it's fair to assume that FPGA's have a higher memory content per area than a GPU. Memory is more sensitive to defects, but it's also easier to repair. I believe that the net effect is towards higher yield overall.)

Let's also assume that the mults and the LE's are grouped together and treat them as such: 1360 MLE blocks of 1 mult and ~500 LE's each.

Altera is aware that bleeding edge processes can have high defect densities, so they're probably going to design it such that there are explicit redundant MLE's that cannot be used functionally. Say they decide that a 5% overhead is acceptable.

They'll group the 1360 functional MLE's in 68 clusters of 20 MLE's each and add 1 redundant MLE to each cluster.

While in the GPU case you have only 2 clusters and disabling 1 unit is definitely more costly than 5% per cluster, here they have 68 clusters at a cost of only 5%. Each of those clusters is capable of working around at least 1 defect (more if all defects fall in the same MLE, which is also true in the GPU case).

You see, in this particular case, things are very much in favor of FPGA. So much so that I suspect that the overhead is probably lower than 5%. Even at 1%, they still have 15 redundant clusters, 7 times more than our GPU.

It's really no wonder that they claim yields are good. Anything else would be a shocking indictment of the process quality.
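Working that grouping arithmetic through (note that 1360 / 20 comes out to 68 clusters), using only the numbers already assumed above:

```python
# Back-of-envelope check of the MLE clustering sketched above.
mults = 1360
les_per_mle = 650_000 // mults           # ~478 LE's grouped with each mult
mles_per_cluster = 20
clusters = mults // mles_per_cluster     # 1360 / 20 = 68 clusters

# One spare MLE per cluster: 68 spares on top of 1360 functional MLE's.
overhead = clusters / (mults + clusters)  # ~4.8% area overhead
```

So roughly 5% overhead buys at least one repairable defect in each of 68 independent clusters, versus the GPU's handful of coarse disable options.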
 

I totally agree with your reasoning here : )

DK
 
And this concludes my rant on why OOP is just plain bad and wrong and the root of all evil. Even the experts don't know what it is doing half the time.
Speaking as a professional OO programmer, in a sense you're right that we don't know what's going on a lot of the time, but that's kind of the point. A well-written software component does what it is advertised to do, and you don't need to know how it does it. If you do need to know what's going on inside an object, that indicates it hasn't been written correctly. Not knowing what is going on is thus the highly desirable goal of the exercise. :)
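A toy sketch of that point, using a hypothetical class on the thread's own theme: callers use the advertised interface and never need to know the representation behind it.

```python
# Encapsulation in miniature: the interface is advertised, the storage
# choice is a private detail callers never touch. Hypothetical example.
class DefectMap:
    """Tracks defective units on a die; how is nobody's business."""
    def __init__(self):
        self._bad = set()        # could be a bitmap, a list, anything

    def mark_bad(self, unit):
        self._bad.add(unit)

    def is_usable(self, unit):
        return unit not in self._bad

dm = DefectMap()
dm.mark_bad(3)
```

Swapping the set for a bitmap later changes nothing for callers, which is exactly the "not knowing is the point" argument.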
 
One pointer that GT300 is not ready is that all announced DX11 games have some sort of ATI/AMD sponsorship on them. This is a first for nV marketing.

When will DX11 games appear? When DX11 is available. Most will be patched-up DX10.1 titles, and all current news leads us to believe that there will be DX11 hardware available at the launch of DX11 on which we can play DX11 games.
This includes a new AAA title.
 