NVIDIA GF100 & Friends speculation

It just depends on their level of confidence, they had confidence in functional first silicon (at least publicly).
When we say 'the first silicon came back from the fab and it's functional', we understand this as 'it more or less seems to do what it has to do in the lab'. That's a far cry from 'ready for production'. It's not unusual to start believing an A1 production silicon is possible after a week or two or three (often by de-featuring some logic with bugs), but almost inevitably some kind of must-fix bug will pop up eventually.

BTW, what is your 4 month netlist to masks based on?
My project schedules of the last 5 years. ;)

There's fabless and then there's fabless: a company which doesn't do everything in house right up to the mask files, for instance, can't hope to get anywhere near NVIDIA.
There aren't many fabless companies left that don't do their own backend. All the big ones do it, and most of the smaller ones too.
 
Let's wait till next month and really find out what is going on, instead of speculating based on tainted articles ;)

Like xman26's posts about 99% only working as 470s? I wouldn't consider him red-tainted, at least...

Even if they saw after A1 that they couldn't reach the speeds they wanted and decided right away that a B1 could improve things, they'd still want to fix the logic bugs in metal first. It takes at least 4 months to go from a netlist to tape-out, another 2 months until silicon, and only then can you start with qualification again.

Would it make sense to start the process on B1 right away (and incorporate the things learned from A2/A3 during the process), while still trying to at least improve yields with A2/A3 as a temporary solution?
 
It's interesting that they cut back on their TMUs and everyone is talking about texturing performance now; since the G80 pretty much takes no hit from AF and filtering, I don't think it's a big deal that they cut back.
Remember that G80 had twice as many filter units as address units. It took a smaller hit from AF than G92 did because G92 was faster without AF, despite being more compact than G80.

Part of tessellation is "texturing" and Fermi's L1/L2 cache system should make this more efficient. There's an interesting note in the Fermi Tuning Guide:
Yeah, I've seen that in the previews, and I've seen NVidia's charts claiming a 40-70% increase over GT200. It's great that they got more out of less, but the jump is still a lot less than the 100% (and sometimes greater) increase that RV870 has over RV790 in texturing, and RV790 held its own against GT200 in the texturing tests of Vantage or RightMark.

I was just pointing out that they changed the whole design with 64/256 vs GT200's 80/80, and that's not the same as downgrading to 64/64 like some people stated.
I don't know if 64/256 is an accurate description, despite it being listed that way on Anandtech. If it could do 256 filtered samples per clock, there would be some mention of it having ultra-fast AF. Tech-report is saying that it can only do 64 filtered texels per cycle. If it could do 256 point samples per clock, then I'd expect it would be more than just 2x as fast as GT200 in jittered sampling.
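
A quick back-of-the-envelope on what 256 fetches per clock would actually imply, using the unit counts above; the clocks are my assumptions (602 MHz is the GTX 280 texture clock, 600 MHz is only a placeholder guess for GF100):

# Back-of-envelope texel rates from the unit counts above; clocks are assumptions.
def texel_rates(fetch_per_clk, filter_per_clk, clk_ghz):
    # returns (point-sample rate, filtered rate) in Gtexels/s
    return fetch_per_clk * clk_ghz, filter_per_clk * clk_ghz

gt200_fetch, gt200_filter = texel_rates(80, 80, 0.602)    # GTX 280 texture clock ~602 MHz
gf100_fetch, gf100_filter = texel_rates(256, 64, 0.600)   # 600 MHz is only a placeholder guess

print(f"point-sample ratio vs GT200: {gf100_fetch / gt200_fetch:.1f}x")   # ~3.2x if 256/clk were real
print(f"filtered ratio vs GT200:     {gf100_filter / gt200_filter:.1f}x") # ~0.8x at similar clocks

In other words, a genuine 256 fetches per clock at similar clocks would put jittered sampling at more than 3x GT200, not the roughly 2x we've seen quoted.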
 
Like xman26's posts about 99% only working as 470s? I wouldn't consider him red-tainted, at least...
In RV770 (I don't know about RV870), ATI had 1 redundant ALU block per set of 16, which is unusable if all 16 work. If Nvidia chose not to use this kind of redundancy, it has to sacrifice functional cores instead. Different design philosophies, both targeted at improving yields, each with their trade-offs (one of them apparently being endless sniping about initial products not having all cores enabled). Why do we care?
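
To make the trade-off concrete, here's a toy yield sketch (the per-lane defect probability and block count are invented, not real RV770/GF100 figures) comparing a spare lane per 16-wide block against no spares at all:

# Toy yield sketch: invented per-lane defect probability, 10 blocks of 16 lanes.
p_bad = 0.01            # assumed chance that any one ALU lane is killed by a defect
n_blocks = 10

# (a) one spare lane per block: a block still delivers 16 lanes if at most 1 of its 17 is bad
p_block_with_spare = (1 - p_bad)**17 + 17 * p_bad * (1 - p_bad)**16

# (b) no spares: a block only delivers 16 lanes if all 16 are good; otherwise the whole
#     cluster gets fused off and the die is sold as a salvage part
p_block_no_spare = (1 - p_bad)**16

print(f"fully-enabled die, with spares:    {p_block_with_spare**n_blocks:.1%}")
print(f"fully-enabled die, without spares: {p_block_no_spare**n_blocks:.1%}")

With these made-up numbers the spare-lane scheme gets the large majority of dies fully enabled, while the no-spares scheme relies heavily on salvage parts; the real figures will differ, but the shape of the trade-off is the same.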

Would it make sense to start the process on B1 right away (and incorporate the things learned from A2/A3 during the process), while still trying to at least improve yields with A2/A3 as a temporary solution?
I don't know.
BTW, 'trying to improve yields on A3' is not the correct way of looking at it. If it happens, it's because the TSMC process matures... which is going to happen anyway. It always does.
 
I'm actually rather surprised that ATI didn't use half-speed blending for FP16, given how much they wanted to keep die size in check, though I can see how it's just fixed-function, fixed-data-path math logic that's too small to worry about.
On Cypress, FP16 blend is half rate and FP16 write is full rate; INT blend is full rate.
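
For a sense of scale, here's what those rates work out to on an HD 5870 (32 ROPs and an 850 MHz clock are the assumed board specs):

# Rough HD 5870 ROP throughput under the rates above; 32 ROPs / 850 MHz are assumed board specs.
rops, clk_ghz = 32, 0.85

int_blend  = rops * clk_ghz          # full rate  -> ~27.2 Gpixels/s
fp16_write = rops * clk_ghz          # full rate  -> ~27.2 Gpixels/s
fp16_blend = rops * clk_ghz * 0.5    # half rate  -> ~13.6 Gpixels/s

print(int_blend, fp16_write, fp16_blend)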
 
Actually, both texture and ROP are now running at a higher clock than before.
Higher than what? I thought "conventional wisdom" was pointing to ~600MHz for at least the ROP domain (and according to CD the TMUs as well).
 
I know the higher-end parts are the big news, but how about the lower-end parts? Has anyone heard anything about those?

I think lacking lower-end parts that sell in high volume for DX11 is a bigger problem than not having a high end out. For every one 5870 sold, I'm sure ATI sells five or so 5670s. The longer these low-end parts go unmatched by Nvidia, the smaller the advantage of any performance lead Nvidia might have becomes, as developers target the larger install base of DX11 chips.

Nothing has taped out, nor has any prep been made to do so, so tapeouts are unlikely to be imminent. That puts things at 6 months or so out minimum.

-Charlie
 
Go on then.........

-Charlie

Because you're calling me out (I can't sleep...):
Can you explain this?

GF100/GTX480 was not meant to be a GPU, it was a GPGPU chip pulled into service for graphics when the other plans at Nvidia failed. It is far too math/DP FP heavy to be a good graphics chip, but roping shaders into doing tessellation is the one place where there is synergy.
The same should hold true for all DX11 features, ATI has dedicated hardware where applicable. Nvidia has general purpose shaders roped into doing things far less efficiently. When you turn on DX11 features, the GT300 will take a performance nosedive, the R870 won't.
 
I still don't see it as crippling a product that was never intended to support a given feature meant for another line of products. It'd be like complaining about the power the V6 Camaro gives you instead of paying for the performance-line V8 version. The 5.7L V8 in the Camaro is an option you MUST pay for; it isn't free. Same thing here. The features he is complaining about are meant for the higher-end product.

Yes, but you can't change a fuse in the glove box and have the engine change from a V6 to a V8 now, can you?

-Charlie
 
Willard, I understand where you are coming from, but I think this discussion is actually pretty central to the GF100 speculation. One of the central questions is whether Nvidia took their eye off the prize and designed for the Tesla/Quadro space at the expense of the GeForce space, leading to a lot of their current problems.

2:1 SP to DP ratio doesn't answer that question?

-Charlie
 
Could it be that designing reticle-sized chips for a cutting-edge process is what is causing extra problems, not just normal execution ones? Similar doubts have been expressed about LRB before.

http://forum.beyond3d.com/showpost.php?p=1395490&postcount=5517

Intel has much more experience designing reticle sized chips, and has shown the ability to do so in volume, with high yields. They did so for several Itanium chips, and Becton is another good example. Intel is one of the best, if not the absolute best, at large die chips on a cutting edge process.

Nvidia has an issue with bumps cracking that they still can't admit the extent of.

-Charlie
 
BTW, what is your 4 month netlist to masks based on? There's fabless and then there's fabless: a company which doesn't do everything in house right up to the mask files, for instance, can't hope to get anywhere near NVIDIA.

He's probably talking about netlist to tape-out plus fracture. Fracture alone can take quite some time of pure number crunching, but it is rare (I've never heard of any fab allowing a customer to do their own fracture) for a design group to be doing it.

Assuming he's talking about synth/cell design: once you have the netlist, you have to do P&R, then cap extraction, feed that back into P&R, send that back through cap extraction, and if that checks out, send it through DRC. Then you do the full-chip integration, send the whole thing through DRC and any other checks the fab requires, encrypt the whole DB, and send it to the fab for fracture. Once at the fab, they decrypt the DB and start fracture to generate the mask sets for each layer; once the first layer of masks is done (i.e. the transistor layer), they can start putting the wafers through.

Even if you are just doing an incremental netlist change, you still have to go through the whole backend of an incremental P&R and all the DRC. Four months seems long to me, but since a lot of the work is compute-bound, it really depends on what compute resources you can throw at it.
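
Just to illustrate how that flow stacks up into a multi-month schedule, here's a toy timeline; the week counts are invented, not anyone's real schedule:

# Toy netlist-to-masks timeline; the week counts are invented, just to show how it stacks up.
flow = [
    ("place & route",                  4),
    ("cap extraction <-> re-P&R loop", 3),
    ("block-level DRC",                2),
    ("full-chip integration",          2),
    ("full-chip DRC / fab checks",     2),
    ("encrypt + ship DB to fab",       1),
    ("fracture + first mask layers",   3),
]

total_weeks = sum(weeks for _, weeks in flow)
print(f"~{total_weeks} weeks (~{total_weeks / 4.3:.1f} months) before wafers can start")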
 
I don't know if 64/256 is an accurate description, despite it being listed that way on Anandtech. If it could do 256 filtered samples per clock, there would be some mention of it having ultra-fast AF. Tech-report is saying that it can only do 64 filtered texels per cycle. If it could do 256 point samples per clock, then I'd expect it would be more than just 2x as fast as GT200 in jittered sampling.

From what I've read of the descriptions, it seems it can do a maximum of 256 texel fetches per clock. But it is likely that there are address limitations on those fetches with respect to locality. If so, the achieved texel fetches per cycle could be highly dependent on the texture surface type and the program requesting the textures.
 
2:1 SP to DP ratio doesn't answer that question?
No, not really. All the data paths are the same and you're just adding a little math in there. If ATI can do 640 DP adds per clock and 320 DP multiplies in a 334 mm2 chip at little cost, NVidia is barely increasing cost by putting 256 DP FMAs in a >500 mm2 chip. Neither is investing much into DP; NVidia PR just goes on about it because they have a few extra features in IEEE compliance. Moreover, the multiple-triangles-per-clock setup shows that they put a lot of thought and time into gaming performance.
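
Rough peak numbers from those per-clock figures, treating the multiply rate as an FMA rate and assuming 850 MHz for Cypress and the rumoured ~1.4 GHz shader clock for Fermi:

# Peak DP rates implied by the per-clock figures above; clocks are assumptions.
cypress_dp_fma_per_clk = 320
fermi_dp_fma_per_clk   = 256

cypress_gflops = cypress_dp_fma_per_clk * 2 * 0.85   # ~544 DP GFLOPS in ~334 mm2
fermi_gflops   = fermi_dp_fma_per_clk   * 2 * 1.40   # ~717 DP GFLOPS in >500 mm2

print(cypress_gflops, fermi_gflops)

Neither figure suggests a chip that was re-architected around DP at the expense of everything else.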

Unfortunately, their architecture is just not very competitive now, particularly with their inability to reach G92 clock speeds. Fermi was probably mostly designed by the time RV770 came out and completely shattered their perception of the kind of perf/$ that ATI was capable of. Now we find out that ATI has a competitive edge in process knowhow, too.

Let's see what happens with NVidia's next gen, which is truly post-RV770. We've already seen how NVidia is planning to keep GT2xx as their mainstream this generation due to their underestimation of ATI.
 
From what I've read of the descriptions, it seems it can do a maximum of 256 texel fetches per clock. But it is likely that there are address limitations on those fetches with respect to locality. If so, the achieved texel fetches per cycle could be highly dependent on the texture surface type and the program requesting the textures.
Okay, makes sense. I guess expecting greater than 2x GT200 for jittered point samples is a bit much given the lower TU count and somewhat cache-unfriendly nature of such sampling.

It's funny how this feature, along with the drastic shift in ALU:TEX, is what we saw in R600...
 
No, not really. All the data paths are the same and you're just adding a little math in there. If ATI can do 640 DP adds per clock and 320 DP multiplies in a 334 mm2 chip at little cost, NVidia is barely increasing cost by putting 256 DP FMAs in a >500 mm2 chip. Neither is investing much into DP; NVidia PR just goes on about it because they have a few extra features in IEEE compliance. Moreover, the multiple-triangles-per-clock setup shows that they put a lot of thought and time into gaming performance.

A lot of this depends on granularity. Multiplies have the greatest area expansion, and it looks like AMD specifically designed the multipliers so they didn't add a significant area increase: by reducing the delivered DP throughput to the point where they could utilize only the existing hardware, they avoided adding extra hardware support. In contrast, it seems that Fermi maintains a 2:1 mul ratio, which requires adding significantly more hardware to the design. That is, a DP mul requires roughly 4x the area of an SP mul, whereas for an add it's only roughly 2x.
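
A rough way to see where the ~4x vs ~2x comes from, using the SP/DP significand widths (24 vs 53 bits) and assuming array-multiplier area grows roughly with the square of operand width while adder area grows roughly linearly:

# Why DP mul ~4x SP mul but DP add only ~2x: significand widths are 24 bits (SP) vs 53 bits (DP);
# an array multiplier's area scales ~quadratically with operand width, an adder's ~linearly.
sp_bits, dp_bits = 24, 53

mul_area_ratio = (dp_bits / sp_bits) ** 2   # ~4.9x
add_area_ratio = dp_bits / sp_bits          # ~2.2x

print(f"DP/SP multiplier area ~{mul_area_ratio:.1f}x, adder area ~{add_area_ratio:.1f}x")

Real implementations reuse and share hardware in different ways, so treat these as order-of-magnitude ratios rather than exact costs.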
 
A lot of this depends on granularity. Multiplies have the greatest area expansion, and it looks like AMD specifically designed the multipliers so they didn't add a significant area increase: by reducing the delivered DP throughput to the point where they could utilize only the existing hardware, they avoided adding extra hardware support. In contrast, it seems that Fermi maintains a 2:1 mul ratio, which requires adding significantly more hardware to the design. That is, a DP mul requires roughly 4x the area of an SP mul, whereas for an add it's only roughly 2x.
Okay, I suppose NVidia may not reuse as much logic, but 256 DP FMAs is still peanuts for a 3B transistor chip when all the data flow is already taken care of. I can't see one being larger than the equivalent of 10,000 full adders, so it should be well under 2% of the die space for all 256.
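
Quick sanity check on that, assuming the 10,000-full-adder upper bound and roughly 24 transistors per static CMOS full adder (both assumptions, not measured figures):

# Sanity check: 256 FMAs x the 10,000-full-adder upper bound x an assumed ~24 transistors
# per static CMOS full adder, against a ~3.0B transistor chip.
fma_units, fa_per_fma, t_per_fa = 256, 10_000, 24
chip_transistors = 3.0e9

dp_transistors = fma_units * fa_per_fma * t_per_fa
print(f"{dp_transistors / chip_transistors:.1%} of the transistor budget")   # ~2.0% at the upper bound

Even at that upper bound it's only around 2% of the transistor budget, so the real figure should sit comfortably below it.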
 