NVIDIA GF100 & Friends speculation

I'm no process expert, but personally I don't know that it increases die size at all. The vias are the connections between the metal layers, so increasing via redundancy would likely change how we handle the metal layers; if so, the increased cost would probably come with additional metal layers.
Ok, from Anand's piece I was under the impression that die size went up with doubled vias in RV740. Must've misunderstood that part then.

Not sure about Vantage, but curiously the earlier 3DMarks' "Single Texture Filtering" test always ended up being a bandwidth and integer blending test more than anything else.
I know, but those didn't use FP16 blending AFAIK.
 
NVidia's architecture, with its hot clock, seems to require custom implementation for those parts of the die at TSMC. Though I'm not sure of the extent of that. That's more difficult than going fully synthesisable is it not?
We don't know to what extent the design is custom. But the speeds are too low for it to be full custom the way CPUs are. They also have too many versions of similar designs. You don't do that with full custom.
You like to assert there's no causation. Well feel free to provide an argument against the repetitions, rather than hand waving.
I've given plenty of reasons why designs can be late. You call this hand waving. I call it experience.
I'll ask again: feel free to explain why NVidia has consistently struggled with chips that aren't feature increments (e.g. GT200b is A3), let alone the feature incrementing chips, in the same period on the same fab's nodes that AMD has executed on, usually in advance of NVidia.
I don't know. I'm sure they're not happy about it. The fact that I don't know doesn't make me want to come up with theories that make me look dumb.
If those two vias are for the same signal (which they are), you can probably cut it a bit fine.

I highly doubt it will cause too much bloat.
The density of standard cell designs is determined by the ability to wire the cells together. If you're going to use two vias instead of one, you necessarily take up some resources on the metal layers above and below, which decreases your routing density. So you have to space cells further apart and area increases.
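As a rough back-of-envelope of why that costs area (the utilisation and blockage numbers below are just illustrative guesses, not anything from a foundry):

```python
# Rough back-of-envelope for via doubling vs routing density.
# All numbers below are illustrative assumptions, not foundry data.

tracks_utilised_single_via = 0.70   # assumed routing-track utilisation with single vias
extra_blockage_double_via = 0.05    # assumed extra track blockage from redundant via landings

# If the router has to place the same wires plus the extra blockage, demand on
# the metal layers rises, so cell rows must spread out to stay routable;
# area scales roughly with that demand.
area_scale = (tracks_utilised_single_via + extra_blockage_double_via) / tracks_utilised_single_via
print(f"Estimated area increase: {(area_scale - 1) * 100:.1f}%")   # ~7% with these guesses
```
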
Ok, G80 looked flawless because the competition was so bad at the time, so Nvidia did not have to be too aggressive with the clocks. Now if ATI had messed up again, the GF100 would probably look flawless also. Nvidia themselves have hinted at some of the issues: the via error rates they were getting were too high, and so were the yield problems, but because TSMC is an important partner they toned it down a notch or two.
I didn't say anything about G8x/G9x/... vs competition. (Let alone anything about performance which is architecture related.) But if you must: G80 was 9 months earlier than R600. G84 and G86 were on the market before R600. G92 and GT200 were on the market before RV670 and before RV770, if only by a few weeks.

I'm sure their internal roadmaps were more aggressive. They are at all companies, if only to 'encourage' employees to work harder. The problem of Nvidia is that they have been facing the rare competitor that has executed to perfection with an architecture that's simply more area efficient.

As for the vias: that's much easier to fix than channel length variability. In that same article, TSMC was quoted as saying it was a teething problem that was solved later. I don't find that very hard to accept.
 
GF100 has a Texture Address/Filtering count of 64/256 compared to GT200's 80/80.
The single table mentioning 256 TF for Fermi meant 256 Texture Fetches per clock. So no, there are no 256 filtering units.
I may be wrong, but you need 4 L/S units for one bilinear sample. So for graphics the GF100 has only 4 filter units per SM.
But the L/S units in the SMs most probably don't do the texture fetches; they just pass texture instructions through to the TMUs (which have separate TA units) and write the filtered values into the register files later. The L/S units are for access to the linear global memory through the L1 cache; the texture units have their own texture L1, accessed by the TUs.
According to the Fermi graphics whitepaper, Fermi can deliver 64 filtered texels, and unfiltered texels from at most 128 individual addresses (with 4-offset gather4, otherwise from 64 addresses), per half hot clock. But that doesn't exclude efficiency improvements like delivering 64 trilinear filtered texels per clock or something like that, if the texture cache has enough bandwidth.

Edit:
That was from memory. As Sontin just quoted below, the whitepaper doesn't mention the half hot clock specifically. And some of the information comes from answers Damian Triolet got from nv.
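For a rough sense of scale, here's the bilinear rate at a couple of candidate texture clocks. The TMU count follows from the 4 GPCs x 4 SMs x 4 TMUs layout, but both clock values below are pure assumptions, only there to frame the half-hot-clock question:

```python
# Bilinear texel rate for GF100 at two candidate texture clocks.
# TMU count per the whitepaper layout; both clocks are assumed, not confirmed.

tmus = 4 * 4 * 4                 # 4 GPCs x 4 SMs x 4 TMUs = 64
core_clock_mhz = 650             # assumed ROP/core clock, illustrative only
hot_clock_mhz = 1400             # assumed shader (hot) clock, illustrative only

for label, tex_clock_mhz in [("core clock", core_clock_mhz),
                             ("half hot clock", hot_clock_mhz / 2)]:
    gtexels = tmus * tex_clock_mhz / 1000    # one bilinear texel per TMU per clock
    print(f"{label:>15}: {gtexels:.1f} GTexels/s bilinear")
```
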
 
Purely in theory: what speaks against having the TAs run at exactly half the hot clock and the TFs at the hot clock?

The problem with all the answers from nVidia is that I don't see the filter units running at the hot clock.
But they mentioned a few times that the clock will be higher than today...

What was the technical justification for the use of 64 TMUs in the full GF100 design, when the full GT200 used 80?

  • In the GF100, each of the four graphics processing clusters (GPCs) contains four streaming multiprocessors (SMs), which are individually equipped with four dedicated texture mapping units (TMUs), totaling 64 across the chip. Previously, texture engines were contained in a separate subcomponent block that forced all three SMs within the processing cluster to share the same set of texturing units. To optimize texture performance, the texture engines and caches were moved within the SMs; efficiency has been further improved by redesigning the L1 cache, unifying the L2 cache, and increasing the texture engine/cache operating frequency coefficient to allow for operating speeds greater than the ROP (core) clock [actually both texture and ROP are now running at a higher clock than before]. Moreover, the TMUs within the GF100 support new DirectX 11-standard texture compression formats that reduce memory consumption in HDR rendering environments. In addition, I would say that in real-world measured performance, both in terms of synthetic tests (i.e. the 3DMark Vantage texture test) and texture-limited frames in games, GF100 outperforms GT200. In effect, GF100 does more with less.
http://forums.nvidia.com/index.php?showtopic=159270&view=findpost&p=1003188

And from the whitepaper:
The goal with GF100 was to improve delivered texture performance through improved efficiency. This was achieved by moving the texture units within the SM, improving the efficiency of the texture cache, and running both the texture units and texture cache at a higher clock speed.
...
The texture units on previous architectures operated at the core clock of the GPU. On GF100, the texture units run at a higher clock, leading to improved texturing performance for the same number of units.

It would be very stupid if the "TU clock" were slower than on the GTX 285. Maybe there will be no "hot clock/2", because I can't find anything about it.
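A quick sanity check on that worry, using only the public GTX 285 specs (nothing here assumes GF100's actual texture clock):

```python
# What GF100's texture clock would have to be just to match a GTX 285 on
# raw bilinear rate. GTX 285 figures are public specs.

gtx285_tmus, gtx285_core_mhz = 80, 648
gtx285_gtexels = gtx285_tmus * gtx285_core_mhz / 1000      # ~51.8 GTexels/s

gf100_tmus = 64
required_mhz = gtx285_gtexels * 1000 / gf100_tmus          # ~810 MHz
print(f"GTX 285: {gtx285_gtexels:.1f} GTexels/s bilinear")
print(f"GF100 needs a texture clock of ~{required_mhz:.0f} MHz to match it")
```
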
 
It's interesting they cut back on their TMUs and everyone is talking about texturing performance now. Since the G80 pretty much has no hit from AF and filtering, I don't think it's a big deal that they cut back.
 
GF100 gives the appearance of needing a B refresh to achieve decent performance/yields. Such a refresh (if it happens) makes it 3 to 4 quarters late.

Anyway, I'm not counting chickens till the damned thing has been on the market a while. Demand will be "insane" if it's at all good, so it'll be a while before we know whether NVidia can keep up with demand. Then we'll get a feel for whether it's yielding well.

Of course if the reviewed chips are as bad as Charlie asserts then the case will be closed. I don't believe the "5% on current games" thing.

(I don't think texturing capability is going to kill performance, though being 59% of HD5870's theoretical does cause some qualms - I'm assuming NVidia's managed a monster boost in efficiency there and most games seem to show little dependency on texturing. Also, ROP performance - Z rate specifically - appears to be considerably better in GF100, and current games tend to indicate this is where most pain lies.)
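(Working that 59% backwards, purely as arithmetic on the quoted percentage and HD5870's public specs, just to see what texture clock it implies:)

```python
# Working the "59% of HD5870's theoretical" texturing figure backwards.
# HD5870 numbers are public specs; the implied GF100 texture clock is
# nothing more than arithmetic on the quoted percentage, not a known spec.

hd5870_gtexels = 80 * 850 / 1000          # 68.0 GTexels/s bilinear
gf100_gtexels = 0.59 * hd5870_gtexels     # ~40.1 GTexels/s
implied_tex_clock_mhz = gf100_gtexels * 1000 / 64
print(f"~{gf100_gtexels:.1f} GTexels/s, i.e. ~{implied_tex_clock_mhz:.0f} MHz on 64 TMUs")
```
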

NVidia's architecture, with its hot clock, seems to require custom implementation for those parts of the die at TSMC. Though I'm not sure of the extent of that. That's more difficult than going fully synthesisable is it not?

G94 is the only chip from the last few years that NVidia's apparently delivered "on time". NVidia has also cancelled two chips (GT212 and GT214 - a third if we count G88 which I'm still not sure about). The hot-clock based architecture appears to be making things quite difficult for NVidia. In the same period ATI chips with greater feature increments (D3D10.1, two variations of LDS, GDDR5) and higher performance have shown considerably less susceptibility to delays - with RV740 having the worst problems.

You like to assert there's no causation. Well feel free to provide an argument against the repetitions, rather than hand waving.


I never said it was unmanufacturable, I said Charlie's theory appears to hold some water, emphasis on "some".

I'll ask again: feel free to explain why NVidia has consistently struggled with chips that aren't feature increments (e.g. GT200b is A3), let alone the feature incrementing chips, in the same period on the same fab's nodes that AMD has executed on, usually in advance of NVidia.

Apart from the difficulties of custom design, the other factors I can think of include packaging-related stuff (bump-gate) and NVidia's apparent reluctance (or inability) to be first to a node. Though NVidia did boast that it would be first to 40nm, I'm not quite sure why - unless it was an attempt to assuage rumblings that 40nm was going to be a problem and NVidia wanted to keep Wall Street off its back by saying it was ahead of AMD for 40nm.

Jawed


Jawed, are you actually saying Charlie turned you :D Ok, I'm not saying any more, just think about it a little bit. And never talk about the bump-gate thing, because he did state every damn card that came out of nV had that problem, and it was only 2 lines.
 
NVidia's alluded to problems in implementing the distributed setup scheme. Is it possible that metal spins can reduce these problems?

Jawed
Metal spins can fix these problems if they are minor functional bugs. Designs frequently have bugs of this variety. Major issues would likely require a base spin.
 
We don't know to what extent the design is custom. But the speeds are too low for it to be full custom the way CPU are. They also have too many versions of similar designs. You don't do that with full custom.
AFAIR, the SMs are full custom. Not 100% sure on it though. Will post the link when I can find it again.

The density of standard cell designs is determined by the ability to wire the cells together. If you're going to use two vias instead of one, you necessarily take up some resources on the metal layers above and below, which decreases your routing density. So you have to space cells further apart and area increases.

It'll have less effect if that bit of logic was custom designed, wouldn't it?

I'll ask again: feel free to explain why NVidia has consistently struggled with chips that aren't feature increments (e.g. GT200b is A3),

This raises an important question. If metal spins fix logical bugs, not leakage, not power, not clocks, not yields and GT200 didn't have any showstopping bugs (it did ship on 65 nm after all), then why would they need 3 metal spins to shrink it?
 
Oh dear.... :rolleyes:

Looks like Jawed was right. Somebody needs to put "Graphics and Compute/GPGPU are joined at the hip :yep2:" in his sig.

Any volunteers? ;)
So this is code for "swaaye, you're an idiot, and I shall make a spectacle of you LOLZ!".

Are you saying that the needs of CUDA/friends are so perfectly in alignment with graphics that the significant focus on the former in the new chip will not impact its efficiency on graphics? Forgive me if I don't go read all ~200 pages of the hot topic rant/speculate/ridicule threads.
 
@ silent guy

So it is possible that NV and TSMC expected to fix the rumoured GF100 512SP part problems with metal spins. And only after A3 deemed it unavoidable to do a B1/2.
 
It'll have less effect if that bit of logic was custom designed, wouldn't it?
I don't know.

This raises an important question. If metal spins fix logical bugs, not leakage, not power, not clocks, not yields and GT200 didn't have any showstopping bugs (it did ship on 65 nm after all), then why would they need 3 metal spins to shrink it?
Noise, IR drop or analog fixes? But, yeah, it's an interesting question. It must be that the architecture is unmanufacturable... ;)
 
Are you saying that the needs of CUDA/friends are so perfectly in alignment with graphics
Yep, D3D11 requires compute and there are games on the market with compute.

that the significant focus on the former in the new chip will not impact its efficiency on graphics?
Compute makes graphics faster, so the impact is positive ;)

Jawed
 
@ silent guy

So it is possible that NV and TSMC expected to fix the rumoured GF100 512SP part problems with metal spins. And only after A3 deemed it unavoidable to do a B1/2.
Even if they saw after A1 that they couldn't reach the speeds they wanted and decided right away that a B1 could improve things, they'd still want to fix the logic bugs in metal first. It takes at least 4 months to go from a netlist to tape-out, another 2 months until silicon, and only then can you start with qualification again.
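Putting rough numbers on that sequencing (the 4 + 2 months is the figure above; the metal-spin turnaround is only an assumed value for illustration):

```python
# Rough timeline sketch for the "fix metal first, then B1" sequence.
# The 4 + 2 month base-spin figure comes from the post above; the metal-spin
# turnaround is an assumption purely for illustration.

metal_spin_months = 2        # assumed mask + fab turnaround per metal-only respin
base_spin_months = 4 + 2     # netlist to tape-out, then tape-out to silicon (as quoted)

a2_silicon = metal_spin_months           # months after A1 silicon
a3_silicon = 2 * metal_spin_months
b1_silicon = base_spin_months            # if the B1 effort started right after A1
print(f"A2 ~{a2_silicon} mo, A3 ~{a3_silicon} mo, B1 ~{b1_silicon} mo after A1 silicon")
```
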
 
It's interesting they cut back on their TMUs and everyone is talking about texturing performance now. Since the G80 pretty much has no hit from AA and filtering, I don't think it's a big deal that they cut back.

where on earth does this idea of no hit come from?

[Benchmark charts, relative performance at 1600 without AA vs with AA:
Prey: 100% -> 73%
Half-Life 2 Episode One: 100% -> 80%
Age of Empires 3: 100% -> 61%]
 
Even if they saw after A1 that they couldn't reach the speeds they wanted and decided right away that a B1 could improve things, they'd still want to fix the logic bugs in metal first
It just depends on their level of confidence: they had confidence in functional first silicon (at least publicly). Why wouldn't they have confidence in having a complete view of the necessary fixes the second time around? At worst it just means they wasted another couple of million ...

BTW, what is your 4-month netlist-to-masks figure based on? There's fabless and then there's fabless; a company which doesn't do everything in house right up to the mask files, for instance, can't hope to get anywhere near to NVIDIA.
 
IIRC, historically NVIDIA have operated at half the blend rate in many cases; that's not the case with Cypress.
Okay, that changes the theoretical rate, but not the achievable fillrate. 153 GB/s will only get you 19 GPix/s with FP16 or blended RGBA8, and that's assuming zero bandwidth for the Z-buffer and perfect colour compression with AA. Add those in along with less than perfect transfer efficiency and it's unlikely that Cypress benefits at all from having full-speed blending or FP16 in its ROPs. Well, maybe it helps make up for rasterization inefficiency or discarding pixels from failed alpha tests.

I'm actually rather surprised that ATI didn't use half speed blending or FP16 given how much they wanted to keep die size in check, though I can see how it's just fixed function, fixed data path math logic that's too small to worry about.
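The bandwidth arithmetic behind that 19 GPix/s figure, with the same caveats (no Z traffic, perfect transfer efficiency):

```python
# Bandwidth-limited fill rates for Cypress's 153 GB/s, ignoring Z traffic
# and assuming perfect transfer efficiency, as stated above.

bandwidth_gb_s = 153.0
cases = {
    "FP16 RGBA write":           8,   # 8 bytes written per pixel
    "RGBA8 blend (read+write)":  8,   # 4 bytes read + 4 bytes written
    "FP16 RGBA blend":          16,   # 8 bytes read + 8 bytes written
}
for name, bytes_per_pixel in cases.items():
    print(f"{name:>26}: {bandwidth_gb_s / bytes_per_pixel:.1f} GPix/s")
```
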
 