NVIDIA GF100 & Friends speculation

I don't see the relevance of your link.

Jawed was suggesting that the whole G80-based architecture was fundamentally unmanufacturable because only G80 was supposedly on time. The fact that G80 was a seriously large die itself already contradicts that very statement, but never mind: the idea that smaller to very small versions (G84, G86, G98) were late because of the architecture is laughable and a poster child of a 'correlation doesn't mean causation' argument. (Charlie is way better at this, though, see the R&D story, which must have been the most embarrassing article ever on his website.)

We don't know why they were late (if they were: do you know the internal roadmap?), but unless there are serious process issues (and 40nm is the only one in recent history where that was the case), lateness alone tells you nothing about the architecture.

The list is endless. Frankly, I wouldn't even know how to design a chip with an architecture that's somehow fundamentally unmanufacturable even though the first (large) version comes out flawless. I would love to hear specific details from Jawed about exactly what would make an architecture unmanufacturable. And how GDDR5 fits in that picture is a similar mystery.
OK, G80 looked flawless because the competition was so bad at the time that NVIDIA did not have to be too aggressive with the clocks. Now, if ATI messed up again, GF100 would probably look flawless as well. NVIDIA themselves have hinted at some of the issues: the via error rates they were getting were too high, and so were the yield problems, but because TSMC is an important partner they toned it down a notch or two.
 
But why should the Tesla cards have a higher clock rate than the consumer products?
Wait for products before arguing over final clocks? You're taking the pre-announced Tesla clocks as indicative of consumer clocks. Since the rumours are all over the place (e.g. GTX480 is now rumoured to be a full set of 16 multiprocessors, not cut-down) there's nothing to conclude. If the chip was on time you might have a decent argument in the face of wild rumours.

And a lot of ALUs, too. The ALU:TMU ratio of R600 and RV770 is the same.
So, what, as AlexV asked? AMD added them - indeed AMD re-balanced the texturing quite considerably, specifically because R600's texturing was ill-balanced. So you literally have no point when you try to suggest that AMD did not recognise the importance of increased texturing performance in R600's successors as both performance and efficiency were greatly increased.

Jawed
 
GF100 has a texture address/filtering split of 64/256, compared to GT200's 80/80. So that's not exactly a step back, although they touted the higher-clocked texture units in each GPC, which isn't happening at 600 MHz.

I may be wrong, but you need 4 L/S units for one bilinear sample. So for graphics GF100 effectively has only 4 filter units per SM, 64 in total.

Wait for products before arguing over final clocks? You're taking the pre-announced Tesla clocks as indicative of consumer clocks. Since the rumours are all over the place (e.g. GTX480 is now rumoured to be a full set of 16 multiprocessors, not cut-down) there's nothing to conclude. If the chip was on time you might have a decent argument in the face of wild rumours.

The only official numbers are the Tesla clocks. And I don't think that the GeForce products will have lower clocks and a higher TDP.

So, what, as AlexV asked? AMD added them - indeed AMD re-balanced the texturing quite considerably, specifically because R600's texturing was ill-balanced. So you literally have no point when you try to suggest that AMD did not recognise the importance of increased texturing performance in R600's successors as both performance and efficiency were greatly increased.

Jawed
They went back to INT8 units and increased the count to 40 TMUs. But they also increased the ALUs to 160 Vec5 units. It stays at the same ratio. So AMD didn't see any special importance in TMUs.
 
I may be wrong, but you need 4 L/S units for one bilinear sample. So for graphics GF100 effectively has only 4 filter units per SM, 64 in total.

I was just pointing out that they changed the whole design with 64/256 vs GT200's 80/80, and that's not the same as downgrading to 64/64, as some people stated.
The target clocks were 1500/750, so at 750 MHz it would still have 48 GTexel/s, while all the quad filter units share the 768 KB L2 cache besides their own 12 KB each.
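For what it's worth, the arithmetic behind those two figures (treating GF100 as 16 SMs with 16 L/S units each, and taking the 4-L/S-per-bilinear claim above at face value):

$$\frac{16 \times 16\ \text{L/S units}}{4\ \text{L/S per bilinear}} = 64\ \text{samples/clk},\qquad 64 \times 750\ \text{MHz} = 48\ \text{GTexel/s}$$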

The pixel fillrate could be a bigger problem there. With 48 ROPs at, let's say, half hot clock of 600 MHz, it has 28.8 GPix/s. RV870 has 27.2 GPix/s at 850 MHz; at 1000 MHz it would have 32 GPix/s.
No wonder if GF100 can't push more fps in the average game.
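Spelling out the arithmetic (using 32 ROPs for RV870 and one pixel per ROP per clock):

$$48 \times 0.6\ \text{GHz} = 28.8\ \text{GPix/s},\qquad 32 \times 0.85\ \text{GHz} = 27.2\ \text{GPix/s},\qquad 32 \times 1.0\ \text{GHz} = 32\ \text{GPix/s}$$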
 
Tessellation won't save the 480, as real games put a strain on the shaders, which lack the power for both shading and tessellation. In real games the 5870 will walk all over Fermi in those situations - if you follow Charlie's logic.
This is just typical dumbass logic from Charlie when he doesn't know what he's talking about.

Pixel shader load is very low in segments of the scene where triangle count and tessellation are high. Cypress can only push half a tessellated vertex per clock (since each vertex generates two triangles to be set up). Even if you need 100 flops per vertex, it would take a mere 5% of Fermi's shading power to match that speed. Okay, maybe 7% including clock speed differences, but there's plenty of room for Fermi to take advantage of its 3-4x setup rate increase and smoke Cypress.
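Back-of-the-envelope version of that, assuming Cypress at 850 MHz and Fermi somewhere around 1.3 TFLOPS (512 ALUs, MAD, roughly 1.3 GHz hot clock - my numbers, pick your own rumour):

$$0.5\ \text{verts/clk} \times 850\ \text{MHz} = 425\ \text{M verts/s},\qquad \frac{425\text{M} \times 100\ \text{flops}}{\sim\!1.3\ \text{TFLOPS}} \approx 3\%$$

Round pessimistically and you land at the 5% above.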

Math is not going to be a problem. Texturing very well could be, though. If they're only 20% faster than GT200 in internal testing, and RV790 was close to, and often faster than, GT200 in texture-heavy tests, it could be a real problem. I think we will find some theoretical tests where even the 9800+ beats Fermi due to its higher clock.
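To put hypothetical numbers on that last sentence (taking the 9800 GTX+ at its stock 64 TMUs x 738 MHz, and GF100 at 64 TMUs with the rumoured 600 MHz half hot clock):

$$64 \times 738\ \text{MHz} = 47.2\ \text{GTexel/s} \quad \text{vs} \quad 64 \times 600\ \text{MHz} = 38.4\ \text{GTexel/s}$$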
 
The pixel fillrate could be a bigger problem there. With 48 ROPs at, let's say, half hot clock of 600 MHz, it has 28.8 GPix/s. RV870 has 27.2 GPix/s at 850 MHz; at 1000 MHz it would have 32 GPix/s.
Pixel fillrate should still be an advantage, because it only really matters for bandwidth-heavy transparent pixels, and Fermi has a 384-bit bus. Not much of the workload goes into pixels simple enough to max out 28 GPix/s (FYI, that will fill a 2560x1600 screen in 150 microseconds).

But for the sake of theoreticals, remember that Fermi can only output 32 pixels per half-hot-clock. The 48 ROPs are there because they're tied to the six partitions of the 384-bit bus. So on paper, it would be 27.2 GPix/s for Cypress and 19.2 GPix/s for a 600 MHz Fermi. For the important cases of blending, FP16, and 8xAA, [strike]both cards would be on an even keel theoretically with the rates you calculated, and usually[/strike] Fermi will have the advantage in reality due to bandwidth.
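(I.e., assuming the 32-pixels-per-half-hot-clock cap: $32 \times 600\ \text{MHz} = 19.2\ \text{GPix/s}$ for Fermi against $32 \times 850\ \text{MHz} = 27.2\ \text{GPix/s}$ for Cypress.)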
 
I was just pointing out that they changed the whole design with 64/256 vs GT200's 80/80, and that's not the same as downgrading to 64/64, as some people stated.
The target clocks were 1500/750, so at 750 MHz it would still have 48 GTexel/s, while all the quad filter units share the 768 KB L2 cache besides their own 12 KB each.

Evergreen and Fermi both have quad TMUs... so RV870 still has more texture power than GF100...

Thus, the amount of texture processors in the RV870 is doubled (from 40 to 80 TMUs). The peak texture sampling performance has doubled, too. The overall architecture of texture processors seems to have been left largely intact. Each of them still consists of 16 FP32 texture fetch units, four address units and four filters.

http://www.xbitlabs.com/articles/video/display/radeon-hd5870_3.html
 
For that, Cypress will have to be setup limited in the first place.
Well, the discussion is currently about future (and a couple of current) games where tessellation slows down Cypress. In those situations it is partially setup limited.

In other words, Fermi could cut the performance impact to a half or a third of what Cypress suffers. Not that I'm saying such an attribute is enough to justify such a monstrous chip and its delays...
 
Math is not going to be a problem.
Domain-shaded vertices cost considerably more than non-tessellated vertices, per vertex. The question is "how much more?", and what proportion of that is math.

Texturing very well could be, though. If they're only 20% faster than GT200 in internal testing, and RV790 was close to, and often faster than, GT200 in texture-heavy tests, it could be a real problem. I think we will find some theoretical tests where even the 9800+ beats Fermi due to its higher clock.
Part of tessellation is "texturing" and Fermi's L1/L2 cache system should make this more efficient. There's an interesting note in the Fermi Tuning Guide:

On devices of compute capability 1.x, some kernels can achieve a speedup when using (cached) texture fetches rather than regular global memory loads (e.g., when the regular loads do not coalesce well). Unless texture fetches provide other benefits such as address calculations or texture filtering (Section 5.3.2.5), this optimization can be counter-productive on devices of compute capability 2.0, however, since global memory loads are cached in L1 and the L1 cache has higher bandwidth than the texture cache.
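To make that note concrete, here is a minimal sketch of the two load paths it is contrasting (kernel and variable names are my own, not from the guide), using the legacy texture-reference API of the period:

[code]
// Minimal sketch of the two load paths the tuning guide describes.
// copy_tex, copy_gmem, texRef and n are illustrative names only.
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texRef;

// Path 1: fetch through the texture cache. This is the trick that
// helped on compute capability 1.x when plain loads didn't coalesce.
__global__ void copy_tex(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);
}

// Path 2: plain global load. On compute capability 2.0 (Fermi) this
// is cached in L1, which has higher bandwidth than the texture cache.
__global__ void copy_gmem(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaBindTexture(0, texRef, d_in, n * sizeof(float));

    copy_tex <<<(n + 255) / 256, 256>>>(d_out, n);
    copy_gmem<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

    cudaUnbindTexture(texRef);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
[/code]

On a 1.x device the texture path could win when the plain loads don't coalesce; on compute capability 2.0 the plain load is served by L1 and should be at least as fast.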

Jawed
 
For the important cases of blending, FP16...

IIRC, historically NVIDIA have operated at half the blend rate in many cases, which is not the case with Cypress. Additionally, IIRC, FP16 is half rate for Fermi, and again that's not the case with Cypress. In these cases Fermi may be operating at a net 24 pixels per clock while Cypress is still at 32.
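If I'm reading that right, the worst case stacks up as (assuming 48 half-rate ROPs for Fermi against Cypress's 32 at full rate):

$$48 \times \tfrac{1}{2} = 24\ \text{px/clk} \quad \text{vs} \quad 32 \times 1 = 32\ \text{px/clk}$$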
 
And a lot of ALUs, too. The ALU:TMU ratio of R600 and RV770 is the same.
Removal of interpolation on Cypress has actually increased the ratio on the texturing side to some extent. Time and applications have also moved on and their usage of texture and math has as well.

When you dig into the surface format support differences between R600, RV770 and Cypress you'll also find that there are a lot of differences.
 
I find that hard to believe, given Anand's reporting of Cypress being trimmed down from ~480ish mm² to 340ish already. If the former is without doubled vias and the latter with them in place, they must have removed half the chip to do so.
Anand was talking about the engine configuration there. Increasing the vias is part of the "Design for Manufacturing" process we have.
 
Anand was talking about the engine configuration there. Increasing the vias is part of the "Design for Manufacturing" process we have.
So, if I got it right this time: the first estimate of ~480 mm² was with the original engine configuration and sideport in place, and that was trimmed down to below 334 mm². Increasing the vias then brought the number back up to 334 mm²?

IIRC, historically NVIDIA have operated at half the blend rate in many cases, which is not the case with Cypress. Additionally, IIRC, FP16 is half rate for Fermi, and again that's not the case with Cypress. In these cases Fermi may be operating at a net 24 pixels per clock while Cypress is still at 32.

Would I be able to observe this in one of 3DMark Vantage's Feature Tests?
 
IIRC, historically NVIDIA have operated at half the blend rate in many cases, which is not the case with Cypress. Additionally, IIRC, FP16 is half rate for Fermi, and again that's not the case with Cypress. In these cases Fermi may be operating at a net 24 pixels per clock while Cypress is still at 32.
It's possible, but I thought that those units were made the way they are due to memory bandwidth limitations, such that it doesn't make a whole lot of sense to make FP16 as fast as 8-bit integer.
 
I may be wrong, but you need 4 L/S units for one bilinear sample. So for graphics GF100 effectively has only 4 filter units per SM, 64 in total.

Purely in theory: what exactly speaks against having the TAs run at half the hot clock and the TFs at the hot clock?
 
Increasing the vias then brought the number back up to 334 mm²?
I'm no process expert, but personally I don't know that it increases die size at all. The vias are the connections between the metal layers, so increasing via redundancy would likely change how we handle the metal layers; if so, the increased cost would probably come with additional metal layers.

Would I be able to observe this in one of 3DMark Vantage's Feature Tests?
Not sure about Vantage, but curiously the earlier 3DMarks' "Single Texture Filtering" test always ended up being a bandwidth and integer blending test more than anything else.
 