NVIDIA GF100 & Friends speculation

But wouldn't this mean that all the other GF10x GPUs have a large number of useless ROPs?

My bet is on modified ROP partitions (4 ROPs, 64KiB L2 and 64-bit each), and they just messed it up in the press material.
 
In terms of raw pixel throughput that's true. In Fermi it's no longer limited by ROP count. And that's also the reason you're seeing "wrong" pixel fillrates all over the place: they're calculated from the ROP count.
 
But if GF108 can deliver such performance with 4 ROPs, why does GF100 need 48?
Or is this more about computing, like doing atomic operations?
 
But if GF108 can deliver such performance with 4 ROPs, why does GF100 need 48?
That's what I've been wondering about for a long time... Note though that it's not that bad with GF100: 30 pixels/clock (32 for a hypothetical full chip) vs 48 ROPs. The ratio gets worse with GF104 (with 7 SMs, 14 pixels/clock vs 32 ROPs). I believe this is at least part of the reason the performance difference between the 192-bit and 256-bit cards is quite small despite the former having 1/4 less bandwidth, i.e. the additional ROPs are useless. The ratio goes to WTF? on a 192-bit GF106 (which doesn't exist, but really, 24 ROPs with pixel output limited to 8 looks totally crazy).
 
on a 192-bit GF106 (which doesn't exist, but really, 24 ROPs with pixel output limited to 8 looks totally crazy).
They exist: GTX 460M, GTS 450 OEM, GTS 440 OEM. All are 192-bit GF106, and the latter two are 144SP parts (6 pixels on 24 ROPs :LOL:).
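
To put numbers on the ratios mentioned above, here is a rough sketch. It assumes each SM can export 2 pixels per clock (inferred from "7 SM, 14 pixels/clock") and that the 144SP GF106 parts have 3 active SMs at 48 SPs each. The names `pixel_rate` and `rop_utilisation` are made up for illustration, and this only looks at raw pixel export, not blending or MSAA.

```python
# Pixel export rate vs. ROP count for the configurations discussed in this thread.
# Assumption: 2 pixels per clock per SM, inferred from "7 SM, 14 pixels/clock".
PIXELS_PER_SM_PER_CLOCK = 2

# (label, active SMs, ROPs). The 3-SM entry assumes 144 SPs / 48 SPs per SM.
CONFIGS = [
    ("GF100, 15 SMs", 15, 48),
    ("GF100, hypothetical full chip", 16, 48),
    ("GF104, 7 SMs, 256-bit", 7, 32),
    ("GF106, 4 SMs, 192-bit", 4, 24),
    ("GF106, 144SP (3 SMs), 192-bit", 3, 24),
]


def pixel_rate(sms):
    """Pixels per clock the SMs can hand to the ROPs."""
    return sms * PIXELS_PER_SM_PER_CLOCK


def rop_utilisation(sms, rops):
    """Fraction of the ROPs that raw pixel export can keep busy."""
    return min(pixel_rate(sms), rops) / rops


for label, sms, rops in CONFIGS:
    share = rop_utilisation(sms, rops)
    print(f"{label}: {pixel_rate(sms)} px/clk on {rops} ROPs ({share:.0%})")
```

By this measure the 144SP parts would keep only a quarter of their ROPs busy with raw pixel export.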
 
They exist: GTX 460M, GTS 450 OEM, GTS 440 OEM. All are 192-bit GF106, and the latter two are 144SP parts (6 pixels on 24 ROPs :LOL:).
Ah right. I should have known about the mobile one, though the OEM desktop ones are new to me. The GTS 450 OEM looks particularly mean, as only the "OEM" suffix distinguishes it from a regular GTS 450. I guess that's just what OEMs like: it has more RAM, so who cares about performance anyway? Are there any reviews of it somewhere? I'd nearly bet the additional bandwidth (and certainly the additional ROPs and memory) are useless and it's simply 1/4 slower than a "normal" GTS 450, in line with the SM count.
I've really wondered about that 192-bit bus on GF106 before, as it looks very seriously unbalanced, and I just can't see how it can be worth the additional die area (except if it were used with DDR3, which is the case with the GT440, but then the ROP count is of course still ridiculous).
 
The 192-bit MC allows more flexibility in memory configurations and lets them clock the memory very low on mobile versions (GTX 460M has GDDR5 @ 2.5Gbps).
But of course, ROP partitions with 4 ROPs and 64KiB L2 cache would be better.
 
The 192-bit MC allows more flexibility in memory configurations and lets them clock the memory very low on mobile versions (GTX 460M has GDDR5 @ 2.5Gbps).
However, it also seems to be one of the reasons why GF106 has such bad perf./mm². 8 ROPs, a 64-bit memory interface and 1/3 of the L2 cache are basically dead weight on most GF106 SKUs. Was that higher flexibility worth the additional die area? I have my doubts.
 
I'm reading rumors of a GF110-based GTX 580 card coming in December. Do you guys know anything about that? Is this the thread where a GF110 chip should be discussed?
 
Maybe a base layer respin of GF100?
Fudo claims it has 128 TMUs and a 512-bit memory interface, which would rule out a respin IMO. Of course he could be wrong. Or GF100 has always had those 128 TMUs and 64 ROPs/512-bit, and all the missing ones were just disabled due to abysmal yields.

We'll see...
 
My thinking on this is that GF110 could be a sub-500mm² GPU made only for gaming.
GF100 will be sold as Quadro and Tesla until early 2012.

In my version, the GF110 SMs would look like this:
attachmenty.png

  • execution is superscalar like in GF104
  • 8 TMUs to have high-speed 16x AF
  • full tess- and geo-power
  • half-rate DP is removed and done like in GF104 via emulation on one of the Vec16 ALUs (1/8 rate)

16 of these SMs would give 512 SPs + 128 TMUs + 16 tessellators

On memory/ROP side I would optimize this:
  • cut L2 down to 64KiB per partition, like they did on GF108
  • target frequency of ~1GHz GDDR5

8 partitions would give 512-bit (256GB/s @ ~1GHz); Cayman probably reaches around 190GB/s. The number of ROPs would be 64 and the L2 cache would be 512KiB, which seems enough for gaming (the GTX 460's 336SPs seem well served by 384KiB on the 192-bit version).
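
For anyone who wants to check the memory-side numbers, a small sketch follows. It assumes "~1GHz GDDR5" means 4 Gbit/s per pin (GDDR5 moves four bits per pin per command clock) and 8 ROPs per partition (64 ROPs over 8 partitions); `bandwidth_gb_s` is just an illustrative helper name.

```python
def bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: bus width in bits times per-pin Gbit/s, divided by 8."""
    return bus_width_bits * data_rate_gbps / 8


# Assumption: GDDR5 transfers 4 bits per pin per command clock,
# so "~1GHz" is taken to mean 4 Gbit/s per pin.
GDDR5_BITS_PER_CLOCK = 4

# Partition layout proposed in the post above (ROPs per partition is inferred).
PARTITIONS = 8
BUS_BITS_PER_PARTITION = 64
ROPS_PER_PARTITION = 8
L2_KIB_PER_PARTITION = 64

bus_width = PARTITIONS * BUS_BITS_PER_PARTITION
rops = PARTITIONS * ROPS_PER_PARTITION
l2_kib = PARTITIONS * L2_KIB_PER_PARTITION
peak = bandwidth_gb_s(bus_width, 1.0 * GDDR5_BITS_PER_CLOCK)

print(f"{bus_width}-bit, {rops} ROPs, {l2_kib} KiB L2, {peak:.0f} GB/s")

# Same formula for the GTX 460M figure quoted earlier: 192-bit at 2.5 Gbit/s.
print(f"GTX 460M: {bandwidth_gb_s(192, 2.5):.0f} GB/s")
```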
 
My thoughts on this is that GF110 could be a sub 500mm2 GPU made only for gaming.

Not bad but I would be wary of two things. Depending so heavily on ILP is a big gamble for a flagship part and although GF104 seems to do alright it still shows weakness in some workloads and its worst case scenario is 66% throughput, not 50% as would be the case for your hypothetical GF110. The other thing is the 512-bit bus. Nvidia seems to be losing their love affair with ROPs so maybe they can do 32 ROPs on a 512-bit bus like GT200. Don't think a GF100 arrangement with 64 ROPs required would result in a < 500mm^2 die though.

In terms of hidden TMUs in GF100 it's silly to think disabling one TMU quad per SM on every chip would improve yields. The defects would have to be far too specific and uniform for that to be the case.
 
new chip with 768 shader units and 128 TMUs on a 512-bit DDR memory interface, without the GF100 ballast, following the GF104 model; at 550-650mm² the die area is larger than GF100's, and power dissipation is therefore higher as well; the target market for this graphics chip would then rather be AMD's dual-chip solutions
Potential versus GeForce GTX 480: approx. +50%

Source: http://www.3dcenter.org/news/2010-10-13

768 Cuda cores
128 TMUs
2GB GDDR5 on a 512bit bus
550mm2 to 650mm2
Aimed at AMD's dual-chip cards?
50% quicker than GTX480

Now look at an uncrippled GF104 (GTX460 1GB)
384 Cuda cores
64 TMUs
1GB GDDR5 on a 256bit bus

384 Cuda cores x2 = 768 Cuda cores (check)
64 TMUs x2 = 128 TMUs (check)
1GB GDDR5 on a 256bit bus x2 = 2GB GDDR5 on a 512bit bus (check)

Sounds like a GTX460x2 to me :D

Hmm, he might actually have a point.
 
Personally I think an "upscaled" GF104 would make a lot more sense than some new "GF100/GF104 hybrid". That said, two times GF104 would be really huge (I can't see why it would be smaller than 600mm²; what was the die size limit again?), unless nvidia managed to squeeze more transistors into the same area. Also, clocks couldn't be too high if it is to fit into a 300W TDP, though I guess they could still be similar to GTX460 levels. If Cayman is only 3/2 times Barts, such a GF110 would at least definitely be the fastest single-chip part.
Makes inventory keeping more complicated though, nvidia would have two huge parts for different markets (and both with rather small quantities compared to more mainstream parts).
 
execution is superscalar like in GF104
The pic you provide is exactly a GF100 SM cluster, except it has 8 TMUs instead of the original 4. So why make it superscalar if the original worked without it? :)

To me, a GF104x1.5 makes the most sense. It should be smaller than a GF100, with more shaders, more TMUs, more SFUs, a 384-bit memory bus, 12 polymorph engines (still plenty of tessellation power), wattage around 250W, and 48 ROPs and 768KB of L2 cache to match GF100.... Overall it should easily deliver the 20% or more performance over a full Fermi that is needed to tackle Cayman, in my guesstimate.
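
As a sanity check on the x1.5 idea, here is a quick sketch that scales a full GF104 by 3/2. The baseline uses the figures quoted in this thread (384 SPs, 64 TMUs, 256-bit, 32 ROPs); the 8 SMs/polymorph engines and 512KiB L2 are assumptions derived from them (12 / 1.5, and 128KiB per 64-bit partition). The `scale` helper is hypothetical.

```python
from fractions import Fraction

# Full GF104 as discussed in this thread; the sms and l2_kib entries are derived assumptions.
GF104 = {
    "sps": 384,
    "tmus": 64,
    "sms": 8,          # one polymorph engine per SM is assumed
    "bus_bits": 256,
    "rops": 32,
    "l2_kib": 512,     # 128 KiB per 64-bit partition, inferred from 384 KiB @ 192-bit
}


def scale(config, factor):
    """Scale every unit count by an exact ratio; refuse fractional units."""
    factor = Fraction(factor)
    scaled = {}
    for unit, count in config.items():
        value = count * factor
        if value.denominator != 1:
            raise ValueError(f"{unit} does not scale to a whole number")
        scaled[unit] = int(value)
    return scaled


gf104_x1_5 = scale(GF104, Fraction(3, 2))
for unit, count in gf104_x1_5.items():
    print(f"{unit}: {count}")
```

Under those assumptions the 1.5x chip lands on 576 SPs and 96 TMUs, and matches the 384-bit bus, 12 polymorph engines, 48 ROPs and 768KiB L2 mentioned above. Calling `scale(GF104, 2)` reproduces the 768 SPs / 128 TMUs / 512-bit rumor.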

One problem I see though is GPC organisation. Having three with 4SMs each might be problematic to arrange on a chip. So they could make a GPC consist of only 3 SMs each. But with 4 GPC you have the same problem as with GF100, geometry needs to communicate with each other, so that's a lot of interconnects to be added. Only 2 GPCs would simplify that but having 6SMs in one GPC? :oops:

BTW, I'd love to hear a more educated opinion about the way GPCs communicate with each other for geometry work. Charlie claimed the complexity there is one of the main reasons for Fermi's lack of success. But a more technical explanation did not follow. :)
 