NVIDIA Kepler speculation thread

I'm guessing around 800 MHz (14 SMXes) at 250 W, which would give 10% better FLOPS/W than the 13 SMX K20.
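
For anyone who wants to check the arithmetic behind that guess, here is a rough sketch. It treats peak FLOPS as proportional to SMX count times clock. The K20 clock and TDP (706 MHz, 225 W) are assumed here and are not stated in the post.

```python
# Relative peak FLOPS/W, taking peak FLOPS as proportional to SMX count x clock.
# The K20 figures (706 MHz, 225 W) are assumptions, not numbers from the post.

def flops_per_watt(smx_count, clock_mhz, tdp_w):
    """Relative throughput per watt in arbitrary units (SMX * MHz per W)."""
    return smx_count * clock_mhz / tdp_w

k20 = flops_per_watt(13, 706, 225)     # assumed K20 clock and TDP
guess = flops_per_watt(14, 800, 250)   # the 14 SMX / 800 MHz / 250 W guess
gain_percent = (guess / k20 - 1) * 100

print(f"{gain_percent:.1f}% better FLOPS/W")
```

Under those assumptions the gain lands just under 10%, in line with the estimate above.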

Also, I'm wondering if there will be a second revision of the K20, at least for Teslas, in late 2013 to 2014 with 15 SMXes and faster memory.
 
2013 (rather 2014) will see Maxwell. A second Kepler-Tesla revision then wouldn't make much sense.

How much power does 3 GB of GDDR5 at 1250 MHz consume?
 
I wouldn't bet on more than 850 MHz.

Agreed, 800-850 MHz is probably what GK110 will end up running at in its highest-end GeForce form.


How much power does 3 GB of GDDR5 at 1250 MHz consume?

I can't answer that, but I've seen power consumption numbers in reviews all over the web showing that the 4 GB GTX 670 and GTX 680 cards don't really consume much more power than their 2 GB counterparts.

http://www.xbitlabs.com/articles/graphics/display/evga-geforce-gtx-670-4gb_5.html#sect1 - the same power draw
http://www.pcper.com/reviews/Graphi...aphics-Card-Review/Power-Consumption-and-Temp - PCPer's GTX 670 4 GB runs at a higher clock speed and draws 10 W more in total under load

I doubt that adding or removing 3 GB of VRAM affects the vendor's final TDP numbers. We are probably talking about less than 5 W in total (a guess).

The GTX 680 is bandwidth constrained. It probably needs 6.6 GHz VRAM before the bottleneck is mostly (or entirely) alleviated. That said, a 14 SMX GK110 with 1500 MHz RAM and an 800 MHz core clock would have 40% more core performance and 50% more ROPs and memory bandwidth. The GTX 580 had 25% more core performance and 25% more bandwidth (50% more ROPs) than the GTX 560 Ti and ended up consistently ~40% faster. It will come down to TDP, but the potential for huge performance out of GK110 is there.
 
I can't answer that, but I've seen power consumption numbers in reviews all over the web showing that the 4 GB GTX 670 and GTX 680 cards don't really consume much more power than their 2 GB counterparts.

http://www.xbitlabs.com/articles/graphics/display/evga-geforce-gtx-670-4gb_5.html#sect1 - the same power draw
http://www.pcper.com/reviews/Graphi...aphics-Card-Review/Power-Consumption-and-Temp - PCPer's GTX 670 4 GB runs at a higher clock speed and draws 10 W more in total under load

I doubt that adding or removing 3 GB of VRAM affects the vendor's final TDP numbers. We are probably talking about less than 5 W in total (a guess).

It depends if that extra RAM is actually in use. In those game benchmarks on a GK104 it will presumably be idle most of the time doing little reading or writing, while the opposite will be the case on a K20 running optimised code.
The TDP difference will be more significant than those benchmarks show since TDP measures theoretical peak power.
 
Good point! One should benchmark with 8xMSAA, record the total power consumed over the course of the benchmark, make sure the value is repeatable, and then compare 2 GB vs 4 GB cards. To be as accurate as possible, one should also disable any boost (if it is active on the card).
 
2013 (rather 2014) will see Maxwell. A second Kepler-Tesla revision then wouldn't make much sense.
For my statement above I was assuming that the first Maxwell chips would be like GK104, so not a "real" successor to GK110. So there would be a place for a GK110 revision, but they may not do that in any case.

The GTX 680 is bandwidth constrained. It probably needs 6.6 GHz VRAM before the bottleneck is mostly (or entirely) alleviated.
How would a ~1100 MHz 8 SMX GK114 with 7 Gbps memory do in terms of bandwidth constraints? Could it be ~15% faster than the 680?
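
A quick back-of-the-envelope sketch for that question. It assumes the hypothetical GK114 keeps a 256-bit bus, and it takes the GTX 680 at 6.008 Gbps memory and a 1006 MHz base clock; none of those figures come from the post itself.

```python
# Peak memory bandwidth = per-pin data rate (Gbps) x bus width (bits) / 8.
# Bus widths and the GTX 680 reference figures are assumptions for illustration.

def bandwidth_gbs(data_rate_gbps, bus_width_bits):
    """Peak memory bandwidth in GB/s."""
    return data_rate_gbps * bus_width_bits / 8

gtx680_bw = bandwidth_gbs(6.008, 256)   # assumed GTX 680 memory configuration
gk114_bw = bandwidth_gbs(7.0, 256)      # hypothetical GK114, same bus width

bw_gain_percent = (gk114_bw / gtx680_bw - 1) * 100
core_gain_percent = (1100 / 1006 - 1) * 100   # vs. assumed 1006 MHz base clock

print(f"bandwidth: {gtx680_bw:.1f} -> {gk114_bw:.1f} GB/s ({bw_gain_percent:+.1f}%)")
print(f"core clock: {core_gain_percent:+.1f}%")
```

On those assumptions bandwidth would grow faster than core throughput, so such a part would be somewhat less bandwidth-bound than the 680.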
 
They created a lot of them for Titan (or the XK6/7 clusters in general), so one would need to repeat that, albeit from a lower base.

My guess is that a lot of the software support for the GPUs came from NV. I don't see AMD providing anything more polished beyond their DX11 driver.
 
The GTX 680 is bandwidth constrained. It probably needs 6.6 GHz VRAM before the bottleneck is mostly (or entirely) alleviated. That said, a 14 SMX GK110 with 1500 MHz RAM and an 800 MHz core clock would have 40% more core performance and 50% more ROPs and memory bandwidth. The GTX 580 had 25% more core performance and 25% more bandwidth (50% more ROPs) than the GTX 560 Ti and ended up consistently ~40% faster. It will come down to TDP, but the potential for huge performance out of GK110 is there.

A 14 SMX GK110 with an 800 MHz core(*) and 1500 MHz memory, compared to a 1080 MHz (average review clock) 680:
+29% ALU/TEX
+50% BW
+11% ROP
-8% setup(!)

580 to 560ti:
+25% ALU (but higher efficiency)
+50% BW
+40% ROP (but not export)
+100% setup
and yes, around +40% game performance
But the shader/texture units are very different between GF104 and GF110, so it's hard to compare those numbers to the GK110 situation, where it looks like register space (I doubt it will mean much, if anything, for usual graphics) and L2 cache are the only graphics-related differences.
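
The two comparisons above can be roughly reproduced by scaling unit counts by clock. This is only a sketch: the ROP counts, bus widths, GPC counts, the GTX 580 / GTX 560 Ti clocks (772 / 822 MHz) and their 2004 MHz memory clock are assumed here, not taken from the post.

```python
# Percent change in (unit count x clock) between two chips.
# All unit counts and the 580/560 Ti clocks below are assumptions for illustration.

def gain_percent(units_new, clock_new, units_old, clock_old):
    """Percent change in unit count x clock from the old chip to the new one."""
    return (units_new * clock_new / (units_old * clock_old) - 1) * 100

# 14 SMX GK110 at 800 MHz vs. GTX 680 at 1080 MHz, both with 1500 MHz memory
gk110_vs_680 = {
    "ALU/TEX": gain_percent(14, 800, 8, 1080),       # SMX count x core clock
    "BW":      gain_percent(384, 1500, 256, 1500),   # bus width x memory clock
    "ROP":     gain_percent(48, 800, 32, 1080),      # assumed 48 vs. 32 ROPs
    "setup":   gain_percent(5, 800, 4, 1080),        # assumed 5 vs. 4 GPCs
}

# GTX 580 vs. GTX 560 Ti
gtx580_vs_560ti = {
    "ALU":   gain_percent(512, 772, 384, 822),       # CUDA cores x core clock
    "BW":    gain_percent(384, 2004, 256, 2004),
    "ROP":   gain_percent(48, 772, 32, 822),
    "setup": gain_percent(4, 772, 2, 822),           # assumed 4 vs. 2 GPCs
}

for title, table in (("GK110 vs 680", gk110_vs_680),
                     ("580 vs 560 Ti", gtx580_vs_560ti)):
    print(title)
    for name, value in table.items():
        print(f"  {name:8s}{value:+.1f}%")
```

With these inputs the ALU, bandwidth and ROP figures come out close to the lists above. The clock-adjusted setup figure for the 580 comes out below +100%, which suggests that number in the list counts units only.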

But claiming as much as +50% game performance seems overly optimistic given the above numbers.

(*) I would probably expect more like 850 MHz as the usual boost, but of course that depends on yields and binning of those Tesla cards. If the 225 W, 725 MHz Teslas (double the memory, but no fan) come from the crème de la crème bin, 850 MHz at 250-300 W could be hard for the "cheap" GeForces. The Teslas are clocked for maximum efficiency, while the GeForces will be clocked for maximum performance within an acceptable power budget, i.e. just before the power curve goes completely wild.
 
Setup is not necessarily diminished; in fact, I'd bet it won't be. Scan rate is now 4 ppc/SMX with Kepler and was 2 ppc/SM in Fermi. Triangle rate is also a function of SM count: 0.25 tpc/SM in Fermi, 0.5 tpc/SMX in Kepler, IIRC.
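
If the rates really do scale per SM as recalled above, the totals would work out as in this sketch. The per-SM figures are the "IIRC" numbers from the post, not verified ones. The SM counts (16 for GF110, 8 for GK104) and the 1080 / 800 MHz clocks are assumptions for illustration.

```python
# Peak scan and triangle rates IF they scale with SM/SMX count, as argued above.
# Per-SM rates are the recalled ("IIRC") values from the post, not verified specs.

PER_SM_RATES = {
    "Fermi":  {"pixels_per_clock": 2, "tris_per_clock": 0.25},
    "Kepler": {"pixels_per_clock": 4, "tris_per_clock": 0.5},
}

def peak_rates(arch, sm_count):
    """Return (pixels per clock, triangles per clock) for the whole chip."""
    rates = PER_SM_RATES[arch]
    return (rates["pixels_per_clock"] * sm_count,
            rates["tris_per_clock"] * sm_count)

gf110 = peak_rates("Fermi", 16)    # assumed 16 SMs
gk104 = peak_rates("Kepler", 8)    # assumed 8 SMXes
gk110 = peak_rates("Kepler", 14)   # the 14 SMX part discussed in the thread

# Clock-adjusted triangle throughput in Mtris/s, with assumed clocks
gk104_mtris = gk104[1] * 1080
gk110_mtris = gk110[1] * 800

print("GF110:", gf110, "GK104:", gk104, "GK110:", gk110)
print(f"GK110 vs GK104 triangle rate: {(gk110_mtris / gk104_mtris - 1) * 100:+.1f}%")
```

Under that per-SMX scaling assumption, a 14 SMX GK110 at 800 MHz would come out ahead of GK104 on triangle rate, not behind it.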
 
Setup is not necessarily diminished; in fact, I'd bet it won't be. Scan rate is now 4 ppc/SMX with Kepler and was 2 ppc/SM in Fermi. Triangle rate is also a function of SM count: 0.25 tpc/SM in Fermi, 0.5 tpc/SMX in Kepler, IIRC.
So GPCs don't exist at all? :D
What you appear to allude to is a completely distributed setup/raster system. I always had the impression it was still centralized (but parallel) and that the 2 pixels/clock limit of Fermi was actually a limitation of the pixel export bus (not raster), as it also scaled with the size of the color format. Kepler upped that to 4 pixels/SMX. The still-open question is how many triangles GK110 can set up in parallel. Does it really scale with the number of SMXes?
 
What is the decrease in the setup?

GK104 has 4 GPCs (with 2 SMXes in each), while GK110 only has 5 (with 3 SMXes in each) at the lower clock. That makes a lot of sense for a more compute-oriented GPU, of course. Dunno how much it may limit performance on GK110, as GK104 has a lot already (but that may also be part of its low-res advantage over Tahiti).
(and by setup I don't just mean peak triangle rate, but generally all the graphics things that must scale with the GPCs).
 
Are you sure, I thought it had 6?

Unless I've missed something, the GK110 whitepaper doesn't clarify. However, given that there are 15 SMXes in total, wouldn't you suggest that 5 GPCs with 3 SMXes each is a more "reasonable" layout?
 
Unless I've missed something, the GK110 whitepaper doesn't clarify. However, given that there are 15 SMXes in total, wouldn't you suggest that 5 GPCs with 3 SMXes each is a more "reasonable" layout?

I think that was reasonable until GK106 came out, though. It may be asymmetrical: 7x2 + 1x1, for 8 GPCs in total. That would be similar to Fermi, where the big chip has double the setup of its middle sibling.
 
I think that was reasonable until GK106 came out, though.

Why?

GK104 = 4 GPCs = 4*8 pixels = 32 pixels/clock <--> 32 ROPs
GK106 = 3 GPCs = 3*8 pixels = 24 pixels/clock <--> 24 ROPs

It may be asymmetrical: 7x2 + 1x1, for 8 GPCs in total. That would be similar to Fermi, where the big chip has double the setup of its middle sibling.

Under the above reasoning, either 5 or 6 GPCs (either way there's something asymmetrical after all) makes IMHO far more sense than 8.
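
The same reasoning as a small sketch, using the 8 pixels/clock per GPC from the GK104/GK106 lines above. The 48 ROPs for GK110 is an assumption (50% more than GK104's 32), not a confirmed figure.

```python
# Raster output per clock vs. ROP count, at 8 pixels/clock per GPC (from the post).
# GK110_ROPS is an assumed value, not a confirmed spec.

PIXELS_PER_GPC = 8

def raster_pixels_per_clock(gpc_count):
    """Total rasterized pixels per clock across all GPCs."""
    return gpc_count * PIXELS_PER_GPC

known = {"GK104": (4, 32), "GK106": (3, 24)}   # name -> (GPCs, ROPs)
for name, (gpcs, rops) in known.items():
    print(f"{name}: {raster_pixels_per_clock(gpcs)} pixels/clock <--> {rops} ROPs")

GK110_ROPS = 48   # assumed
candidates = {gpcs: raster_pixels_per_clock(gpcs) for gpcs in (5, 6, 8)}
for gpcs, pixels in candidates.items():
    print(f"GK110 with {gpcs} GPCs: {pixels} pixels/clock <--> {GK110_ROPS} ROPs")
```

With that ROP assumption, 6 GPCs would match the ROP count exactly, 5 would fall slightly short, and 8 would rasterize more pixels than the ROPs could take.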
 
Unless I've missed something, the GK110 whitepaper doesn't clarify. However, given that there are 15 SMXes in total, wouldn't you suggest that 5 GPCs with 3 SMXes each is a more "reasonable" layout?
The die shot clearly shows 6 so it has to be asymmetrical (or 1 for redundancy but that'd be very unexpected). One problem with symmetrical is that if you lose a GPC, you lose all the SMXs tied to that GPC as well...
 
The die shot clearly shows 6 so it has to be asymmetrical (or 1 for redundancy but that'd be very unexpected). One problem with symmetrical is that if you lose a GPC, you lose all the SMXs tied to that GPC as well...

How do you see GPCs on a die shot?
 