Nvidia BigK GK110 Kepler Speculation Thread

So, why did Nvidia decide to double texturing power with GK104 (compared to GF114/110)? TMUs aren't really small (cheap) units. :smile:

You could even expand that question and ask why GK110 has 16 instead of just 8 TMUs per SMX.
 
So, why did Nvidia decide to double texturing power with GK104 (compared to GF114/110)? TMUs aren't really small (cheap) units. :smile:

Two reasons:
First, texturing wasn't doubled relative to shader throughput in GF104/114 but kept at the same rate, since TMUs run at the base frequency.

And second, because they chose to increase shading power as well. TMUs are quite tightly coupled into the SMX, being grouped in quads and (together with a 64 kiB part of the register file, its 12 kiB texture cache and probably its portion of the L1 cache too) fixedly assigned to one of the warp schedulers.

So there was not really an option to have less texturing in each SMX, given their building-block design. It's rather a by-product of other design choices, I think.
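As a rough sketch of that coupling (the per-scheduler figures are the ones quoted above - the TMU quad, the 64 kiB register-file slice and the 12 kiB texture cache - not official specs):

```python
# Toy sketch of the SMX partitioning described above; the per-scheduler
# figures (TMU quad, 64 kiB register slice, 12 kiB texture cache) are
# the post's estimates, not official NVIDIA specifications.
from dataclasses import dataclass

@dataclass
class SchedulerSlice:
    tmus: int = 4                # TMUs come in fixed quads
    register_file_kib: int = 64  # slice of the 256 kiB SMX register file
    tex_cache_kib: int = 12      # slice of the 48 kiB texture cache

@dataclass
class SMX:
    slices: int = 4              # one slice per warp scheduler

    def tmus(self) -> int:
        # Halving texturing would mean tearing a quad out of every
        # slice, which the building-block design doesn't allow.
        return self.slices * SchedulerSlice().tmus

print(SMX().tmus())  # 16 TMUs per SMX
```

That is, you can't get down to 8 TMUs/SMX without redesigning the slice itself.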
 
Extrapolating GK110's desktop performance from sterile unit counts compared to GK104 is somewhat nonsensical, since it would mean that there's not a single difference between those two chips that could affect 3D performance. If you even had a corner case of 3D with a pinch of compute added to the mix, it could get even more colourful.

I wonder where that GK110-flavoured magic sauce is? Apart from the larger L2-cache I really cannot grasp it.

There's:
• Hyper-Q - requires multiple concurrent threads on the host system to feed the GPU. Not applicable in DX, where a single queue is built before dispatching to the driver.
• Dynamic Parallelism - you need CUDA code tailored to this function to use it
• Load path through the texture cache
• 255 regs/thread - mainly useful for DGEMM
• Atomic ops - not sure if applicable to gaming at all.

Apart from that, there's higher pixel and triangle throughput, yes. But that's more or less in balance with higher throughput in other parts of the chip, nothing really "accelerating" GK110 beyond measure.

What did I miss?
 
I wonder where that GK110-flavoured magic sauce is? Apart from the larger L2-cache I really cannot grasp it.

There's:
• Hyper-Q - requires multiple concurrent threads on the host system to feed the GPU. Not applicable in DX, where a single queue is built before dispatching to the driver.
• Dynamic Parallelism - you need CUDA code tailored to this function to use it
• Load path through the texture cache
• 255 regs/thread - mainly useful for DGEMM
• Atomic ops - not sure if applicable to gaming at all.

More bandwidth and probably slightly more usable bandwidth due to larger caches.

Apart from that, there's higher pixel and triangle throughput, yes.

Errr, no. Compared to ALU or texel throughput, for instance, the pixel and triangle throughput differences are way smaller. Assuming Titan is clocked north of 800 MHz, that's a 4-6% difference in triangle throughput, which is a moot point in any case, because Lord knows how much geometry throughput will be strangled artificially in order to justify Quadro sales.
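As a quick sanity check on that 4-6% range (the front-end counts - 4 for GTX 680, a hypothetical 5 for a full GK110 - and the 840-850 MHz clocks are my assumptions):

```python
# Back-of-envelope check of the 4-6% triangle-throughput gap above.
# Front-end counts and the GK110 clocks are assumptions, not specs.
def peak_mtris(front_ends: int, mhz: float) -> float:
    return front_ends * mhz  # one triangle per front-end per clock

gtx680 = peak_mtris(4, 1006)
for mhz in (840, 850):
    delta = peak_mtris(5, mhz) / gtx680 - 1
    print(f"{mhz} MHz: {delta:+.1%}")
```

That lands at roughly +4% to +6% over GTX 680, consistent with the figure quoted above.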

But that's more or less in balance with higher throughput in other parts of the chip, nothing really "accelerating" GK110 beyond measure.

What did I miss?

Alexco missed bandwidth, or more precisely the fillrate-to-bandwidth ratio, in his former estimates. Even if GK110 did contain something that would theoretically accelerate it further beyond the GK10x SKUs, would it really be that worthwhile a hardware investment, given that they already have an overblown transistor budget and that the majority of bottlenecks in today's games lies where exactly, if not in bandwidth?
 
More bandwidth and probably slightly more usable bandwidth due to larger caches.

Errr, no. Compared to ALU or texel throughput, for instance, the pixel and triangle throughput differences are way smaller. Assuming Titan is clocked north of 800 MHz, that's a 4-6% difference in triangle throughput, which is a moot point in any case, because Lord knows how much geometry throughput will be strangled artificially in order to justify Quadro sales.

Alexco missed bandwidth, or more precisely the fillrate-to-bandwidth ratio, in his former estimates. Even if GK110 did contain something that would theoretically accelerate it further beyond the GK10x SKUs, would it really be that worthwhile a hardware investment, given that they already have an overblown transistor budget and that the majority of bottlenecks in today's games lies where exactly, if not in bandwidth?

I didn't mention this because the GTX 680 is not particularly bandwidth-constrained, and because I suspect a GK110-based GeForce is likely to have slower memory, so perhaps a ~40% bandwidth improvement overall, which is pretty much in line with shader power. I think the fillrate should follow the same trend, but maybe I'm missing something.

This is not meant to be an estimate of actual performance, of course, just an upper bound. I don't expect the additional cache to have a significant impact on games, but I could be wrong about that.
 
Ailuros,
More bandwidth, but only in total; compared to other throughput measures it should not move much with GK110. The only real difference I see is the doubled L2. Pixel throughput could be as high as 60 ppc, the raster rate could be 40 ppc, and triangles could be at 7.5 tpc, scaling with the number of SMXes.

I don't really see where clock speed comes into play. That's only important when we're comparing not architectures but SKUs.
 
I didn't mention this because the GTX 680 is not particularly bandwidth-constrained, and because I suspect a GK110-based GeForce is likely to have slower memory, so perhaps a ~40% bandwidth improvement overall, which is pretty much in line with shader power. I think the fillrate should follow the same trend, but maybe I'm missing something.

This is not meant to be an estimate of actual performance, of course, just an upper bound. I don't expect the additional cache to have a significant impact on games, but I could be wrong about that.

So to sum it up, you're expecting only a quite small frequency increase over a K20X, with even less bandwidth than the latter has. Don't tell me that you also expect to see a $899 MSRP for a solution like that :LOL:
 
Ailuros,
More bandwidth, but only in total; compared to other throughput measures it should not move much with GK110. The only real difference I see is the doubled L2. Pixel throughput could be as high as 60 ppc, the raster rate could be 40 ppc, and triangles could be at 7.5 tpc, scaling with the number of SMXes.

I don't really see where clock speed comes into play. That's only important when we're comparing not architectures but SKUs.

When you hypothetically have a raster rate of 40 pixels/clock at, say, 850 MHz vs. 32 pixels/clock at 1006 MHz, is frequency really such an unimportant factor after all?
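The arithmetic behind that rhetorical question, with both configurations as hypothetical as stated above:

```python
# Peak raster rate for the two hypothetical configurations above
# (40 ppc @ 850 MHz vs. 32 ppc @ 1006 MHz).
def raster_gpix(pixels_per_clock: int, mhz: float) -> float:
    return pixels_per_clock * mhz / 1000.0  # Gpixels/s

print(raster_gpix(40, 850))   # 34.0 Gpix/s
print(raster_gpix(32, 1006))  # 32.192 Gpix/s
```

The wider front-end stays roughly 5-6% ahead despite the lower clock, so the clock assumption is doing real work in the comparison.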
 
You were talking about "more bandwidth", Ail. This has to be put into relation to other resources, which increase as well. That's what I was talking about.

Clock rate is important for a SKU-level comparison, for sure. But we were talking about the magic sauce in GK110. Sorry, I really can't follow your leaps of thought right now. ;)

BTW - I like your 850 MHz number ;-)
 
You were talking about "more bandwidth", Ail. This has to be put into relation to other resources, which increase as well. That's what I was talking about.

Clock rate is important for a SKU-level comparison, for sure. But we were talking about the magic sauce in GK110.

BTW - I like your 850 MHz number ;-)

I never said I expect or believe there's any magic sauce in GK110. Let me do some more dumb math:

GF110 vs. GF114:
GFLOPs = +25%
GTexels = -6%
MTris = +88%
GPixels = +31%
GB/s = +50%
(real-world average performance difference ~42%, and call me bold here, but I wouldn't be surprised if the difference had been closer to (though still below) 50%, had GF110 had quite a bit more fillrate)

GK110@850/1500MHz (theoretical) vs. GK104:
GFLOPs = +58%
GTexels = +58%
MTris = +6%
GPixels = +27%
GB/s = +50%

Yes, those are naive peak numbers, but I still have quite a hard time believing that under those conditions the difference between the latter two will be only 30%. In fact, if the real difference should be at 40-50%, it's no real stunt for upcoming comparisons either; most likely it'll just mean that history repeats itself against the competition's top-dog single-chip SKU.
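For what it's worth, those deltas can be reproduced from peak specs; the GK110 line (2880 ALUs, 240 TMUs, 48 ROPs, 288.4 GB/s at the assumed 850 MHz) is pure speculation, while the GTX 680 figures are public:

```python
# Reproducing the GK110-vs-GK104 "dumb math" above from peak specs.
# GTX 680 numbers are public; the GK110 line is the post's speculation.
chips = {
    "GTX 680":     dict(alus=1536, tmus=128, rops=32, mhz=1006, gbs=192.2),
    "GK110 @ 850": dict(alus=2880, tmus=240, rops=48, mhz=850,  gbs=288.4),
}

def peaks(c):
    return dict(
        gflops=2 * c["alus"] * c["mhz"] / 1000,  # FMA = 2 flops/clock
        gtexels=c["tmus"] * c["mhz"] / 1000,
        gpixels=c["rops"] * c["mhz"] / 1000,
        gbs=c["gbs"],
    )

base, big = (peaks(c) for c in chips.values())
for key in base:
    print(f"{key:7s} {big[key] / base[key] - 1:+.0%}")
```

This lands at +58% GFLOPs, +58% GTexels, +27% GPixels and +50% GB/s, matching the list above.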
 
So to sum it up, you're expecting only a quite small frequency increase over a K20X, with even less bandwidth than the latter has. Don't tell me that you also expect to see a $899 MSRP for a solution like that :LOL:

The K20X has only 30% more bandwidth than the GTX 680 (250GB/s vs. 192.2GB/s). So yes, I believe something like a 40% improvement for a GK110-based GeForce is reasonable.

http://www.nvidia.com/object/tesla-servers.html
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications
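The 30% figure follows directly from the numbers on those spec pages:

```python
# Bandwidth comparison from the spec pages linked above.
k20x = 250.0    # GB/s, Tesla K20X
gtx680 = 192.2  # GB/s, GeForce GTX 680

print(f"K20X over GTX 680: {k20x / gtx680 - 1:+.0%}")  # about +30%
# A ~40% uplift for a GK110-based GeForce would land around:
print(f"{gtx680 * 1.4:.0f} GB/s")
```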
 
Big Kepler's magic sauce lies entirely in the CUDA, HPC and cloud-service professional markets. None of the advanced features translates into a useful capability for the end user running D3D or OGL applications, let alone OCL, which NV hasn't even cared to think about in the last couple of years.

What is left is a big and expensive chip with a ton of unusable/disabled dedicated logic, in the hands of a small "enthusiast" elite benchmarking crowd... and a relatively cheap option for the few GPGPU coders and developers out there willing to shell out money for it.
 
I never said I expect or believe there's any magic sauce in GK110.

I was basically referring to this part you posted originally: "If you'd even have a corner case of 3D with a pinch of compute added to the mix it could get even more colourful."

It's not you specifically posting this, but I have read this sentiment more and more over the last couple of days: GK110 is much better at [younameit] than GK104, because it is designed to do compute. NO. It is designed to be fast at DP and to profit from Hyper-Q and Dynamic Parallelism, both of which don't do squat without proper code.

Let me do some more dumb math:

GF110 vs. GF114:
GFLOPs = +25%
GTexels = -6%
MTris = +88%
GPixels = +31%
GB/s = +50%
(real-world average performance difference ~42%, and call me bold here, but I wouldn't be surprised if the difference had been closer to (though still below) 50%, had GF110 had quite a bit more fillrate)

GK110@850/1500MHz (theoretical) vs. GK104:
GFLOPs = +58%
GTexels = +58%
MTris = +6% [it'd be +58% actually]
GPixels = +27%
GB/s = +50%

Yes, those are naive peak numbers, but I still have quite a hard time believing that under those conditions the difference between the latter two will be only 30%. In fact, if the real difference should be at 40-50%, it's no real stunt for upcoming comparisons either; most likely it'll just mean that history repeats itself against the competition's top-dog single-chip SKU.

There's one trick GF110 (and GF100, for that matter) did not have up its sleeve: it did not use dual issue in/after its warp schedulers, but more fine-grained control logic, thus not relying on extracting ILP for maximum utilization. I firmly believe that quite a bit of GF110's higher performance compared to GF114 comes from this, and not all of it is attributable to higher bandwidth. As a small hint I take the results from my earlier experiment over here - where GF114 would be characterized a little like the HD 7970, i.e. not behaving as "scalarly" as GF110.

In Kepler, there's no such difference.
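A toy way to put that scheduling difference in numbers - entirely my own simplification, not a hardware model:

```python
# Toy utilization model for the scheduling difference described above:
# a dual-issue scheduler (GF104/114-style, like Kepler) only reaches
# peak when a second independent instruction (ILP) can be co-issued;
# a single-issue design (GF100/110-style) is fed from TLP alone.
def alu_utilization(dual_issue: bool, ilp_fraction: float) -> float:
    if not dual_issue:
        return 1.0  # peak reachable with enough warps, no ILP needed
    # first issue slot always fills; the second only when an
    # independent co-issue candidate exists in the instruction stream
    return 0.5 + 0.5 * ilp_fraction

print(alu_utilization(False, 0.0))  # GF110-style: 1.0
print(alu_utilization(True, 0.6))   # GF114-style, 60% co-issue: 0.8
```

Under this (crude) model, code with little extractable ILP leaves a dual-issue design well short of peak, which would explain part of the GF110-vs-GF114 gap without invoking bandwidth.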
 
CarstenS said:
I firmly believe that quite a bit of GF110's higher performance compared to GF114 comes from this, and not all of it is attributable to higher bandwidth.
I would wager it is mostly the bandwidth.
 
A bit off-topic, but here's a little more evidence of Kepler's weakness in general compute: CUDA-accelerated ray tracing in Adobe After Effects.

http://www.legitreviews.com/article/2127/1/

[image: raytracing.png]

Kepler's consumer variants maybe. GK110 is an entirely different matter.
 
Kepler's consumer variants maybe. GK110 is an entirely different matter.

I've seen people having that intuition, and I totally believe GK110 will look bad too:
- it's graphics, all in FP32, no FP64
- it would be running the same code, not made for the new features - unless maybe Hyper-Q (and nothing else) can be exploited transparently with some driver magic, I don't know.

The doubled L2 is another significant factor that may help GK110 (256K per memory controller instead of 128K).

Otherwise, my theory is that the code is particularly optimised for Fermi, so you would need new code for Kepler to match or outperform it.
 