Nvidia BigK GK110 Kepler Speculation Thread

So, why did Nvidia decide to double texturing power with GK104 (compared to GF114/110)? TMUs aren't really small (cheap) units. :smile:

You could even expand that question and ask why GK110 has 16 instead of just 8 TMUs per SMX.
 
So, why did Nvidia decide to double texturing power with GK104 (compared to GF114/110)? TMUs aren't really small (cheap) units. :smile:

Two reasons:
First, texturing wasn't doubled relative to shader throughput in GF104/114 but kept at the same rate, since TMUs run at the base frequency.

And second, because they chose to increase shading power as well. TMUs are quite tightly coupled into the SMX, being grouped in quads and (together with a 64 kiB part of the register file, its 12 kiB texture cache and probably its portion of the L1 cache too) fixedly assigned to one of the warp schedulers.

So there was not really an option to have less texturing in each SMX, given their building-block design. It's rather a by-product of other design choices, I think.
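As a rough sketch of that coupling (the per-scheduler figures are the ones quoted above - the TMU quad, the 64 kiB register-file slice and the 12 kiB texture cache - not official specs):

```python
# Toy sketch of the SMX partitioning described above; the per-scheduler
# figures (TMU quad, 64 kiB register slice, 12 kiB texture cache) are
# the post's estimates, not official NVIDIA specifications.
from dataclasses import dataclass

@dataclass
class SchedulerSlice:
    tmus: int = 4                # TMUs come in fixed quads
    register_file_kib: int = 64  # slice of the 256 kiB SMX register file
    tex_cache_kib: int = 12      # slice of the 48 kiB texture cache

@dataclass
class SMX:
    slices: int = 4              # one slice per warp scheduler

    def tmus(self) -> int:
        # Halving texturing would mean tearing a quad out of every
        # slice, which the building-block design doesn't allow.
        return self.slices * SchedulerSlice().tmus

print(SMX().tmus())  # 16 TMUs per SMX
```

That is, you can't get down to 8 TMUs/SMX without redesigning the slice itself.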
 
Extrapolating GK110's desktop performance from sterile unit counts compared to GK104 is somewhat nonsensical, since it would mean that there's not a single difference between those two chips that could affect 3D performance. If you even had a corner case of 3D with a pinch of compute added to the mix, it could get even more colourful.

I wonder where that GK110-flavoured magic sauce is? Apart from the larger L2-cache I really cannot grasp it.

There's:
• Hyper-Q - requires multiple concurrent threads on the host system to feed the GPU. Not applicable in DX, where a single queue is built before dispatching to the driver.
• Dynamic Parallelism - you need CUDA code tailored to this function to use it
• Load path through the texture cache
• 255 regs/thread - mainly useful for DGEMM
• Atomic ops - not sure if applicable to gaming at all.

Apart from that, there's higher pixel and triangle throughput, yes. But that's more or less in balance with higher throughput in other parts of the chip, nothing really "accelerating" GK110 beyond measure.

What did I miss?
 
I wonder where that GK110-flavoured magic sauce is? Apart from the larger L2-cache I really cannot grasp it.

There's:
• Hyper-Q - requires multiple concurrent threads on the host system to feed the GPU. Not applicable in DX, where a single queue is built before dispatching to the driver.
• Dynamic Parallelism - you need CUDA code tailored to this function to use it
• Load path through the texture cache
• 255 regs/thread - mainly useful for DGEMM
• Atomic ops - not sure if applicable to gaming at all.

More bandwidth and probably slightly more usable bandwidth due to larger caches.

Apart from that, there's higher pixel and triangle throughput, yes.

Errr, no. Compared to ALU or texel throughput, for instance, the pixel and triangle throughput differences are way smaller. Assuming Titan is clocked north of 800 MHz, that's a 4-6% difference in triangle throughput, which is a moot point in any case, because Lord knows how much geometry throughput will be strangled artificially in order to justify Quadro sales.
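As a quick sanity check on that 4-6% range (the front-end counts - 4 for GTX 680, a hypothetical 5 for a full GK110 - and the 840-850 MHz clocks are my assumptions):

```python
# Back-of-envelope check of the 4-6% triangle-throughput gap above.
# Front-end counts and the GK110 clocks are assumptions, not specs.
def peak_mtris(front_ends: int, mhz: float) -> float:
    return front_ends * mhz  # one triangle per front-end per clock

gtx680 = peak_mtris(4, 1006)
for mhz in (840, 850):
    delta = peak_mtris(5, mhz) / gtx680 - 1
    print(f"{mhz} MHz: {delta:+.1%}")
```

That lands at roughly +4% to +6% over GTX 680, consistent with the figure quoted above.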

But that's more or less in balance with higher throughput in other parts of the chip, nothing really "accelerating" GK110 beyond measure.

What did I miss?

Alexco missed bandwidth, or more precisely the fillrate-to-bandwidth ratio, in his former estimates. Even if GK110 did contain something that would theoretically accelerate it further beyond the GK10x SKUs, would it really be that worthwhile a hardware investment, given that they already have an overblown transistor budget and that the majority of bottlenecks in today's games lies where exactly, if not in bandwidth?
 
More bandwidth and probably slightly more usable bandwidth due to larger caches.

Errr, no. Compared to ALU or texel throughput, for instance, the pixel and triangle throughput differences are way smaller. Assuming Titan is clocked north of 800 MHz, that's a 4-6% difference in triangle throughput, which is a moot point in any case, because Lord knows how much geometry throughput will be strangled artificially in order to justify Quadro sales.

Alexco missed bandwidth, or more precisely the fillrate-to-bandwidth ratio, in his former estimates. Even if GK110 did contain something that would theoretically accelerate it further beyond the GK10x SKUs, would it really be that worthwhile a hardware investment, given that they already have an overblown transistor budget and that the majority of bottlenecks in today's games lies where exactly, if not in bandwidth?

I didn't mention this because the GTX 680 is not particularly bandwidth-constrained, and because I suspect a GK110-based GeForce is likely to have slower memory, so perhaps a ~40% bandwidth improvement overall, which is pretty much in line with shader power. I think the fillrate should follow the same trend, but maybe I'm missing something.

This is not meant to be an estimate of actual performance, of course, just an upper bound. I don't expect the additional cache to have a significant impact on games, but I could be wrong about that.
 
Ailuros,
More bandwidth, but only in total; compared to other throughput measures it should not move much with GK110. The only real difference I see is the doubled L2. Pixel throughput could be as high as 60 ppc, the raster rate could be 40 ppc, and triangles could be at 7.5 tpc, scaling with the number of SMXes.

I don't really see where clock speed comes into play. That's only important when we're comparing not architectures but SKUs.
 
I didn't mention this because the GTX 680 is not particularly bandwidth-constrained, and because I suspect a GK110-based GeForce is likely to have slower memory, so perhaps a ~40% bandwidth improvement overall, which is pretty much in line with shader power. I think the fillrate should follow the same trend, but maybe I'm missing something.

This is not meant to be an estimate of actual performance, of course, just an upper bound. I don't expect the additional cache to have a significant impact on games, but I could be wrong about that.

So to sum it up, you're expecting only a quite small frequency increase over a K20X, with even less bandwidth than the latter has. Don't tell me that you also expect to see a $899 MSRP for a solution like that :LOL:
 
Ailuros,
More bandwidth, but only in total; compared to other throughput measures it should not move much with GK110. The only real difference I see is the doubled L2. Pixel throughput could be as high as 60 ppc, the raster rate could be 40 ppc, and triangles could be at 7.5 tpc, scaling with the number of SMXes.

I don't really see where clock speed comes into play. That's only important when we're comparing not architectures but SKUs.

When you hypothetically have a raster rate of 40 pixels/clock at, say, 850 MHz vs. 32 pixels/clock at 1006 MHz, is frequency really such an unimportant factor after all?
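The arithmetic behind that rhetorical question, with both configurations as hypothetical as stated above:

```python
# Peak raster rate for the two hypothetical configurations above
# (40 ppc @ 850 MHz vs. 32 ppc @ 1006 MHz).
def raster_gpix(pixels_per_clock: int, mhz: float) -> float:
    return pixels_per_clock * mhz / 1000.0  # Gpixels/s

print(raster_gpix(40, 850))   # 34.0 Gpix/s
print(raster_gpix(32, 1006))  # 32.192 Gpix/s
```

The wider front-end stays roughly 5-6% ahead despite the lower clock, so the clock assumption is doing real work in the comparison.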
 
You were talking about "more bandwidth", Ail. This has to be put into relation to other resources, which increase as well. That's what I was talking about.

Clock rate is important for a SKU-level comparison, for sure. But we were talking about the magic sauce in GK110. Sorry, I really can't follow your leaps of thought right now. ;)

BTW - I like your 850 MHz number ;-)
 
You were talking about "more bandwidth", Ail. This has to be put into relation to other resources, which increase as well. That's what I was talking about.

Clock rate is important for a SKU-level comparison, for sure. But we were talking about the magic sauce in GK110.

BTW - I like your 850 MHz number ;-)

I never said I expect or believe there's any magic sauce in GK110. Let me do some more dumb math:

GF110 vs. GF114:
GFLOPs = +25%
GTexels = -6%
MTris = +88%
GPixels = +31%
GB/s = +50%
(real-world average performance difference ~42%, and call me bold here, but I wouldn't be surprised if the difference had been closer to (though still below) 50%, had GF110 had quite a bit more fillrate)

GK110@850/1500MHz (theoretical) vs. GK104:
GFLOPs = +58%
GTexels = +58%
MTris = +6%
GPixels = +27%
GB/s = +50%

Yes, those are naive peak numbers, but I still have quite a hard time believing that under those conditions the difference between the latter two will be only 30%. In fact, if the real difference should be at 40-50%, it's no real stunt for upcoming comparisons either; most likely it'll just mean that history repeats itself against the competition's top-dog single-chip SKU.
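For what it's worth, those deltas can be reproduced from peak specs; the GK110 line (2880 ALUs, 240 TMUs, 48 ROPs, 288.4 GB/s at the assumed 850 MHz) is pure speculation, while the GTX 680 figures are public:

```python
# Reproducing the GK110-vs-GK104 "dumb math" above from peak specs.
# GTX 680 numbers are public; the GK110 line is the post's speculation.
chips = {
    "GTX 680":     dict(alus=1536, tmus=128, rops=32, mhz=1006, gbs=192.2),
    "GK110 @ 850": dict(alus=2880, tmus=240, rops=48, mhz=850,  gbs=288.4),
}

def peaks(c):
    return dict(
        gflops=2 * c["alus"] * c["mhz"] / 1000,  # FMA = 2 flops/clock
        gtexels=c["tmus"] * c["mhz"] / 1000,
        gpixels=c["rops"] * c["mhz"] / 1000,
        gbs=c["gbs"],
    )

base, big = (peaks(c) for c in chips.values())
for key in base:
    print(f"{key:7s} {big[key] / base[key] - 1:+.0%}")
```

This lands at +58% GFLOPs, +58% GTexels, +27% GPixels and +50% GB/s, matching the list above.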
 
So to sum it up, you're expecting only a quite small frequency increase over a K20X, with even less bandwidth than the latter has. Don't tell me that you also expect to see a $899 MSRP for a solution like that :LOL:

The K20X has only 30% more bandwidth than the GTX 680 (250GB/s vs. 192.2GB/s). So yes, I believe something like a 40% improvement for a GK110-based GeForce is reasonable.

http://www.nvidia.com/object/tesla-servers.html
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-680/specifications
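The 30% figure follows directly from the numbers on those spec pages:

```python
# Bandwidth comparison from the spec pages linked above.
k20x = 250.0    # GB/s, Tesla K20X
gtx680 = 192.2  # GB/s, GeForce GTX 680

print(f"K20X over GTX 680: {k20x / gtx680 - 1:+.0%}")  # about +30%
# A ~40% uplift for a GK110-based GeForce would land around:
print(f"{gtx680 * 1.4:.0f} GB/s")
```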
 
Big Kepler's magic sauce lies entirely in the CUDA, HPC and cloud-service professional markets. None of the advanced features translates into a useful capability for the end user running D3D or OGL applications, let alone OCL, which NV hasn't even cared to think about in the last couple of years.

What is left is a big and expensive chip with a ton of unusable/disabled dedicated logic, in the hands of a small "enthusiast" elite benchmarking crowd... and a relatively cheap option for the few GPGPU coders and developers out there willing to shell out money for it.
 
I never said I expect or believe there's any magic sauce in GK110.

I was basically referring to this part you posted originally: "If you'd even have a corner case of 3D with a pinch of compute added to the mix it could get even more colourful."

It's not you specifically posting this, but I have read this sentiment more and more over the last couple of days: GK110 is much better at [younameit] than GK104, because it is designed to do compute. NO. It is designed to be fast at DP and to profit from Hyper-Q and Dynamic Parallelism, both of which don't do squat without proper code.

Let me do some more dumb math:

GF110 vs. GF114:
GFLOPs = +25%
GTexels = -6%
MTris = +88%
GPixels = +31%
GB/s = +50%
(real-world average performance difference ~42%, and call me bold here, but I wouldn't be surprised if the difference had been closer to (though still below) 50%, had GF110 had quite a bit more fillrate)

GK110@850/1500MHz (theoretical) vs. GK104:
GFLOPs = +58%
GTexels = +58%
MTris = +6% [it'd be +58% actually]
GPixels = +27%
GB/s = +50%

Yes, those are naive peak numbers, but I still have quite a hard time believing that under those conditions the difference between the latter two will be only 30%. In fact, if the real difference should be at 40-50%, it's no real stunt for upcoming comparisons either; most likely it'll just mean that history repeats itself against the competition's top-dog single-chip SKU.

There's one trick GF110 (and GF100, for that matter) did not have up its sleeve: it did not use dual issue in/after its warp schedulers, but more fine-grained control logic, thus not relying on extracting ILP for maximum utilization. I firmly believe that quite a bit of GF110's higher performance compared to GF114 comes from this, and not all of it is attributable to higher bandwidth. As a small hint I take the results from my earlier experiment over here - where GF114 would be characterized a little like the HD 7970, i.e. not behaving as "scalarly" as GF110.

In Kepler, there's no such difference.
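A toy way to put that scheduling difference in numbers - entirely my own simplification, not a hardware model:

```python
# Toy utilization model for the scheduling difference described above:
# a dual-issue scheduler (GF104/114-style, like Kepler) only reaches
# peak when a second independent instruction (ILP) can be co-issued;
# a single-issue design (GF100/110-style) is fed from TLP alone.
def alu_utilization(dual_issue: bool, ilp_fraction: float) -> float:
    if not dual_issue:
        return 1.0  # peak reachable with enough warps, no ILP needed
    # first issue slot always fills; the second only when an
    # independent co-issue candidate exists in the instruction stream
    return 0.5 + 0.5 * ilp_fraction

print(alu_utilization(False, 0.0))  # GF110-style: 1.0
print(alu_utilization(True, 0.6))   # GF114-style, 60% co-issue: 0.8
```

Under this (crude) model, code with little extractable ILP leaves a dual-issue design well short of peak, which would explain part of the GF110-vs-GF114 gap without invoking bandwidth.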
 
CarstenS said:
I firmly believe that quite a bit of GF110's higher performance compared to GF114 comes from this, and not all of it is attributable to higher bandwidth.
I would wager it is mostly the bandwidth.
 
A bit off-topic, but here's a little more evidence of Kepler's weakness in general compute: CUDA-accelerated ray tracing in Adobe After Effects.

http://www.legitreviews.com/article/2127/1/

[image: raytracing.png]

Kepler's consumer variants maybe. GK110 is an entirely different matter.
 
Kepler's consumer variants maybe. GK110 is an entirely different matter.

I've seen people having that intuition, and I totally believe GK110 will look bad too:
- it's graphics, all in FP32, no FP64
- it would be running the same code, not made for the new features - unless maybe Hyper-Q (and nothing else) can be exploited transparently with some driver magic, I don't know.

The doubled L2 is another significant factor that may help GK110 (256K per memory controller instead of 128K).

Otherwise, my theory is that the code is particularly optimised for Fermi, so you would need new code for Kepler to match or outperform it.
 