NVIDIA Kepler speculation thread

What are the odds of GK104 actually being branded the "680"? If it is, we could be looking at a very similar situation to the 9800 GTX.

- Marginally faster than the prior high-end, and slower under certain conditions.
- Lower bandwidth.
- New process
- $299

Things are very different now than they were back then, though. Back then NVIDIA had no single-card competition, and the 9800 GX2 was already out too. It's hard to believe GK104 could produce enough of a performance increase over the 580 to earn the 680 name. Even the "broken" GTX 480 was 50-60% faster than the 280, plus it added DX11.

It could simply be NVIDIA's answer to AMD calling Barts the 6800 series, Cayman 6900, and so forth.

An announcement that there will be an announcement about when the product will be announced?

Come on! Would NVIDIA ever do that? :D
 
It could simply be NVIDIA's answer to AMD calling Barts the 6800 series, Cayman 6900, and so forth.
Which is ironic, considering that moving Barts to the 6800 series was apparently triggered by nV using "GTX" (high-end naming) on the midrange 460 (and then the 560) :D
 
Which is ironic, considering that moving Barts to the 6800 series was apparently triggered by nV using "GTX" (high-end naming) on the midrange 460 (and then the 560) :D

Just give it a few generations, all SKUs from bottom- to top-end will be called HD N950 to HD N999 on AMD's side and GTX N90 to GTX N99 on NVIDIA's… :p
 
I thought April was the magic date, with GK110 only following in September or so?

For the record there's no such thing as enough bandwidth on any GPU out there.
R600?

Meanwhile, I'm still trying to figure out how Nvidia could fit 4 GPCs / 16 SMs / 1536 ALUs on a ~350mm² chip.
I think it's doable though. Essentially the SMs would be "GF104" style, just with 3x32 shader ALUs instead of 3x16 to compensate for the missing hot clock (from a scheduling point of view nothing would actually change). That would certainly make the SMs somewhat larger, so it seems somewhat unlikely it would fit on a chip the same size as GF104/114 (granted, the ROPs wouldn't double up, but almost everything else would). There's another trick Nvidia could pull though: eliminate the SFUs and integrate that functionality into the normal shader ALUs (which AMD did too), which should save some transistors. I don't think separate SFUs really make sense any longer (and I'm not sure they did for Fermi either).
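To put rough numbers on that (a minimal sketch; the SM configurations are the speculation above, not confirmed specs, and an FMA is counted as two flops):

```python
# Rough per-SM issue-rate comparison: a GF104-style hot-clocked SM (3x16 ALUs,
# shaders running at 2x the base clock) vs the speculated non-hot-clocked
# Kepler SM (3x32 ALUs at base clock).

def flops_per_base_cycle(alus_per_sm, clock_multiplier, flops_per_alu=2):
    """Flops issued per SM per base-clock cycle (counting an FMA as 2 flops)."""
    return alus_per_sm * clock_multiplier * flops_per_alu

gf104_style_sm = flops_per_base_cycle(3 * 16, clock_multiplier=2)   # 2x hot clock
kepler_style_sm = flops_per_base_cycle(3 * 32, clock_multiplier=1)  # no hot clock

print(gf104_style_sm, kepler_style_sm)  # 192 192 -> same issue rate, paid for in area instead
```

Same throughput per SM per base-clock cycle, which is why the scheduling picture wouldn't change; the cost shows up as die area instead.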
 
If it's faster than Tahiti, why price it at 299 USD when you can achieve the same effect at 399 USD? That's still hugely undercutting your competition. Assuming die size is similar and that 1 GB of GDDR5 isn't 100 USD or more, you'd still be eating into your margins.

The scenario I laid out doesn't have it beating Tahiti.

So, IMO, either the rumored price is wrong or the rumored performance is wrong.

Exactly.
 
Yeah, it is definitely hard to imagine. I'm wondering if they'll be showing Kepler off at PDXLAN 19... their site says, as of this morning....

"*** SPECIAL PDXLAN 19 ALERT!!! ***
Monday, 13 February 2012

Monday, 13 February 2012 NVIDIA and Gearbox are bringing something really special to PDXLAN 19.

Attendees are in for an exclusive treat that will blow your mind!!



NVIDIA and Gearbox are bringing something really special to PDXLAN 19.

Attendees are in for an exclusive treat that will blow your mind!!"

PhysX support for Borderlands 2? Yikes!
 
What exactly is it btw, that keeps GPUs from reaching frequencies that CPUs have been more or less comfortable at for years, i.e. 3 GHz-ish?

IF - I know that's a big if - Nvidia managed to triple or even quadruple the hot clock relative to a common base clock (with more options, probably for mobile at reduced voltages), they could reach the performance of a 1536-ALU part (non-hot-clocked) with just 512 ALUs, and would have easier routing throughout the chip, probably less instruction-scheduling overhead, and whatnot.

Plus, if leakage and variance are still a big thing on 28nm, which apparently they were at least at the start of production, they'd have fewer transistors, leaking proportionally less.
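The raw-throughput arithmetic behind that hypothetical (a sketch only; the ALU counts and hot-clock multipliers are the ones above, and the base clock is just a placeholder):

```python
# A 512-ALU part with a 3x (or 4x) hot clock vs a 1536-ALU part with no hot
# clock. Only the ratios matter; the 1 GHz base clock is illustrative.

def peak_gflops(alus, clock_ghz, flops_per_alu=2):  # FMA counted as 2 flops
    return alus * flops_per_alu * clock_ghz

base_ghz = 1.0
print(peak_gflops(1536, base_ghz))       # 3072.0 GFLOPS, non-hot-clocked
print(peak_gflops(512, 3 * base_ghz))    # 3072.0 GFLOPS, 3x hot clock -> same peak
print(peak_gflops(512, 4 * base_ghz))    # 4096.0 GFLOPS, 4x hot clock
```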
 
My guess would be that GPUs lack: proper foundry processes, a fair share of custom logic, and complex & deep pipelines.

Higher power consumption would probably also be a key factor here.
 
It makes little to no difference to me whether you consider the GF114 or the GF110 a more balanced design.

I'm fairly sure you asked me about the GF110 compared to the GF114, but perhaps I was mistaken
;)

The case in point was an indirect parallel between the performance and high-end SKUs in the Fermi family, and whether there could be anything comparable with GK104 or, by extension, between the performance and high-end Kepler parts. For the record there's no such thing as enough bandwidth on any GPU out there.

In which case, I think the GK110 is likely to end up just as bw constrained as the GK104 (proportionately)
- assuming it has 50% more SPs
(AFAIK, the current rumor is for 6 GPCs compared to the 4 GPCs on the GK104)
- and the currently rumored 384-bit bus gives it 50% more bw...

unless it has a 512-bit bus...
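Spelling out that proportionality (a minimal sketch assuming GK104 has a 256-bit bus, the same memory data rate on both chips, and the same ALU count per GPC - all assumptions, not confirmed specs):

```python
# Relative bytes-per-flop for the rumored configs, treating bandwidth as
# proportional to bus width and ALU count as proportional to GPC count.

def relative_bytes_per_flop(bus_width_bits, gpcs):
    return bus_width_bits / gpcs  # arbitrary units; only the ratio matters

gk104        = relative_bytes_per_flop(256, 4)  # 64.0 (assumed 256-bit bus)
gk110_384bit = relative_bytes_per_flop(384, 6)  # 64.0 -> proportionately just as constrained
gk110_512bit = relative_bytes_per_flop(512, 6)  # ~85.3 -> ~33% more bandwidth per flop

print(gk104, gk110_384bit, gk110_512bit)
```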
 
The big problem with running a GPU at CPU-level frequencies is that a high-end GPU die easily has 5 times as much surface area as a CPU die. Add the fact that roughly 50% of a CPU is cache, which uses less power and generates less heat than active logic, whereas a GPU has most of its area dedicated to active logic. Combine those two, and you can see that a GPU at CPU-like clocks would consume an order of magnitude more power and generate that much more heat. Ouch.

Basically, the laws of physics dictate that doubling core count uses less energy, and therefore generates less heat, than doubling frequency. GPUs are pretty much at the power-consumption limit already, at least without exotic cooling, and GPU programming expects very high levels of parallelism anyway, so it's more efficient to increase core count than frequency to stay within the power budget.
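A back-of-the-envelope version of that trade-off, using the standard dynamic-power relation P ~ C * V^2 * f (the voltage bump for the 2x-frequency case is a made-up illustrative number):

```python
# Crude comparison: doubling unit count vs doubling frequency from the same
# baseline. The 1.3x voltage increase needed to hit 2x frequency is invented
# for illustration; the point is only that power grows much faster than
# linearly along the frequency axis.

def relative_dynamic_power(units, freq_scale, volt_scale):
    return units * volt_scale**2 * freq_scale

baseline     = relative_dynamic_power(1, 1.0, 1.0)  # 1.0x power, 1x throughput
double_units = relative_dynamic_power(2, 1.0, 1.0)  # 2.0x power, ~2x throughput
double_freq  = relative_dynamic_power(1, 2.0, 1.3)  # ~3.4x power, ~2x throughput (if it clocks at all)

print(baseline, double_units, double_freq)
```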
 
What exactly is it btw, that keeps GPUs from reaching frequencies that CPUs have been more or less comfortable at for years, i.e. 3 GHz-ish?

IF - I know that's a big if - Nvidia managed to triple or even quadruple the hot clock relative to a common base clock (with more options, probably for mobile at reduced voltages), they could reach the performance of a 1536-ALU part (non-hot-clocked) with just 512 ALUs, and would have easier routing throughout the chip, probably less instruction-scheduling overhead, and whatnot.

Plus, if leakage and variance are still a big thing on 28nm, which apparently they were at least at the start of production, they'd have fewer transistors, leaking proportionally less.

Dynamic power would explode. Plus, you'd need much deeper pipelines to reach such clocks, which means additional stages, so more transistors, and in the end leakage wouldn't decrease by that much. In fact, you'd need a pretty high voltage too, so it might not decrease at all.
 
Essentially the same reason why neither Bobcat nor Atom reaches anywhere close to 3 GHz: it's just not power efficient. Both go for dual cores at "half the frequency" instead - and that is with silicon where single-thread performance still matters. Higher frequencies might also require custom logic (which at least Bobcat doesn't have either).
 
So, the general consensus here seems to be that it's got nothing to do with graphics-related functions, but rather with power issues.
 
Power has been the ultimate limiter for the last few generations, and it's getting worse.

Also, it's not like pushing timings closer to the bleeding edge makes a circuit more resistant to device variation and interference.
There are inflection points where the voltage and area required for a high-speed pipeline make the density and power sacrifices far higher than the actual gains, especially as you move towards the upper end of what the process tech can support.

Let's note that CPUs, with their limited core counts and aggressive turbo implementations, toe that line as a matter of routine. A few hundred MHz, or at most a few tens of percent of extra throughput, can take a chip from running at a fraction of its TDP to well over it.

The density figures for high-speed logic are not that great, and the scaling across nodes is noticeably worse than for density-optimized portions of the chip, like memory.
 
What exactly is it btw, that keeps GPUs from reaching frequencies that CPUs
GPUs have 4 times more transistors than CPUs (link), and it is easier to achieve high "whole package" utilization on a GPU than on a CPU. Take FurMark and Linpack, for example - both are high-load benchmarks, but FurMark causes much more trouble.

GPUs already have a TDP 3 times higher than CPUs while running at a frequency 3 times lower. I'm sure there are refinements that could be made, but so far it seems that the path to greater performance is rearchitecting the chip, not optimizing existing GPUs to run at higher frequencies.
 