NVIDIA Kepler speculation thread

I bet they are comparing Kepler to the old (broken) GF100, not GF110, which is 20% more power efficient. Move Fermi 20% down in the chart and Kepler won't be 3× more efficient, only 2.5×.
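A quick back-of-the-envelope check of that adjustment (the 1.2 factor is the ~20% GF100→GF110 efficiency gap mentioned above):

```python
# If NVIDIA's claimed 3x DP perf/W gain is measured against GF100,
# and GF110 is ~20% more power efficient than GF100, the gain
# relative to GF110 shrinks accordingly.
claimed_gain_vs_gf100 = 3.0
gf110_vs_gf100 = 1.2  # GF110 ~20% more efficient than GF100

gain_vs_gf110 = claimed_gain_vs_gf100 / gf110_vs_gf100
print(f"Kepler vs GF110: {gain_vs_gf110:.1f}x")  # -> 2.5x
```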

Anyway, if I understand it correctly, these promises don't mean anything. Just like with Fermi - they promised certain power consumption numbers before they had working silicon in their hands…

Undoubtedly GF100. It's important to remember that GF100 Fermi (Tesla/Quadro) was supposed to be a full 512sp part running at around 750MHz. The actual products were much less, so temper that 3× claim accordingly.

The assumption I've been going off is that their 'estimation' is relative to Tesla products. The current Tesla model is 448sp @ 575MHz. That said, I'm not saying a Fermi-like product w/ 768sp @ 1GHz in the same power envelope is necessarily the answer... AFAIK they could be comparing against something in the 352sp Quadro line.

I think it's safe to assume the die will be smaller than 529mm², hence better placed on the voltage/clock/perf-per-watt curve on 28nm than any product based on GF100 (or even GF110). There's also the RAM variable. If there's a 512-bit bus, we could be talking about something as silly as a 2GB product being compared to a 6GB one based on Fermi. Also, while all mentions of low-power 1.2V GDDR5 (5GHz, probably meant to be based on binned 7Gbps chips) seem to have been scrubbed from Samsung's website (replaced with the typical 1.35V), if such a product (or something similar) ends up existing, nVIDIA may very well use it. All of that, of course, affects something like 'DP performance per watt' without being directly related to core architecture improvements.

In other words, yeah...I don't put a whole lot of faith in that chart/quote. It could be contrived from any number of variables/guesstimates.
 
Ha, you guys really know how to hold a grudge.

So if they're sticking to their guns on 3x DP perf/w do we have any theories on how that's remotely possible without an accompanying increase in SP perf/w?
 
I think it's safe to assume the die will be smaller than 529mm², hence better placed on the voltage/clock/perf-per-watt curve on 28nm than any product based on GF100 (or even GF110). There's also the RAM variable. If there's a 512-bit bus, we could be talking about something as silly as a 2GB product being compared to a 6GB one based on Fermi. Also, while all mentions of low-power 1.2V GDDR5 (5GHz, probably meant to be based on binned 7Gbps chips) seem to have been scrubbed from Samsung's website (replaced with the typical 1.35V), if such a product (or something similar) ends up existing, nVIDIA may very well use it. All of that, of course, affects something like 'DP performance per watt' without being directly related to core architecture improvements.
Don't forget about the "tradition" of Nvidia to ask TSMC straight for the maximum rectangle size, for their flagship SKU, and stuff it up to the last square mm. ;)
 
Ha, you guys really know how to hold a grudge.

So if they're sticking to their guns on 3x DP perf/w do we have any theories on how that's remotely possible without an accompanying increase in SP perf/w?
SP perf/watt won't be increased significantly? The 40nm->28nm transition is one of the biggest jumps in GPU history; it allows doubling the transistor count while keeping the die size at the same level. I believe that Nvidia (and the same applies to AMD) won't double the count of TMUs, ROPs and memory controllers. Both companies should be able to increase the transistor budget dedicated to the unified core by a factor of ~3× while keeping their die size at the same level. It's hard to tell whether AMD will reduce the die size to ~250 mm² or try to regain the performance crown with GCN. But Nvidia is quite predictable, I think. G80, GT200, GF100 - all of them were big GPUs, so a ~500 mm² size is quite likely.
 
SP perf/watt won't be increased significantly? The 40nm->28nm transition is one of the biggest jumps in GPU history; it allows doubling the transistor count while keeping the die size at the same level. I believe that Nvidia (and the same applies to AMD) won't double the count of TMUs, ROPs and memory controllers. Both companies should be able to increase the transistor budget dedicated to the unified core by a factor of ~3× while keeping their die size at the same level. It's hard to tell whether AMD will reduce the die size to ~250 mm² or try to regain the performance crown with GCN. But Nvidia is quite predictable, I think. G80, GT200, GF100 - all of them were big GPUs, so a ~500 mm² size is quite likely.
TMUs and ROPs already take much less area than the actual ALU core, which makes any incremental area increase from rebalancing the chip much smaller.
 
The 40nm->28nm transition is one of the biggest jumps in GPU history; it allows doubling the transistor count while keeping the die size at the same level.
All full node shrinks of the past did the same. The last instance was the 55nm->40nm transition (or every transition Intel and AMD ever made [before AMD went fabless]). And TSMC's 28nm process actually does a tiny bit worse than a factor of 2 in terms of density.
And while one still gets the ~2x density for each full node shrink as in the past, one lately runs into issues with power consumption. Power per transistor doesn't scale down at the same rate as size does. In this respect HKMG offers some improvement, as the leakage of a transistor with the same performance as a 40nm one goes down significantly.
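The ideal density gain of a full-node shrink is just the square of the linear feature-size ratio; a quick illustration of the numbers being discussed (real processes, as noted above, come in slightly under the ideal):

```python
# Ideal area scaling for a process shrink: transistor area scales
# with the square of the linear feature size.
def ideal_density_gain(old_nm, new_nm):
    return (old_nm / new_nm) ** 2

print(f"40nm -> 28nm: {ideal_density_gain(40, 28):.2f}x")  # ~2.04x
print(f"55nm -> 40nm: {ideal_density_gain(55, 40):.2f}x")  # ~1.89x
```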
I believe that Nvidia (and the same applies to AMD) won't double the count of TMUs, ROPs and memory controllers.
Memory controllers barely change in size as a process shrinks, so their area stays the same anyway. TMUs are still very important as they provide bandwidth to the ALUs. A further increase of the ALU:TEX ratio would further bandwidth-starve the execution cores. GCN will stay at 4 TMUs for 64 ALUs (which are more efficient than the Cayman ones, so that is an effective decrease), basically the same rate as Fermi/GF100/110 (which nv deemed barely enough TEX capability, as they scaled it up for GF104 and lower). I wouldn't expect Kepler to go below 8 TMUs for 64 hotclock ALUs (again the same ratio as GCN).
ROPs are basically tied to L2 cache/memory controllers anyway. Here I would agree that it probably won't be scaled by a factor of 2.
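For what it's worth, the ratios above can be tallied, counting a hotclocked ALU as two base-clock ALUs (my simplifying assumption, since the hot clock runs at roughly twice the TMU clock):

```python
# Effective ALU:TEX ratio, treating hot-clocked ALUs as worth two
# base-clock ALUs (simplifying assumption for illustration only).
def alu_tex_ratio(alus, tmus, hotclock=False):
    effective_alus = alus * 2 if hotclock else alus
    return effective_alus / tmus

gcn_ratio = alu_tex_ratio(64, 4)                    # 4 TMUs per 64-ALU CU
kepler_guess = alu_tex_ratio(64, 8, hotclock=True)  # 8 TMUs per 64 hotclock ALUs
print(gcn_ratio, kepler_guess)  # 16.0 16.0 -> the same effective ratio
```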
Both companies should be able to increase the transistor budget dedicated to the unified core by a factor of ~3× while keeping their die size at the same level.
It almost appears to me that you underestimate the effort of shuffling all that data around a GPU: supplying 32 (or thereabouts) CUs/SMs with the same amount of L1 cache, the crossbars with that huge number of clients, and so on. Without simplifying a bit there (i.e. not just doubling or even tripling the unit count), those parts grow faster than linearly with the number of units. That's a reason I would expect larger SMs for Kepler (to keep their number constant or only slightly rising), and it probably also had some influence on the decision to group 4 CUs sharing an I$ and scalar D$ together in GCN.
 
Isn't that precisely why nVidia hired Dally? For his experience designing efficient on-chip networks.

Not sure if that will bear fruit on Kepler or Maxwell but it would be interesting to see what they come up with.
 
Isn't that precisely why nVidia hired Dally? For his experience designing efficient on-chip networks.

The alternative would be for his experience in bankrupt startups, and I doubt Jen-Hsun has much use for that. :smile: I'd expect Maxwell to be where the direction justifying having Dally in charge shows up; Kepler is slightly too early. Just a hunch though.
 
Nvidia rectified their slide chaos: early silicon later in 2011 and shipping Kepler products in 2012, as the ISC slide said.
http://www.xbitlabs.com/news/graphi...nies_Plans_to_Release_Kepler_GPU_in_2011.html

"Although we will have early silicon this year, Kepler-based products are actually scheduled to go into production in 2012. We wanted to clarify this so people wouldn’t expect product to be available this year"

That's a bit worse than previously thought: they don't even have silicon now? And going into production in 2012 pretty much means a launch in very late Q1 or some time in Q2.
 
So, good time to get some AMD stocks, no?

I'd say that will depend on Bulldozer more than Kepler. Still, if AMD can pull off a 6-month lead on 28nm GPUs again, they're bound to make a good bit of money on it.

Then again, if they face the same kind of volume limitations they did with 40nm, it might not be that much.
 
Do we believe Kepler to be the first NVIDIA 28nm GPU? I'd expect them to start with a >400mm² GPU aimed at both PCs and HPC for the Kepler family, and I'd also expect them to start with a <200mm² GPU for 28nm - something like the mysterious GF117 which I'd kinda expect to be a straight shrink of the GF116 except it'd only have a 128-bit memory bus (but with support for faster GDDR5). But I agree AMD's position on 28nm looks very good and if they execute on Bulldozer as well (which I'm more skeptical about to say the least) then they'll do well financially. That doesn't mean they'll be able to fully compensate for potential macroeconomic weakness though.
 
TMUs and ROPs already take much less area than the actual ALU core, which makes any incremental area increase from rebalancing the chip much smaller.
As I remember, the ALUs took less than 1/3 of RV770's die, while TMUs, ROPs, memory controllers and interfaces took 1/2. Cypress doubled the count of ALUs, TMUs and ROPs, so I wouldn't expect any significant change in those proportions. Cayman shrank the unified core by 10%, while the number of TMUs was increased by 20%. How could the ALU core take more die area than the TMUs/ROPs/MCs then?
 
As I remember, the ALUs took less than 1/3 of RV770's die, while TMUs, ROPs, memory controllers and interfaces took 1/2. Cypress doubled the count of ALUs, TMUs and ROPs, so I wouldn't expect any significant change in those proportions. Cayman shrank the unified core by 10%, while the number of TMUs was increased by 20%. How could the ALU core take more die area than the TMUs/ROPs/MCs then?
Yes, if I remember right, it was about 29% of the die for the ALUs including the register files, and ~42% for ALUs+TMUs (i.e. ~13% for the TMUs alone).

And wasn't the 10% reduction number for Cayman meant to describe the area of a single SIMD (or was it perf/area)? Remember that there were also 20% more SIMDs (so their number scaled the same as the number of TMUs, of course). Nevertheless, the ALU/TMU area ratio went down with Cayman either way.
 
The RV770 areas I have:
Code:
mm²   unit
68.3  ALUs
7.7   ALU redundancy, LDS, etc.
31.5  TUs 
11.5  RBEs ?
14.4  L2s ?
12.4  MCs ?
6.3   PCI Express
2.4   Sideport
1.4   I/O 1 - CrossFire?
2.2   I/O 2 - Dual DVI?
16.6  Display + UVD logic
36.8  DDR interfacing
53.3  control, sequencer, setup, interpolation, tessellation etc
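Summing that table gives a total close to RV770's actual die size, and fractions roughly in line with the ~29% (ALUs incl. register files) and ~42% (ALUs+TMUs) figures quoted earlier; a quick tally (the dictionary labels are mine):

```python
# Quick tally of the RV770 area breakdown above (values in mm^2).
areas = {
    "ALUs": 68.3, "ALU redundancy/LDS": 7.7, "TUs": 31.5, "RBEs": 11.5,
    "L2s": 14.4, "MCs": 12.4, "PCI Express": 6.3, "Sideport": 2.4,
    "I/O 1": 1.4, "I/O 2": 2.2, "Display+UVD": 16.6,
    "DDR interfacing": 36.8, "Control/setup/etc.": 53.3,
}
total = sum(areas.values())
alu = areas["ALUs"] + areas["ALU redundancy/LDS"]

print(f"total: {total:.1f} mm^2")                    # ~264.8 mm^2
print(f"ALUs+LDS: {alu / total:.0%}")                # ~29%
print(f"ALUs+TUs: {(alu + areas['TUs']) / total:.0%}")  # ~41%
```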
 
Thanks. But in this case it won't help us much, because we don't know whether they mean FLOPs or real-world performance.
 