NVIDIA Kepler speculation thread

[strike]Does it only apply to the GF104 style SMs with 48 ALUs or all of them?[/strike] (note to myself: refresh tabs open for an hour before you are posting).
Either way, could it be some kind of shortcut in the scheduler to buy it more time (two cycles at base clock instead of a single one) to evaluate the scoreboard?!?
 
Maybe the register file should have a bypass network, so you only spend the power that the required latency actually demands?

Need a register right now or there won't be a work group to execute? Throw it over the long wires.

Have plenty of work groups ready to go? When the memory accesses for another work group finish, you throw its working set on a switched bus to make a local copy before you start running it.
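
Just to make that scheduling idea concrete, here's a toy Python sketch of the two paths. Everything in it (the WorkGroup class, the queue, the choice of path) is invented purely for illustration, not based on any real scheduler:

```python
# Toy model of the two register paths described above. All names and
# numbers are made up for illustration; this is not any real hardware.

class WorkGroup:
    def __init__(self, wg_id, working_set):
        self.wg_id = wg_id
        self.working_set = working_set   # register indices this work group touches

def fetch_operands(ready_queue, main_rf, local_copy):
    """Pick a register path based on how much latency we can afford to hide."""
    wg = ready_queue.pop(0)
    if not ready_queue:
        # No other work group could run instead: pay for the long wires and
        # read straight out of the main register file at full power.
        return wg, {r: main_rf[r] for r in wg.working_set}
    # Plenty of other work ready: stage the working set over the (cheaper)
    # switched bus into a local copy first, hiding the extra latency.
    for r in wg.working_set:
        local_copy[r] = main_rf[r]
    return wg, {r: local_copy[r] for r in wg.working_set}

# Example: two work groups ready, so the staged/local-copy path gets used.
main_rf = {r: r * 1.0 for r in range(16)}
queue = [WorkGroup(0, [0, 1, 2]), WorkGroup(1, [3, 4])]
wg, ops = fetch_operands(queue, main_rf, {})
print(wg.wg_id, ops)
```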
 
http://www.fudzilla.com/graphics/item/23247-kepler-28nm-taped-out

Works well

The successor to Nvidia’s Fermi architecture, a certain GPU that goes by the name Kepler, has already been taped out.

We have multiple sources to confirm that the new 28nm chip is alive and that it looks quite good. Nvidia didn't have any major obstacles with the first tape out, but naturally there is still a lot of work to seal the leakage and make the chip generally better. This is nothing unusual for a new chip, and getting from 40nm to 28nm is definitely not a walk in the park.

Kepler naturally gets a lot of changes to Nvidia's GPU architecture, and you can imagine that within the same maximal TDP you can squeeze in significantly more transistors and get much more power.

The optimistic projection is to see Kepler in retail in Q4 2011, but some sources suggest that this won't actually happen this year, as TSMC needs a bit more time to make the 28nm process more mature.

Current generation GPUs are already hitting the performance wall at 40nm and the industry has a lot riding on the 28nm process, but fear not. For the moment, both AMD and Nvidia seem to be on track and we might see the first 28nm chips by the end of the year, but mass availability comes a bit later.

Yawn...wake me up when we have some actual news :rolleyes:
 
Well, he did say that Kepler has already taped-out. Of course he could be wrong or making it up, but that is technically a piece of information! :D
 
All this work for a grand total of 3.6% reduction in power consumption plus an increase in power consumption due to register cache flushes. :)

http://cva.stanford.edu/publications/2011/gebhart-isca-2011.pdf

You seem to have placed a decimal point where there wasn't one before.

36% is stated, not the 3.6% you quote.

We show that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively.

We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.
 
You seem to have placed a decimal point where there wasn't one before.

36% is stated, not the 3.6% you quote.
Yes, for the consumption of the register file alone. Somewhere else they write that the register file contributes about 10% to the total power consumption of the entire GPU (I think the example was a GTX260 or something like that). ;)
 
My bad, it was a GTX280 instead of a GTX260:
that paper said:
Prior work examining a previous generation NVIDIA GTX280 GPU (which has 64 KB of register file storage per SM), estimates that nearly 10% of total GPU power is consumed by the register file [16].
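
For anyone puzzled by the "3.6%" joke a few posts up, the arithmetic is just the two quoted figures multiplied together; a back-of-envelope, nothing more:

```python
# 36% reduction of register file energy (from the paper's results), applied
# to a register file estimated at ~10% of total GPU power (the GTX280 figure).

rf_share_of_gpu_power = 0.10   # register file ~10% of total GPU power
rf_energy_reduction   = 0.36   # 36% reduction in register file energy

whole_gpu_saving = rf_share_of_gpu_power * rf_energy_reduction
print(f"{whole_gpu_saving:.1%}")   # -> 3.6%
```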
 
http://semiaccurate.com/2011/07/05/nvidias-kepler-comes-in-to-focus/

It's the usual stuff you would expect when Charlie writes about anything nVidia related, so be warned if you decide to read the whole thing, but there are some interesting points:

SemiAccurate has heard two things about Kepler, the first is that the chip is heavily skewed towards HPC/compute at a commensurate areal cost to graphics. The other bit is power management is still nowhere near what AMD had for Evergreen/HD5xxx, much less Northern Islands/HD6xxx. Coupled with the earlier bounding boxes, we can safely say that the Kepler chips will very likely have full rate DP performance, coupled with less than a 50% increase in shader counts, and a clock in the neighborhood that Fermi is now.

That would be big news if true. And virtually abandoning the desktop graphics market?
 
I'm curious what the cost-benefit analysis would be for going full-rate DP, which in the context of current designs is the same thing as saying half-rate SP.
 
Full-rate DP just makes no sense. Not only do you need beefier ALUs, but also twice the register load and store bandwidth (so yes, that really is more like half-rate SP instead of full-rate DP).
CPUs don't do this either, for much the same reasons (well, OK, x87 did).
Quarter-rate DP makes sense because you can do it very cheaply with ALUs designed for SP (but you "waste" half the register bandwidth). Half-rate DP makes sense because you only need to extend the execution units (especially the multipliers) while the load and store paths basically remain the same. But full-rate DP is just insane - the cost should be nearly as high as a half-rate DP design with twice the number of SP flops. So even if you want to focus on DP performance, it just doesn't make sense (unless you want to focus _exclusively_ on DP performance, but why the heck would you want to do that???).
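
A rough operand-bandwidth check of that argument, assuming FMA operations (3 register reads + 1 write each), 4-byte SP and 8-byte DP operands, and a made-up 32-lane SIMD block; the numbers are illustrative only:

```python
# Register operand traffic per clock for different DP rates.
# Assumes FMA (3 reads + 1 write per op); purely illustrative lane counts.

def rf_bandwidth_per_clk(fma_per_clk, operand_bytes, reads=3, writes=1):
    return fma_per_clk * (reads + writes) * operand_bytes

sp_lanes   = 32                                      # one 32-wide SIMD block
sp         = rf_bandwidth_per_clk(sp_lanes, 4)       # SP at full rate
dp_full    = rf_bandwidth_per_clk(sp_lanes, 8)       # DP at the *same* issue rate
dp_half    = rf_bandwidth_per_clk(sp_lanes // 2, 8)  # DP at half rate
dp_quarter = rf_bandwidth_per_clk(sp_lanes // 4, 8)  # DP at quarter rate

print(sp, dp_full, dp_half, dp_quarter)
# -> 512 1024 512 256 bytes/clk: full-rate DP doubles the register traffic,
#    half-rate matches the SP design, quarter-rate leaves half of it idle.
```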
 
I guess Charlie is referring to this:

NV_roadmap_small.png


Since they are already at the 300W wall and adding more than twice the ALUs is rather unlikely, if they want to keep up with this roadmap they have to do full rate DP. :???:

Besides, GCN will be half-rate DP, correct? And the SP rate is estimated at over 3.2 TFLOPs, right? If they want to compete with that in raw power, there might be no other choice...
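
To put some very rough, mostly assumed numbers on that: the sketch below assumes a GF110-like baseline of 512 ALUs at a ~1.5 GHz hot clock, FMA counted as 2 flops, and the ~3.2 TFLOPs SP / half-rate DP figure floated for GCN. Treat every value as a guess, not a spec:

```python
# Back-of-envelope DP throughput under a few scenarios. All inputs are
# assumptions for the sake of the argument, not leaked specs.

def tflops(alus, clock_ghz, dp_ratio):
    sp = alus * 2 * clock_ghz / 1000.0   # TFLOPs SP, counting FMA as 2 flops
    return round(sp, 2), round(sp * dp_ratio, 2)   # (SP, DP)

print(tflops(512, 1.5, 0.5))    # Fermi-like baseline, half-rate DP -> (1.54, 0.77)
print(tflops(768, 1.5, 0.5))    # +50% ALUs, half-rate DP           -> (2.3, 1.15)
print(tflops(768, 1.5, 1.0))    # +50% ALUs, full-rate DP           -> (2.3, 2.3)
print(tflops(1600, 1.0, 0.5))   # GCN-style guess, ~3.2 TFLOPs SP   -> (3.2, 1.6)
```

Under those guesses, only the full-rate DP case gets past the half-rate GCN estimate without more than doubling the ALU count.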
 
[Regarding GCN, I think the DP ratio is configurable]

I think I remember Dally making some comments a while back to the effect that a 0.5 SP/DP ratio was too high, but it's been a while.

Still, going with the rumor for a second - a full-rate DP design would let you cut your peak issue rate in half (relative to a design with the same DP rate but 2x the SP rate). Maybe you'd need fewer register file ports as a result? I'm really not a hardware guy, so I'd be interested to hear what could be simplified by such a design.

Or could it be something along these lines?
- You're power limited, not transistor limited
- Within the power budget, you can do X SP flops with graphics FF hardware at typical gaming utilization levels
- When the FF hardware is completely idle (HPC), your power budget also covers X DP flops. If you've got transistors to burn, maybe it makes sense to include the ALUs for full-rate DP, even if they're powered down most of the time.
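
Just to illustrate that last bullet with toy numbers (every wattage figure below is invented; the 300W is only the board limit mentioned earlier in the thread):

```python
# Toy power-budget split for the power-limited-vs-transistor-limited idea.
# All watt figures are invented for illustration.

board_power = 300.0   # W, the wall mentioned earlier in the thread
ff_gaming   = 90.0    # assumed fixed-function/raster share while gaming
ff_hpc      = 10.0    # assumed FF idle/leakage power in a compute workload

alu_budget_gaming = board_power - ff_gaming   # what SP ALUs may burn in games
alu_budget_hpc    = board_power - ff_hpc      # what DP ALUs may burn in HPC

print(alu_budget_gaming, alu_budget_hpc)      # 210.0 290.0
# The freed FF budget is headroom that otherwise-dark full-rate DP ALUs
# could spend when graphics is idle, if the transistors are already there.
```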
 
The only thing I am buying from that article is that Kepler taped out 4 months later.

Full rate DP, only 1.5x more ALUs and the like are on ignore.

EDIT:
These last two paragraphs are starting to lay the foundation for finger pointing at TSMC long before the chip has been shown off publicly. You really have to ask yourself why that is necessary six months or more before launch?
This is worth pondering over.
 