It seems to me that Nvidia has been extremely conservative for the Kepler generation in pushing the thermal envelope. Much more so than for Fermi, which must have had a less advanced power management system. There must be plenty of leeway to kick things up a notch?
The turbo functionality introduced with Kepler is where at least some of the leeway was exploited.
I understand that AMD can regulate the power supply within microseconds, and that's impressive, but I have no idea what the practical consequences of this are. Temperature ramps are typically measured in seconds, not microseconds.
The thermal diodes that feed into typical monitoring software measure ramps on the order of significant fractions of a second. That value is smoothed by the thermal capacity of the heat sink, the imprecision of the diodes, and conduction across the GPU over a fairly long period of time.
The instantaneous power draw and localized heating for spots in the logic can go up and down more quickly, particularly if other elements like fan speed ramp or cooler quality are constrained.
I have a hard time seeing how you can make a major difference by regulating a cause three orders of magnitude faster than the effect. How do you quantify this? Can you dial up the clock by 50 MHz on average?
This may not be a particularly close proxy, given its age, but for a modeled CPU from an era of lower power density, there are localized temperature ramps that happen in hundredths of a second. One example shows a 5-10 degree ramp that is essentially over in 0.1 seconds.
http://www.irisa.fr/caps/people/fetis/hs.pdf
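The filtering effect being discussed can be sketched with a first-order RC thermal model of a die hotspot. The numbers below (thermal resistance, capacitance, power levels) are illustrative assumptions, not measurements of any real chip; the point is just that a millisecond-scale power spike produces a much smaller, slower temperature bump when the hotspot's thermal time constant is longer than the spike.

```python
def simulate(power_trace, r_th=0.5, c_th=0.02, dt=1e-4, t_amb=60.0):
    """Euler-integrate dT/dt = (P*r_th - (T - t_amb)) / tau for a hotspot.

    r_th in K/W, c_th in J/K, dt in seconds. All values are assumed,
    illustrative constants; tau works out to 10 ms here.
    """
    tau = r_th * c_th
    t = t_amb + power_trace[0] * r_th  # start at steady state for baseline power
    temps = []
    for p in power_trace:
        t += dt * (p * r_th - (t - t_amb)) / tau
        temps.append(t)
    return temps

# 20 W baseline (steady at 70 C) with a 1 ms spike to 60 W at t = 5 ms,
# sampled at 0.1 ms steps over 100 ms total.
trace = [20.0] * 50 + [60.0] * 10 + [20.0] * 940
temps = simulate(trace)
print(f"spike bump: {max(temps) - temps[0]:.2f} C")
```

With these assumed constants, tripling the power for 1 ms raises the hotspot only a couple of degrees, because the spike is a tenth of the thermal time constant; the electrical transient is far faster than the thermal effect it causes.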
Guard-banding at 85C would be able to absorb such spikes and buy enough time for the sensors and controllers to react.
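A back-of-envelope check of that claim, using the ramp figure from the linked paper and assumed controller rates (the control periods are hypothetical, not vendor specs): with 10 C of headroom and a worst-case ramp of roughly 100 C/s, even a millisecond-scale control loop gets many intervals in which to react before the limit is reached.

```python
# All figures below are assumptions for illustration, except the
# ~5-10 C over 0.1 s ramp, which comes from the linked paper.
limit_c = 95.0                     # assumed hard thermal limit
guard_band_c = 85.0                # sustained target below the limit
worst_ramp_c_per_s = 10.0 / 0.1    # worst-case ramp: ~10 C in 0.1 s

headroom_s = (limit_c - guard_band_c) / worst_ramp_c_per_s  # time to react

for name, period_s in [("ms-scale controller", 1e-3),
                       ("us-scale controller", 1e-5)]:
    intervals = headroom_s / period_s
    print(f"{name}: {intervals:.0f} control intervals before the limit")
```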
The probability of transients pushing parts of the chip past safety limits, or the impact of that temperature on reliability are other unknowns.
The 290's fixed temp range is in a band that most others try to avoid.
Possibly the chip and substrate have been modified to better handle long periods in that range, or the controller's ability to keep the temperature very constant allows fewer trips across transition temperatures for the package and underfill. These are temperatures where bumpgate and the RROD have come up before, so AMD would have been very aggressive about modeling it.
For all the smarts in the power management, it still puzzles me that the 290 is an outlier compared not only to Nvidia but also to other AMD chips from a pure power consumption point of view.
The chip is still 40% bigger than a design that already was more than capable of hitting 300W.
Power management smarts can only go so far in taking what seems to be a less efficient architecture past a competitor.
The choice of 95C may have come as a tradeoff between reliability, binning, and performance. It seems pretty likely that power efficiency was hurt by this, since the chip will be operating on a sustained basis at high temperatures, and the thermal component of power draw is not linear.
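That nonlinearity can be illustrated with a toy leakage model. Subthreshold leakage grows roughly exponentially with junction temperature; the constants here (baseline leakage, doubling interval) are purely hypothetical and not measurements of Hawaii or any other chip, but they show why sustaining 95C rather than a cooler target can cost a meaningful chunk of the power budget.

```python
def leakage_w(temp_c, leak_at_25c_w=20.0, doubling_c=25.0):
    """Toy model: leakage power doubles every `doubling_c` degrees.

    Both parameters are illustrative assumptions, chosen only to show
    the shape of the curve, not to match any real GPU.
    """
    return leak_at_25c_w * 2 ** ((temp_c - 25.0) / doubling_c)

for t in (65, 80, 95):
    print(f"{t} C: ~{leakage_w(t):.0f} W leakage")
```

Under these assumptions, moving the sustained operating point from 65C to 95C more than doubles leakage, which in turn adds heat of its own; that feedback is part of why running hot on a sustained basis is costly for efficiency.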