NVIDIA Kepler speculation thread

If you need to guarantee that all 7970s run at 925MHz in all games (deterministic) then of course you have to be conservative. nVidia doesn't guarantee that any 680 will run at 1110MHz in any game on a given day of the week (non-deterministic).
If all 680s can boost up to 1100MHz for at least some portion of time, Nvidia must qualify the parts to run at that speed. It seems the only thing they're not guaranteeing is how long/often each part can maintain peak boost.

B3D should do an interview about this feature to clear up misconceptions.
 
I was more curious about the end result. If I understand correctly, PowerTune runs off a static table of mappings between utilization and clocks. I'm finding it hard to grasp how that's "more advanced" than real-time power and temperature monitoring.

Doesn't NV's approach have a higher response latency? What you probably want is a mix of the two -- monitoring of power draw and temp to indicate how the particular chip responds, and then internal utilization counters for faster response times. BWDIK
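Something like that mix could, very roughly, look like the sketch below: a fast loop driven purely by digital activity counters, plus a slow loop that trims the estimate against measured board power. Everything here (names, weights, constants) is invented for illustration, not how either vendor actually implements it.

Code:
# Hypothetical sketch of a hybrid scheme: a fast loop driven by digital
# activity counters, plus a slow loop that trims the estimate against
# measured board power. All names, weights and constants are invented.

TDP_W = 250.0                                           # power budget (made up)
counter_weights = {"alu": 0.9, "tex": 0.6, "mem": 0.4}  # characterized offline
scale = 1.0                                             # per-chip correction factor

def estimated_power(counters):
    """Fast path: linear power estimate straight from activity counters."""
    return sum(counter_weights[name] * value for name, value in counters.items())

def fast_loop(counters, clock_mhz):
    """Runs very often; only needs digital inputs, so it reacts quickly."""
    power = scale * estimated_power(counters)
    if power > TDP_W:
        clock_mhz -= 10          # fine-grained clock step down
    elif power < 0.9 * TDP_W:
        clock_mhz += 10
    return clock_mhz

def slow_loop(measured_power_w, counters):
    """Runs every ~100 ms; nudges the model toward what the sensors report."""
    global scale
    estimate = estimated_power(counters)
    if estimate > 0:
        scale = 0.9 * scale + 0.1 * (measured_power_w / estimate)

print(fast_loop({"alu": 250.0, "tex": 50.0, "mem": 40.0}, clock_mhz=1000))  # 990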
 
With GF104's dual-issue there was still intra-warp register dependency tracking in hardware. I was referring to the new compiler.
That's not the impression I got from the press coverage. Link?

Intra warp register dependence should be in the compiler, I think.
 
Doesn't NV's approach have a higher response latency? What you probably want is a mix of the two -- monitoring of power draw and temp to indicate how the particular chip responds, and then internal utilization counters for faster response times. BWDIK

Yes, 100ms (6 frames of latency at 60fps). Now is that too slow given the use case? Doubtful.
 
If you need to guarantee that all 7970s run at 925MHz in all games (deterministic) then of course you have to be conservative. nVidia doesn't guarantee that any 680 will run at 1110MHz in any game on a given day of the week (non-deterministic).
Determinism here is talking about keeping the performance and behaviours consistent across the range of chips for any given SKU variant. As 3dcgi says, you still have to actually qualify a product to run at a given speed that it may hit within the power scheme - there may be cases where the input power is so low the Boost scheme says "Ooo, I can run at 3GHz baby!!!" and obviously that isn't going to work; there is a cap on max clock the chip can run and a cap on the max voltage (which is probably going to be the reliability voltage for the process node).

I'm still not getting how "inferred power consumption based on chip utilization" is a whizzbang advanced approach. Sitting on the outside it simply looks like "guessing" versus the direct power consumption readings nVidia is doing. I understand the differences just not getting the "more advanced" part....
For starters, it is highly programmable, so we can program it to be deterministic or non-deterministic on a per device ID basis if need be; additionally, updates to the entire thing are just delivered via Catalyst. Secondly, we have per-MHz control over the clock and a fast sampling frequency. A fast sampling frequency, over time, is going to be less prone to error than external power sampling because of the lag of the feedback loop - NVIDIA have already talked about the guardband they've had to put on.
 
That's not the impression I got from the press coverage. Link?

Intra warp register dependence should be in the compiler, I think.

GF114, owing to its heritage as a compute GPU, had a rather complex scheduler. Fermi GPUs not only did basic scheduling in hardware such as register scoreboarding (keeping track of warps waiting on memory accesses and other long latency operations) and choosing the next warp from the pool to execute, but Fermi was also responsible for scheduling instructions within the warps themselves. While hardware scheduling of this nature is not difficult, it is relatively expensive on both a power and area efficiency basis as it requires implementing a complex hardware block to do dependency checking and prevent other types of data hazards.

http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3
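
For readers less familiar with the scoreboarding the article refers to, here's a toy software model of what that hardware dependency check does: track which registers have results in flight and hold back any instruction whose sources aren't ready yet. Purely illustrative, not a description of Fermi's actual logic; the debate above is about how much of this Kepler pushes into the compiler.

Code:
# Toy model of register scoreboarding: before issuing an instruction,
# check that none of its source registers has a pending long-latency write.
# Purely illustrative; not a description of any real GPU's hardware.

pending_writes = {}   # dest register -> cycle at which the result is ready

def can_issue(instr, cycle):
    """An instruction can issue only if all its source registers are ready."""
    return all(pending_writes.get(src, 0) <= cycle for src in instr["srcs"])

def issue(instr, cycle, latency):
    """Record when the destination register becomes valid."""
    pending_writes[instr["dst"]] = cycle + latency

# Example: a load into r1 followed by an add that consumes r1.
load = {"dst": "r1", "srcs": ["r0"]}
add  = {"dst": "r2", "srcs": ["r1", "r3"]}

issue(load, cycle=0, latency=400)        # memory access, long latency
print(can_issue(add, cycle=1))           # False: r1 not ready, warp stalls
print(can_issue(add, cycle=400))         # True: result has arrived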
 
Determinism here is talking about keeping the performance and behaviours consistent across the range of chips for any given SKU variant. As 3dcgi says, you still have to actually qualify a product to run at a given speed that it may hit within the power scheme - there may be cases where the input power is so low the Boost scheme says "Ooo, I can run at 3GHz baby!!!" and obviously that isn't going to work; there is a cap on max clock the chip can run and a cap on the max voltage (which is probably going to be the reliability voltage for the process node).

nVidia might internally certify all 680s to run at 1110MHz but they certainly aren't guaranteeing that speed for ANY use case. AMD is promising 925MHz in pretty much all games. Big difference there.

For starters, it is highly programmable, so we can program it to be deterministic or non-deterministic on a per device ID basis if need be; additionally, updates to the entire thing are just delivered via Catalyst. Secondly, we have per-MHz control over the clock and a fast sampling frequency. A fast sampling frequency, over time, is going to be less prone to error than external power sampling because of the lag of the feedback loop - NVIDIA have already talked about the guardband they've had to put on.

Fair enough. While faster, the PowerTune algorithm is making an educated guess and is therefore more prone to error than empirical readings. As a consumer, how do I know which approach is more accurate or effective in the end?
 
While faster, the PowerTune algorithm is making an educated guess and is therefore more prone to error than empirical readings.

Is it? How good are the sensors, how often are they taking readings?

It's quite possible you're correct, but not certain by any stretch. I've run into plenty of 'bad' empirical data in my life.
 
That seems more like GCN's scheduling than anything else to me.

If they have kept the register file bandwidth for 128 ALUs around (no superscalar, which seems to be the case as it's not dual issue, more like 1.5 issue) then they would need in-pipe registers. With that, in all probability, scheduling and dependency tracking is in the compiler.

You've completely lost me. I was referring to hardware dependency tracking in GF104/114. Are you referring to GK104?

Also, what is 1.5x issue, are you referring to Techreport's report of 4x 16-wide + 4x 32-wide SIMDs per SMX? They're the only site reporting that craziness. All others are reporting 6x 32-wide SIMDs. No idea where they got that from.
 
Is it? How good are the sensors, how often are they taking readings?

It's quite possible you're correct, but not certain by any stretch. I've run into plenty of 'bad' empirical data in my life.

Any process based on inferences (as Dave confirmed it is) will be prone to estimation error. The only way you can get bad empirical data is to have faulty readings at the source. Since both nVidia's and AMD's approaches depend on sensors to provide input data they are subject to the same risks. However, AMD goes a step further and translates that sensor reading (utilization) into another metric (power consumption) and that's where the additional source of error is introduced.
 
You've completely lost me. I was referring to hardware dependency tracking in GF104/114. Are you referring to GK104?
Yes.

Also, what is 1.5x issue, are you referring to Techreport's report of 4x 16-wide + 4x 32-wide SIMDs per SMX? They're the only site reporting that craziness. All others are reporting 6x 32-wide SIMDs. No idea where they got that from.

That doesn't matter. There are 4 register file banks. Single issue would be satisfied by 128 ALUs (4 warps fed per clock, one from each reg file bank). For dual issue, you would need 256 ALUs. What do you think 192 ALUs is?
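
Spelling out that arithmetic (assuming 4 banks, each able to feed one 32-wide warp per clock, as argued here):

Code:
# Back-of-the-envelope issue-width arithmetic, assuming 4 register file
# banks, each able to feed one 32-wide warp per clock.
banks = 4
warp_width = 32

single_issue_alus = banks * warp_width        # 128 ALUs kept fully fed
dual_issue_alus = 2 * banks * warp_width      # 256 ALUs would be needed

print(single_issue_alus, dual_issue_alus)     # 128 256
print(192 / single_issue_alus)                # 1.5 -> hence "1.5 issue"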
 
No. You're assuming the sensors are more accurate than the algorithm; this may or may not be true.

Not sure what you mean. The inputs to the PT algorithm come from sensors as well.

PowerTune: Utilization readings + algorithm -> derived power consumption.
GPU Boost: Current readings -> measured power consumption.
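
Or, sketched out (everything below is invented for illustration, not either vendor's actual implementation):

Code:
# Rough illustration of the two input paths above; names and formulas
# are made up purely to show the distinction.

def powertune_power(counters, weights):
    """Derived: digital activity counters fed through a pre-characterized model."""
    return sum(weights[name] * counters[name] for name in counters)

def gpu_boost_power(rail_current_a, rail_voltage_v):
    """Measured: sensed rail current times voltage, no model in between."""
    return rail_current_a * rail_voltage_v

print(powertune_power({"alu": 0.8, "tex": 0.3}, {"alu": 120.0, "tex": 60.0}))  # 114.0 (estimate)
print(gpu_boost_power(14.0, 12.0))                                             # 168.0 (measurement)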
 
There are 4 register file banks.

Source?

Single issue would be satisfied by 128 ALUs ( 4 warps fed per clock, one from each reg file bank). For dual issue, you would need 256 ALUs. What do you think 192 ALUs is?

Instruction issue is done at the scheduler/dispatcher level. There's no requirement that all four schedulers dual-issue to the SIMDs every clock for the arch to be considered dual-issue. There can be SFU, L/S ops intermixed there too.
 
trinibwoy said:
Any process based on inferences (as Dave confirmed it is) will be prone to estimation error. The only way you can get bad empirical data is to have faulty readings at the source. Since both nVidia's and AMD's approaches depend on sensors to provide input data they are subject to the same risks. However, AMD goes a step further and translates that sensor reading (utilization) into another metric (power consumption) and that's where the additional source of error is introduced.
AMD doesn't depend on analog sensors (or at least not primarily, though I'm sure they have that too). It depends on activity counters.

The granularity of that is unknown, of course, but the general principle is one of building a linear model that tries to estimate power as well as possible with a given set of digital inputs.

Those inputs would be things like: number of active ALU cycles vs the total number of clock cycles, the number of tex operations, the number of pipeline stalls etc.

In practice, some activity counters will be better power predictors than others, so you try to build a model that assigns different weights to each counter and check which weighting correlates best with measured power across a variety of workloads. If you do this right, your model should be a pretty good estimate of normalized power, which you then correct for various die-specific silicon parameters.

So instead of analog measurement, you calculate it digitally. That's the deterministic part of it. In theory, you can calculate this every cycle. That's where the higher sampling rate comes in. However, I suspect that most implementations will do this with a small microcontroller on die, for algorithmic flexibility. There's no need to do this every cycle after all.

A higher sample rate will allow one to react quicker when things really go wrong (as in: power virus), and use tighter guard bands, though it looks like AMD has taken full advantage of that for the 7970.
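
To make the model-building step concrete, here's a toy calibration in Python - ordinary least squares fitting one weight per counter against lab power measurements over a few workloads. All numbers are invented; AMD's actual counters, weights and fitting procedure are obviously not public.

Code:
import numpy as np

# Toy calibration of an activity-counter power model (all numbers invented).
# Rows are calibration workloads; columns are normalized counter readings
# (alu, tex, mem) averaged over each run.
counters = np.array([
    [0.9, 0.2, 0.3],   # ALU-heavy workload
    [0.3, 0.8, 0.5],   # texture-heavy workload
    [0.2, 0.1, 0.9],   # bandwidth-heavy workload
    [0.6, 0.5, 0.5],   # mixed workload
])
measured_power_w = np.array([180.0, 160.0, 150.0, 170.0])  # lab measurements

# Least-squares fit of one weight per counter.
weights, *_ = np.linalg.lstsq(counters, measured_power_w, rcond=None)

def estimate_power(sample):
    """Runtime estimate: a weighted sum of counters, no analog sensing needed."""
    return float(np.dot(weights, sample))

print(estimate_power([0.7, 0.4, 0.4]))   # estimate for a new, unseen workload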
 
While faster, the PowerTune algorithm is making an educated guess and is therefore more prone to error than empirical readings.
In this scenario faster is better. While an input reading will give an "actual" power reading, it is only for that point in time - by the time the GPU / software has reacted the power may be doing something very different, hence the large guardband NVIDIA have put on.
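
To put a number on why loop latency forces a guardband: if power can swing by some amount per millisecond between the moment it is sampled and the moment the clocks actually respond, you need at least that much headroom. The figures below are completely made up, purely to show the scaling:

Code:
# Illustrative guardband arithmetic; every number here is made up.
# If power can swing this fast between the moment it is sampled and the
# moment the clocks actually change, the target must be padded by roughly
# slew_rate * loop_latency.

slew_rate_w_per_ms = 0.5       # how quickly board power can change
slow_loop_ms = 100.0           # external sensing + driver reaction time
fast_loop_ms = 1.0             # on-die counters sampled very frequently

print(slew_rate_w_per_ms * slow_loop_ms)   # 50.0 W of headroom needed
print(slew_rate_w_per_ms * fast_loop_ms)   # 0.5 W of headroom needed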
 
AMD doesn't depend on analog sensors (or at least not primarily, though I'm sure they have that too). It depends on activity counters.

Ok, I'll buy that the input data is solid but that wasn't my primary concern.

In practice, some activity counters will be better power predictors than others, so you try to build a model that assigns different weights to each counter and check which weighting correlates best with measured power across a variety of workloads. If you do this right, your model should be a pretty good estimate of normalized power, which you then correct for various die-specific silicon parameters.

Yup, it's the "if you do this right" part I was referring to with respect to accuracy of the model. Of course, we should assume that AMD knows what they're doing and the model comes close enough to reality for the use case.

In this scenario faster is better. While an input reading will give an "actual" power reading, it is only for that point in time - by the time the GPU / software has reacted the power may be doing something very different, hence the large guardband NVIDIA have put on.

Understood. Hence my original question on whether "more advanced" was in reference to update frequency.
 
I'm still not getting how "inferred power consumption based on chip utilization" is a whizzbang advanced approach. Sitting on the outside it simply looks like "guessing" versus the direct power consumption readings nVidia is doing. I understand the differences just not getting the "more advanced" part....
Easy example:
your CPU fan can be voltage-controlled or RPM-controlled. With the first (non-deterministic) approach you only roughly know that ramping up the voltage gets you somewhere near the desired speed. With the latter, you know exactly which RPM you are aiming for and which RPM you are actually reaching.
Complex example:
CPUs can OC themselves by using activity counters - that is, they 'measure' the activity of each 'block' of the CPU and, if the counters report low utilization, the CPU can 'turn them off' in order to gain speed. SB processors do this even with their internal FPU to save power - the first few thousand FPU ops you issue are 'slower' than later ones, since the FPU needs to ramp up.

Supposedly, the use of activity counters might mean they're in the process of adding a turbo-boost-like feature (disable rarely used CUs to OC the remaining ones), since GCN seems to scale better with MHz than with CU count.
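
Going back to the fan analogy, a minimal sketch of the difference: open-loop voltage control just sets an output and hopes, while closed-loop RPM control measures the result and corrects toward a target. The fan model and constants are made up:

Code:
# Minimal sketch of the fan analogy: open-loop voltage control vs
# closed-loop RPM control. The fan model and constants are made up.

def fan_rpm(voltage):
    """The physical fan: RPM per volt varies from unit to unit."""
    return voltage * 300

# Open loop (voltage control): pick a voltage and hope the speed is right.
print(fan_rpm(4.5))                # whatever RPM that happens to be (1350)

# Closed loop (RPM control): measure the speed, correct toward a target.
target_rpm, voltage = 1200, 4.5
for _ in range(50):
    error = target_rpm - fan_rpm(voltage)
    voltage += 0.001 * error       # simple proportional correction
print(round(fan_rpm(voltage)))     # ~1200, regardless of this fan's gain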
 
A higher sample rate will allow one to react quicker when things really go wrong (as in: power virus), and use tighter guard bands, though it looks like AMD has taken full advantage of that for the 7970.
AMD certainly has a way more accurate algorithm (or more sensibly placed sensors) in the 7970 than in the 6970. From what I've seen in my tests, there are actually only very few cases where I can make our reference 7970 throttle, and none of them occurred in an actual gaming environment so far (though not all of them are limited to the falsely-so-called power viruses [in fact, calling Furmark or OCCT a power virus seems to have been made up by marketing just to scare people away from them, since they miss one very important criterion for qualifying as a power virus: they stay within the user's control at all times - as Wikipedia as of today says, "Stability Test applications are similar programs which have the same effect as power viruses (high CPU usage) but stay under the user's control."]).

;)

In this scenario faster is better. While an input reading will give an "actual" power reading, it is only for that point in time - by the time the GPU / software has reacted the power may be doing something very different, hence the large guardband NVIDIA have put on.

So, I am not sure if AMD discloses this: do your activity counters use actual measured activity, or do they use an extrapolation from analyzing the code that is about to be executed in a future cycle?
 