NVIDIA GF100 & Friends speculation

x86 has the concept of per-core power planes with independent voltage control.
I don't know of any chip which actually implements that, though of course if you have multi-socket systems then voltage is indeed independent. All they have is a separate power plane for north bridge, but all cores still share a single power plane.
I suspect it just adds too much complexity (== cost) to the board.
 
In a non-distributed architecture, without GPCs and without TMUs attached to the SMs, there would be just a single TMU structure, right?
No. Most GPUs have the TMUs close to the ALUs they serve. Texture data is mostly very highly compressed. After filtering, it has expanded enormously. So it makes sense to have the texture data travel in compressed form as far as possible - to get texture data as close to the ALUs as possible while using the least bandwidth.

After that, the issue is that the ALU:TEX ratio in shaders implies a relatively low TMU throughput, hence a degree of TMU sharing amongst ALUs.

Xenos is the closest to your proposal, theoretically.
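
As a back-of-envelope sketch of both points (the DXT1 and RGBA8 sizes are real format figures, but the ALU:TEX ratio and cluster width are just assumed round numbers, not measurements of any real chip):

# Why texture data should stay compressed for as long as possible,
# and why a cluster of ALUs can share a small number of TMUs.
# Illustrative assumptions only.

DXT1_BITS_PER_TEXEL = 4       # block-compressed source data
RGBA8_BITS_PER_TEXEL = 32     # after decompression, ready for filtering
TAPS_PER_BILINEAR_FETCH = 4   # texels read per bilinear sample

compressed_bits = TAPS_PER_BILINEAR_FETCH * DXT1_BITS_PER_TEXEL
decompressed_bits = TAPS_PER_BILINEAR_FETCH * RGBA8_BITS_PER_TEXEL
print(f"expansion after decompression: {decompressed_bits / compressed_bits:.0f}x")

# If a shader averages, say, 10 ALU ops per texture fetch, a cluster of
# 32 ALU lanes only needs roughly 3 TMU lanes to stay balanced.
ALU_OPS_PER_FETCH = 10        # assumed ALU:TEX ratio
ALU_LANES_PER_CLUSTER = 32    # assumed cluster width
print(f"TMU lanes needed: ~{ALU_LANES_PER_CLUSTER / ALU_OPS_PER_FETCH:.1f}")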
 
I don't know of any chip which actually implements that, though of course if you have multi-socket systems then voltage is indeed independent. All they have is a separate power plane for north bridge, but all cores still share a single power plane.
I suspect it just adds too much complexity (== cost) to the board.
Whoops I wrongly attributed this to power planes - the technology in Nehalem is actually power gating:

http://www.anandtech.com/show/2594/12

which is much easier to implement. Per-core voltage control would be very difficult for NVidia to implement, so I shouldn't really have brought it up as a factor.
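
To illustrate the distinction, here's a toy model using the standard P ~ a*C*V^2*f dynamic-power relation plus a crude leakage term. Every constant is made up purely for illustration - the point is just that gating an idle core removes its leakage even on a shared voltage plane, while per-core voltage planes would additionally let a lightly loaded core run at lower V and f:

# Toy model: power gating vs per-core voltage planes.
# P_dynamic ~ a*C*V^2*f, P_leak ~ k*V (very rough); all constants invented.

def core_power(v, f, active, gated=False):
    a_c = 1.0        # activity * capacitance, arbitrary units
    k_leak = 0.3     # leakage coefficient, arbitrary units
    if gated:
        return 0.0                      # power gate: core cut off, ~no leakage
    dyn = a_c * v**2 * f if active else 0.0
    return dyn + k_leak * v             # idle but ungated core still leaks

v_shared = 1.1                          # single Vcc plane, Nehalem-style
busy = core_power(v_shared, f=3.0, active=True)
idle = core_power(v_shared, f=0.0, active=False)               # still leaks
gated = core_power(v_shared, f=0.0, active=False, gated=True)  # ~0 W

# Hypothetical per-core voltage plane: a lightly loaded core drops V and f.
light = core_power(0.9, f=1.5, active=True)

print(f"busy {busy:.2f}  idle {idle:.2f}  gated {gated:.2f}  low-V/low-f {light:.2f}")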

Jawed
 
No. Most GPUs have the TMUs close to the ALUs they serve. Texture data is mostly very highly compressed. After filtering, it has expanded enormously. So it makes sense to have the texture data travel in compressed form as far as possible - to get texture data as close to the ALUs as possible while using the least bandwidth.

After that, the issue is that the ALU:TEX ratio in shaders implies a relatively low TMU throughput, hence a degree of TMU sharing amongst ALUs.

Xenos is the closest to your proposal, theoretically.

Thanks. And what about power consumption? Wouldn't making it more the GF100 way have an effect on power consumption?
 
Bandwidth costs power.

What's notable about Fermi architecture is that supposedly a variety of local buffers have been removed in favour of buffering through the L1/L2 cache hierarchy. This is a generalisation applied to the producer-consumer nature of a pipelined parallelising throughput processor. This generalisation allows greater flexibility in workloads. e.g. it should lead to higher minimum frame-rates (buffers shouldn't run out of space halting the producer that feeds into them). But it means that some kinds of data are moving further in this architecture than in prior GPUs.
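
A toy producer-consumer simulation makes the buffering point concrete. The rates, burst pattern and buffer sizes below are invented; it's only meant to show that a small fixed FIFO stalls a bursty producer far more often than a large (cache-backed) buffer would:

import random

def simulate(buffer_capacity, cycles=10_000, seed=0):
    # Bursty producer feeding a steady consumer through a bounded buffer.
    rng = random.Random(seed)
    occupancy, stalls = 0, 0
    for _ in range(cycles):
        produced = rng.choice([0, 0, 1, 3])   # bursty: averages 1 item/cycle
        occupancy = max(0, occupancy - 1)     # consumer: 1 item/cycle
        if occupancy + produced > buffer_capacity:
            stalls += 1                       # producer has to wait this cycle
            produced = buffer_capacity - occupancy
        occupancy += produced
    return stalls / cycles

for cap in (4, 16, 256):
    print(f"buffer capacity {cap:4d}: producer stalled {simulate(cap):.1%} of cycles")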

Parallel setup across 4 GPCs in GF100 is one task that creates a power overhead, because there's 2-way synchronisation of workload in order to allow setup to function in a distributed fashion.

Overall, though, NVidia's architecture is not particularly power-efficient. Though problems with older chips (reliance upon a surfeit of TMUs and ROPs that never deliver their rated throughputs) have been tackled quite effectively. Fermi's current TMUs are doing very well as far as gaming is concerned, even if absolute throughput isn't exciting.

e.g. the memory controllers are placed at the centre of the die, as far as possible from the I/O pads that communicate with memory. In ATI the memory controllers are placed as close as possible to the pads (admittedly we've only seen RV770's die). It seems to me NVidia does this because the ALUs, consuming lots of power, need to be spread out across as much area as possible, to minimise localised heating problems. The very high data rates that GDDR5 pins run at mean that the long distance from MCs to pads must use up a fair amount of power.

This layout problem also exists in G80 and GT200.
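
For a rough sense of scale, here's the kind of estimate you could do if you assume an on-chip wire energy of ~0.1 pJ per bit per mm and ~150 GB/s of memory traffic over an extra ~10 mm of routing - all three numbers are ballpark assumptions, not GF100 figures, so treat the result as an order of magnitude only:

# Rough estimate of on-die interconnect power for memory traffic.
# All three inputs are ballpark assumptions, not measured GF100 figures.

WIRE_ENERGY_PJ_PER_BIT_MM = 0.1   # assumed on-chip wire energy
EXTRA_DISTANCE_MM = 10.0          # assumed extra MC-to-pad routing
BANDWIDTH_GB_S = 150.0            # assumed sustained memory traffic

bits_per_second = BANDWIDTH_GB_S * 8e9
energy_per_bit_j = WIRE_ENERGY_PJ_PER_BIT_MM * EXTRA_DISTANCE_MM * 1e-12
watts = bits_per_second * energy_per_bit_j
print(f"~{watts:.1f} W for {BANDWIDTH_GB_S:.0f} GB/s over an extra "
      f"{EXTRA_DISTANCE_MM:.0f} mm of wire")

With these assumptions it comes out to only a watt or two, which is exactly why it's hard to say how much this layout choice really costs without better numbers.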

Scalar ALU processing with fine-grained scheduling of individual instructions - a feature from G80 onwards - means that NVidia has relatively high power overheads, too. The VLIW approach in ATI saves power and area, but at the cost of increased compilation complexity - though not all compilation problems in ATI are centred upon VLIW usage.

So solely attributing GF100's problems to Fermi architecture is tricky. And the 40nm fuck-up plays a part too, handicapping bigger chips, simply because the process is naturally leaky and bigger chips/higher temperatures tend to act as positive feedback.
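
The positive-feedback point can be sketched with a toy fixed-point iteration: leakage grows roughly exponentially with temperature, the extra leakage adds to total power, and total power raises die temperature through the cooler's thermal resistance. Every constant here is invented for illustration:

# Toy model of the leakage/temperature feedback loop on a leaky process.
# All constants are invented for illustration.

P_DYNAMIC_W = 180.0        # assumed dynamic power under load
LEAK_AT_60C_W = 40.0       # assumed leakage at 60 C
LEAK_DOUBLES_EVERY_C = 25  # assume leakage roughly doubles every 25 C
T_AMBIENT_C = 35.0

def settle(theta_c_per_w, iters=50):
    temp = 60.0
    for _ in range(iters):
        leak = LEAK_AT_60C_W * 2 ** ((temp - 60.0) / LEAK_DOUBLES_EVERY_C)
        temp = T_AMBIENT_C + theta_c_per_w * (P_DYNAMIC_W + leak)
        if leak > 1000:                 # thermal runaway
            return temp, float("inf")
    return temp, leak

for theta in (0.15, 0.20, 0.25):        # better vs worse cooling, C per W
    t, leak = settle(theta)
    print(f"theta {theta:.2f} C/W -> ~{t:.0f} C, leakage ~{leak:.0f} W")

With the better cooler it settles; push the thermal resistance up a bit and it runs away, which is the positive-feedback problem in miniature.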

Though it seems to me NVidia dug its own grave on two key points: leaving GDDR5 implementation on any chip until too late, and targeting a meaningless performance level when a notably smaller chip (GF104) would get to within 20-30% of that performance level.

But then, for some reason, NVidia has been failing to execute for years now - in that environment GF100 looks kinda doomed anyway.

Jawed
 
The very high data rates that GDDR5 pins run at mean that the long distance from MCs to pads must use up a fair amount of power.
I could be wrong but afaik, interconnect power consumption is a relatively smaller problem in the overall power consumption picture.
 
I don't know of any chip which actually implements that, though of course if you have multi-socket systems then voltage is indeed independent. All they have is a separate power plane for north bridge, but all cores still share a single power plane.
I suspect it just adds too much complexity (== cost) to the board.

Nehalem-based designs have independent power planes per core, though they share the same voltage regulator.
 
I could be wrong but afaik, interconnect power consumption is a relatively smaller problem in the overall power consumption picture.
Memory clocks are lowered to save power, depending on the workload, e.g. single-screen Windows desktop will have lowest clocks. Memory chips themselves, obviously, take a chunk of power.

There's no clear information for GPU memory interface power requirements that I'm aware of - merely that the power pad density is quite high (and some of that is merely localised grounding for noise reduction).
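
As a crude illustration of why the downclocking matters (and why a memory controller that can't downclock hurts so much at idle): interface and DRAM dynamic power scale roughly with V^2 * f. The reference wattage, voltages and clocks below are pure assumptions:

# Crude illustration of idle memory downclocking.
# The V^2 * f scaling is standard; all absolute numbers are assumptions.

def mem_interface_power(v, f_mhz, p_ref_w=30.0, v_ref=1.5, f_ref_mhz=900.0):
    # Scale an assumed reference interface+DRAM power with V^2 * f.
    return p_ref_w * (v / v_ref) ** 2 * (f_mhz / f_ref_mhz)

full_3d = mem_interface_power(1.5, 900)    # assumed 3D clocks
idle = mem_interface_power(1.35, 300)      # assumed 2D/desktop clocks
stuck = mem_interface_power(1.5, 900)      # e.g. GDDR5 stuck at 3D clocks

print(f"3D: {full_3d:.0f} W   idle (downclocked): {idle:.0f} W   "
      f"idle but stuck at 3D clocks: {stuck:.0f} W")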

Jawed
 
Memory clocks are lowered to save power, depending on the workload, e.g. single-screen Windows desktop will have lowest clocks. Memory chips themselves, obviously, take a chunk of power.

There's no clear information for GPU memory interface power requirements that I'm aware of - merely that the power pad density is quite high (and some of that is merely localised grounding for noise reduction).

Jawed

What I am saying is that while you try to save every mW you can, from wherever you can, interconnect power consumption should be pretty low on your 'suspect list' for gf100.
 
What I am saying is that while you try to save every mW you can, from wherever you can, interconnect power consumption should be pretty low on your 'suspect list' for gf100.
I think you're wrong ;) Otherwise there'd be no down-clocking of memory.
 
Memory or the memory controller? One cannot go without the other, I guess (see RV770's display "corruption").
Very hard to separate them.

The "broken" memory controller in RV770 (can't downclock GDDR5 properly) is one of the key reasons HD4870's idle power is so awful.

Jawed
 
I think you're wrong ;) Otherwise there'd be no down-clocking of memory.

I think you misunderstood me.

Downclocking memory saves power, sure.

The distance of the MCs from the pads in gf100, afaics, is a minor contributor to the power/heat levels of that board.
 
The distance of the MCs from the pads in gf100, afaics, is a minor contributor to the power/heat levels of that board.
Distance = power, and with data rates at ~4 GT/s the power is worse than with GDDR3, whose data rates are 2 GT/s+. There's no way to quantify this, though.
 
Distance = power, and with data rates at ~4 GT/s the power is worse than with GDDR3, whose data rates are 2 GT/s+. There's no way to quantify this, though.
Well, I'm sure there's a way, just not without vastly more information than is available to people like us ;)
 
Nehalem-based designs have independent power planes per core, though they share the same voltage regulator.
I don't really count that as fully independent power planes. They may be separated internally (for power gating), but externally there is only one Vcc plane (otherwise that would imply different pinouts for dual and quad cores, for example), at least in the diagrams I saw (and the official Intel specs). Since it is impossible to have separate voltage regulators with this design, I don't count it as independent power planes.
 
I don't know of any chip which actually implements that, though of course if you have multi-socket systems then voltage is indeed independent. All they have is a separate power plane for north bridge, but all cores still share a single power plane.
I suspect it just adds too much complexity (== cost) to the board.

IIRC, the AMD CPUs built for notebooks that came out around the launch of Phenom had, at least in design, separate voltage planes for the two cores and the uncore. That is three of them.

These CPUs were K8 based, not K10, with some updates to the architecture.
 
IIRC, the AMD CPUs built for notebooks that came out around the launch of Phenom had, at least in design, separate voltage planes for the two cores and the uncore. That is three of them.

These CPUs were K8 based, not K10, with some updates to the architecture.
Yes, forgot about those - that was the Griffin core. I wonder if any of the Puma platform designs actually implemented this. AFAIK the newer cores no longer support it (maybe also because the platform needs to support more than 2 cores)?
The more cores you have, though, the more difficult this would be to implement - completely infeasible for GPUs.
 