Bandwidth costs power.
What's notable about the Fermi architecture is that, supposedly, a variety of dedicated local buffers have been removed in favour of buffering through the L1/L2 cache hierarchy. This generalises the producer-consumer relationships of a pipelined, parallelising throughput processor, and that generalisation allows greater flexibility across workloads. For example, it should lead to higher minimum frame rates, since buffers shouldn't run out of space and halt the producer that feeds into them. But it also means that some kinds of data travel further in this architecture than they did in prior GPUs.
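As a rough illustration of the buffering point, here's a toy model in C of a producer that stalls when a small dedicated FIFO fills up during a consumer hiccup, versus the same traffic going through a much larger, cache-backed pool. The sizes and rates are invented for illustration; they're not Fermi's numbers.

/* Toy model of one pipeline stage feeding another, loosely analogous to the
 * dedicated-FIFO vs cache-backed buffering trade-off above. All sizes and
 * rates are invented for illustration; they're not Fermi's numbers. */
#include <stdio.h>

static int simulate(int capacity, int cycles) {
    int occupancy = 0, stalls = 0;
    for (int t = 0; t < cycles; t++) {
        /* Consumer drains in bursts: 16 items every 16th cycle (avg 1/cycle). */
        if (t % 16 == 0) {
            occupancy -= 16;
            if (occupancy < 0) occupancy = 0;
        }
        /* Producer tries to emit 1 item per cycle; a full buffer halts it. */
        if (occupancy < capacity)
            occupancy++;
        else
            stalls++;
    }
    return stalls;
}

int main(void) {
    printf("small dedicated FIFO (8 entries): %d producer stalls in 1000 cycles\n",
           simulate(8, 1000));
    printf("large cache-backed pool (4096):   %d producer stalls in 1000 cycles\n",
           simulate(4096, 1000));
    return 0;
}

With matched average rates, the small FIFO stalls the producer for roughly half the cycles in this toy while the large buffer never does - which is the minimum-frame-rate argument in miniature.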
Parallel setup across the 4 GPCs in GF100 is another source of power overhead, because two-way synchronisation of the workload is needed for setup to function in a distributed fashion.
Overall, though, NVidia's architecture is not particularly power-efficient, even if the problems of older chips (reliance upon a surfeit of TMUs and ROPs that never delivered their rated throughputs) have been tackled quite effectively. Fermi's TMUs are doing very well as far as gaming is concerned, even if their absolute throughput isn't exciting.
For example, the memory controllers are placed at the centre of the die, as far as possible from the I/O pads that communicate with memory, whereas in ATI's chips the memory controllers sit as close as possible to the pads (admittedly we've only seen RV770's die). It seems to me NVidia does this because the ALUs, which consume a lot of power, need to be spread out across as much area as possible to minimise localised heating problems. The very high data rates that GDDR5 pins run at mean the long distance from the MCs to the pads must use up a fair amount of power.
This layout problem also exists in G80 and GT200.
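A back-of-envelope in C suggests the scale involved. Every constant below (wire capacitance, signalling swing, toggle rate, extra routing length) is my own guess for illustration, not anything measured on GF100; only the 384-bit bus width is real.

/* Back-of-envelope for the MC-to-pad routing cost, using P = a * C * V^2 * f
 * per wire. Every constant here is an assumption for illustration - wire
 * capacitance, swing, toggle rate, routing length - not a GF100 measurement. */
#include <stdio.h>

int main(void) {
    double c_per_mm  = 0.2e-12; /* assumed on-die wire capacitance, F per mm  */
    double voltage   = 1.0;     /* assumed internal signalling swing, V       */
    double freq      = 2.0e9;   /* assumed toggle rate for ~4 Gbps/pin data   */
    double activity  = 0.5;     /* assumed signal activity factor             */
    double bus_width = 384;     /* GF100's external memory bus width, bits    */
    double extra_mm  = 8.0;     /* assumed extra MC-to-pad distance vs ATI    */

    double watts_per_wire_mm = activity * c_per_mm * voltage * voltage * freq;
    double total = watts_per_wire_mm * bus_width * extra_mm;
    printf("~%.2f mW per wire per mm, ~%.2f W extra across the bus\n",
           watts_per_wire_mm * 1e3, total);
    return 0;
}

Even with those invented figures it comes out at over half a watt just for the extra routing, and it scales directly with distance and data rate, so the centre-of-die placement isn't free.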
Scalar ALU processing with fine-grained scheduling of individual instructions, a feature since G80, also gives NVidia relatively high power overheads. ATI's VLIW approach saves power and area, but at the cost of increased compilation complexity - though not all of ATI's compilation problems are centred on VLIW usage.
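To make the utilisation side of that trade-off concrete, here's a toy VLIW packer in C - deliberately naive, in-order and greedy over a made-up 5-slot bundle in the spirit of VLIW5, not ATI's actual compiler - showing how a dependent chain of operations leaves issue slots empty while independent operations pack densely.

/* Toy VLIW bundle packer vs scalar issue. Deliberately naive: in-order,
 * greedy, a made-up 5-slot bundle in the spirit of VLIW5 - this is not
 * ATI's compiler, just an illustration of slot utilisation.            */
#include <stdio.h>

#define NOPS  8
#define WIDTH 5

int main(void) {
    /* dep[i] = index of an earlier op that op i reads, or -1 if independent.
     * Ops 0,4,5,6,7 form a dependent chain; ops 1,2,3 are independent.      */
    int dep[NOPS] = { -1, -1, -1, -1, 0, 4, 5, 6 };

    int bundle_of[NOPS];
    int bundles = 0, slots_used = 0, i = 0;

    while (i < NOPS) {
        int slots = 0;
        /* Fill one bundle with ops whose inputs come from earlier bundles. */
        while (i < NOPS && slots < WIDTH &&
               (dep[i] < 0 || bundle_of[dep[i]] < bundles)) {
            bundle_of[i] = bundles;
            slots++;
            i++;
        }
        slots_used += slots;
        bundles++;
    }

    printf("scalar issue: %d cycles (one op per cycle, no empty slots)\n", NOPS);
    printf("VLIW-5 issue: %d bundles, %.0f%% of slots filled\n",
           bundles, 100.0 * slots_used / (bundles * WIDTH));
    return 0;
}

Here eight ops, five of them in a dependent chain, still take five bundles with only 32% of the slots filled; a scalar machine just issues one op per cycle and never wastes issue width, but pays for that flexibility in scheduling hardware and power.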
So attributing GF100's problems solely to the Fermi architecture is tricky. And the 40nm fuck-up plays a part too, handicapping bigger chips, simply because the process is naturally leaky and bigger chips and higher temperatures act as positive feedback on that leakage.
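That feedback loop is easy to see in a toy fixed-point model. All the constants below are invented for illustration - they are not 40nm process data.

/* Toy fixed-point model of the leakage/temperature feedback. The constants
 * are invented for illustration (not 40nm data): leakage is assumed to
 * double for every 40 degC above 50 degC, and die temperature to rise
 * linearly with total power.                                              */
#include <stdio.h>
#include <math.h>

static void settle(const char *name, double p_dyn, double leak_at_50c) {
    double r_th = 0.25;  /* assumed thermal resistance, degC per W */
    double temp = 50.0;
    double p_leak = leak_at_50c;
    for (int i = 0; i < 100; i++) {
        p_leak = leak_at_50c * pow(2.0, (temp - 50.0) / 40.0);
        temp   = 30.0 + r_th * (p_dyn + p_leak);  /* 30 degC coolant */
    }
    printf("%s: settles near %.0f degC with %.0f W of leakage (%.0f W nominal)\n",
           name, temp, p_leak, leak_at_50c);
}

int main(void) {
    /* Assumed dynamic power and nominal (50 degC) leakage for two die sizes. */
    settle("smaller chip", 100.0, 15.0);
    settle("bigger chip", 180.0, 40.0);
    return 0;
}

With these made-up numbers the smaller chip's leakage barely moves off its nominal value, while the bigger, hotter chip ends up leaking more than twice what it would at 50 degC - the shape of the problem, if not the real magnitudes.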
Though it seems to me NVidia dug its own grave on two key points: leaving its first GDDR5 implementation on any chip until too late, and targeting a meaningless performance level when a notably smaller chip (GF104) would get to within 20-30% of it.
But then, for some reason, NVidia has been failing to execute for years now - in that environment GF100 looks kinda doomed anyway.
Jawed