22 nm Larrabee

I think the problem with all the talk of 'dark silicon' is that it tends to make a number of very pessimistic assumptions about silicon scaling which are not true.

For example, people assume that voltage scaling is dead. Well, Sandy Bridge was designed to operate at 0.65V. That doesn't sound like voltage scaling is dead at all.

However, if you look at the long term trends, it's obvious that for power reasons you will favor a combination of fixed function hardware, throughput processors and latency processors. The idea that somehow a latency centric processor can displace all others is utterly idiotic.

DK
 
There's no choice. Physics doesn't like homogeneous compute at this scale.
No, physics doesn't like a high switching activity across your entire chip. That means you still get to choose between only using a limited number of high activity cores, or using a limited portion of each core.

That's what executing AVX-1024 on 256-bit execution units is about. The ALUs remain active, performing useful work, while the majority of the control logic can be clock gated for up to 3/4 of the time. Similarly, a single gather instruction replaces 18 extract/insert instructions, each of which involves the entire out-of-order execution pipeline and 128-bit ALU paths. So it's another big reduction in useless switching activity.
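To make the contrast concrete, here is a small Python sketch of the semantics only (not the actual micro-ops or instruction encodings); the memory contents and indices are made up for illustration:

```python
def gather(memory, indices):
    # One gather instruction: fetch all lanes from arbitrary indices as a
    # single vector operation (modeled here as one comprehension).
    return [memory[i] for i in indices]

def gather_emulated(memory, indices):
    # Emulation without gather: each lane needs an extract (pull the index
    # out of the vector register), a scalar load, and an insert (put the
    # value back into the destination register) -- and every one of those
    # instructions flows through the full out-of-order pipeline.
    result = [0] * len(indices)
    for lane in range(len(indices)):
        idx = indices[lane]      # extract index from vector register
        value = memory[idx]      # scalar load
        result[lane] = value     # insert into destination register
    return result

memory = [10, 20, 30, 40, 50, 60, 70, 80]
indices = [7, 0, 3, 3, 5, 1, 6, 2]
assert gather(memory, indices) == gather_emulated(memory, indices)
```

Both produce the same result; the difference is how many instructions (and how much pipeline activity) it takes to get there.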
 
No, physics doesn't like a high switching activity across your entire chip. That means you still get to choose between only using a limited number of high activity cores, or using a limited portion of each core.

That's what executing AVX-1024 on 256-bit execution units is about. The ALUs remain active, performing useful work, while the majority of the control logic can be clock gated for up to 3/4 of the time. Similarly, a single gather instruction replaces 18 extract/insert instructions, each of which involves the entire out-of-order execution pipeline and 128-bit ALU paths. So it's another big reduction in useless switching activity.
Or you execute the task on optimized high throughput cores which don't have this useless activity to start with.

Let the units with the best perf/W for a given task do that task. That's what this dark silicon thing is all about: you can afford to put different cores in.
 
The latter paper seemed to be using ITRS predictions, which do seem to be more pessimistic. Is that because it tries to make a prediction for the entire industry, including less performant fabs?

Sandy Bridge can operate at 0.65V, but under what conditions, such as product line and power state? Most (edit: desktop) designs seem to kick up to over 1V when doing something, and this has been my experience with SB as well.

0.65V is a nice improvement from Banias, which had a lower bound for non-ULV parts of 0.96V at 130nm.
 
Yes, the 512-bit refers to the width of the register, which would contain 16 32-bit values or 8 64-bit values. The operation would be applied to each element within the register.
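As a quick illustration, with plain Python lists standing in for the SIMD register:

```python
REG_BITS = 512
ELEM_BITS = 32
LANES = REG_BITS // ELEM_BITS   # 16 lanes of 32-bit values

a = list(range(LANES))          # one 512-bit register's worth of data
b = [3] * LANES
# A vector add applies the operation to every element pair independently.
result = [x + y for x, y in zip(a, b)]
```

With 64-bit elements the same register holds 8 lanes instead of 16; the per-element semantics are unchanged.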
 
Let the units with the best perf/W for a given task do that task. That's what this dark silicon thing is all about: you can afford to put different cores in.
I'd put it differently.
Since you'll be powering down lots of silicon anyway, you might as well make sure that whatever is on is specialized for the current workload.
 
Or you execute the task on optimized high throughput cores which don't have this useless activity to start with.
Why outsource it when you can do it locally, minimizing data traffic and maximizing programmability? With AVX-1024, the CPU cores themselves would be "optimized high throughput cores" where the majority of useless switching activity is clock gated. The remaining active control logic would be comparable to a GPU's dynamic scheduling.

So GPGPU doesn't stand a chance. Software rendering is a much bigger challenge, but at least for systems where graphics isn't a primary concern (like IGP-powered systems today) it should be a viable option. And since GPUs themselves are becoming ever more programmable, and software rendering has its own efficiency advantages, the market for high-throughput homogeneous CPUs would expand regardless of power consumption challenges. You can't replace programmable logic in a GPU with fixed-function logic at will.
That's all what this dark silicon thing is about: you can afford to put different cores in.
Or, you can afford ISA extensions like generic gather support and AVX-1024 which increase effective performance/Watt. And again, this isn't the end of the convergence...
 
I think the problem with all the talk of 'dark silicon' is that it tends to make a number of very pessimistic assumptions about silicon scaling which are not true.

For example, people assume that voltage scaling is dead. Well, Sandy Bridge was designed to operate at 0.65V. That doesn't sound like voltage scaling is dead at all.
The primary constraint on voltage scaling is SRAM. For example, Icera uses the same standard SRAM cells provided by TSMC, but add quite a lot of logic around it to allow it to scale either lower in voltage at the minimum frequency or higher at the maximum frequency. But that adds a very large overhead: based on the die shots and the known die sizes, I estimate Tegra 2's L2 cache to have a density of 0.465mm²/Mbit whereas Icera on basically the same process takes 0.809mm²/Mbit.
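For reference, the area overhead implied by those density estimates works out as follows (the figures are the die-shot estimates above, not official numbers):

```python
# Estimated SRAM array densities in mm^2 per Mbit, from the die-shot
# estimates above; both on essentially the same TSMC process.
tegra2_density = 0.465
icera_density = 0.809

# The extra logic for voltage/frequency scaling costs roughly 74% more
# area per bit of SRAM.
overhead = icera_density / tegra2_density
```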

Your own article on Sandy Bridge says: "The Sandy Bridge design team chose to use the core voltage for the L3 cache to reduce dynamic power. However, the low 0.65V target require new logic design to guarantee correct operation of the L3 cache and various arrays – especially given the increasing leakage and variation at 32nm."

I suspect this isn't only true of the L3 cache but also other SRAM arrays. Therefore what really scaled so well is the minimum voltage (mostly because of circuit design rather than process) and not the actual voltage for a given level of performance. That's a pretty dramatic difference if I'm right.

Certainly High-K (gate last) helps and Tri-Gate will do so as well, but for that reason and others I wouldn't be as optimistic as you are on the process front.

BTW David, I'd be much obliged if you (or someone else) could counter Nick's AVX-1024-over-4-cycles point. Besides not helping one iota for leakage and wasting both significant area and vector efficiency compared to an in-order core with AVX-256, it seems rather dubious to me for other reasons, but I'm not knowledgeable enough to be certain about it.
 
BTW David, I'd be much obliged if you (or someone else) could counter Nick's AVX-1024-over-4-cycles point. Besides not helping one iota for leakage and wasting both significant area and vector efficiency compared to an in-order core with AVX-256, it seems rather dubious to me for other reasons, but I'm not knowledgeable enough to be certain about it.

Well, I am obviously venturing into dragon land here, but I'll take a shot anyway. :smile:

<asbestos clothes on>

a) Branch predictors, branch speculation, memory speculation, and decoders will be unaffected by this, and collectively they are probably bigger than the ROB.

b) Intel combines the int and float ROB, and IIRC has done so for a very long time. So unless they split that, the ROB will still be running at full blast, since it will be provisioned for pure integer code, which is much more irregular.

c) You can't shut down the full ROB, as it will probably stall the frontend of the pipeline as well and slow down the integer core which is on all the time.

d) Even AVX code depends on the int core for branching and looping, so any power saved here will affect AVX code, one way or another.

e) OoO helps AVX as well, so gating that will almost certainly affect AVX's scheduling too. Immaculately scheduled AVX is impractical for anything that is not written in assembly, i.e. everyone. Besides, with the maturing of modern shader-like languages, devs will move away from intrinsics, let alone assembly.

f) AFAICS, the latency of going into and coming out of a gated state for the ROB won't be zero cycles without paying some area and power cost of its own.

<asbestos clothes off>

Bottom line, while it might save a little, I just don't see it saving an OoM.
 
Why outsource it when you can do it locally, minimizing data traffic and maximizing programmability?
The answer is simple. To get the optimal performance within the power budget of the chip. It doesn't take a rocket scientist to figure that out. ;)
With AVX-1024, the CPU cores themselves would be "optimized high throughput cores"
I don't think the optimization would go so far that it compromises performance on latency-bound serial code. With a throughput-optimized core it does. ;)
software rendering has its own efficiency advantages
Care to elaborate which ones?
You can't replace programmable logic in a GPU with fixed-function logic at will.
It's more the other way around: you can't replace a fixed function logic block with a more programmable one at will without a power disadvantage. ;)
 
Also the notion that programmable HW gives you better data locality is baseless, it's not like fixed function units cannot share on-chip storage.
 
Also the notion that programmable HW gives you better data locality is baseless, it's not like fixed function units cannot share on-chip storage.
You are completely right of course, but it has been mentioned several times already that any relevant dataset, processed in a way even loosely resembling something one would call throughput oriented, will reside in the (shared) caches or memory anyway. Nick didn't seem to react to that argument in a meaningful way, so I decided not to repeat it a fourth time or so. ;)
 
David, I'd be much obliged if you (or someone else) could counter Nick's AVX-1024-over-4-cycles point. Besides not helping one iota for leakage...
Are idle texture samplers, ROP, and rasterizer units powered off entirely?
...and wasting both significant area...
It's not a waste, it's just largely clock gated during throughput-oriented workloads to save power. It's essentially no different from an APU where the CPU side is clocked down, except that an APU isn't as flexible.
...and vector efficiency compared to an in-order core with AVX-256...
Today's GPUs have logical vector lengths of 2048- and 1024-bit, no? Also keep in mind that even when AVX-1024 is supported, you can still use narrower vectors if that would improve efficiency for your particular algorithm.
 
a) Branch predictors, branch speculation, memory speculation, and decoders will be unaffected by this, and collectively they are probably bigger than the ROB.
There's no switching activity in the branch predictors unless a branch is encountered, of which throughput-oriented code has few. The 4-cycle execution of AVX-1024 instructions on 256-bit units also reduces branch frequency and consequently speculation. Decoders and every other part of the front-end (fetch, dispatch, rename) can be clock gated as soon as the reorder buffer is sufficiently full.
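A rough way to see the branch-frequency effect, using a made-up workload size:

```python
def loop_branches(n_elements, elem_bits, vector_bits):
    # One conditional branch per loop iteration; each iteration
    # processes vector_bits / elem_bits elements.
    lanes = vector_bits // elem_bits
    return -(-n_elements // lanes)   # ceiling division

# Hypothetical stream of 4096 32-bit elements:
branches_256 = loop_branches(4096, 32, 256)     # AVX-256 loop
branches_1024 = loop_branches(4096, 32, 1024)   # AVX-1024 loop
```

The wider logical vector executes the same work with a quarter of the loop branches, and therefore a quarter of the branch-related speculation.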
b) Intel combines the int and float ROB, and IIRC has done so for a very long time. So unless they split that, the ROB will still be running at full blast, since it will be provisioned for pure integer code, which is much more irregular.
c) You can't shut down the full ROB, as it will probably stall the frontend of the pipeline as well and slow down the integer core which is on all the time.
Allocation can also be clock gated when the ROB is full, completion can be clock gated per execution pipeline, and retirement just has to keep up with the issue rate. Due to in-order retirement and the AVX-1024 instructions taking 4 cycles, any scalar instructions mixed in will still have a reduced completion rate.
d) Even AVX code depends on the int core for branching and looping, so any power saved here will affect AVX code, one way or another.
That's just a design choice, and it can evolve in any direction. Note that GCN appears to have just a single tiny scalar unit so there's no reason to fear that an x86 CPU would have underpowered scalar performance any time soon.
e) OoO helps AVX as well, so gating that will almost certainly affect AVX's scheduling too. Immaculately scheduled AVX is impractical for anything that is not written in assembly, i.e. everyone. Besides, with the maturing of modern shader-like languages, devs will move away from intrinsics, let alone assembly.
The scheduler would only be clock gated for 3 out of 4 cycles, during the in-order issue of the remaining three 256-bit portions of each 1024-bit instruction. You don't need out-of-order execution during this time, so no performance is lost.
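In cycle-accounting terms, the claimed scheme looks like this (a sketch of the argument, not a measured figure):

```python
def scheduler_activity(n_avx1024_insts, cycles_per_inst=4):
    # Out-of-order selection happens only on the first issue cycle of
    # each 1024-bit instruction; the remaining three 256-bit portions
    # issue in order with the scheduler clock gated.
    total_cycles = n_avx1024_insts * cycles_per_inst
    active_cycles = n_avx1024_insts
    return active_cycles, total_cycles

active, total = scheduler_activity(1000)
# scheduler clocked on only a quarter of the issue cycles
```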
f) AFAICS, the latency of going into and coming out of a gated state for the ROB won't be zero cycles without paying some area and power cost of its own.
It's fine if clock gating the front-end takes several cycles, since it can fill the ROB faster than the instructions get issued anyway (especially with AVX-1024). Completion gating probably doesn't require more than the one cycle after issue. And retirement can lag behind completion at least another cycle.
 
The answer is simple. To get the optimal performance within the power budget of the chip.
You're not getting optimal performance nor optimal power consumption when moving data and control back and forth between heterogeneous cores, while you could have done things locally in a core that can handle any workload.
I don't think the optimization would go so far that it compromises performance on latency-bound serial code. With a throughput-optimized core it does.
How is that supposed to be a good thing?

An APU's IGP sacrifices sequential performance in favor of lower power consumption, however you still have to take the CPU cores into account as well! A homogeneous high-throughput CPU would just redistribute the power density, only without performance compromises.
Care to elaborate which ones?
As I've mentioned before, a software renderer can skip inactive or redundant operations.
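For example, a toy per-pixel sketch; `expensive_shade` is a hypothetical stand-in for the actual shading work:

```python
def expensive_shade(x):
    # Stand-in for a costly per-pixel shading computation.
    return x * x

def shade_pixel(covered, depth, z_buffer, x):
    # A software renderer can skip shading entirely for pixels it can
    # prove are inactive or redundant, rather than doing the work and
    # discarding the result afterwards.
    if not covered:
        return None               # pixel not covered: no work at all
    if depth >= z_buffer[x]:
        return None               # occluded: shading would be redundant
    z_buffer[x] = depth
    return expensive_shade(x)

z_buffer = [1.0, 1.0, 1.0, 1.0]
visible = shade_pixel(True, 0.5, z_buffer, 3)    # shaded
occluded = shade_pixel(True, 0.8, z_buffer, 3)   # skipped early
```

A fixed pipeline has points where such work is performed unconditionally; software gets to choose the earliest point at which it can bail out.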
It's more the other way around: you can't replace a fixed function logic block with a more programmable one at will without a power disadvantage.
Actually sometimes you can. Putting more control into the hands of developers can definitely lead to higher effective performance/Watt. Also keep in mind that ISA extensions can give software access to more specialized hardware, without all the overhead of inter-core data and control transfer.
 
Also the notion that programmable HW gives you better data locality is baseless, it's not like fixed function units cannot share on-chip storage.
There's a distinction between heterogeneity and fixed-function hardware. Texture samplers and gather units both accelerate specific functionality (though the former is more dedicated while the latter is more generic), and they're both interfaced at the L1 cache and register file level. A heterogeneous core on the other hand can only realistically share the LLC, not the L1 cache let alone the register file. There's no fine-grained power-efficient data reuse between heterogeneous cores.

So a homogeneous architecture really does not imply banning all fixed-function hardware. It just gives each core the same capabilities and characteristics, and maximizes data locality.

In particular, for generic throughput computing there's no benefit to a heterogeneous architecture. The power efficiency of in-order execution can also be achieved with AVX-1024. Every other aspect of homogeneous processing is a clear win.
 
You're not getting optimal performance nor optimal power consumption when moving data and control back and forth between heterogeneous cores, while you could have done things locally in a core that can handle any workload.
I think several people have told you several times already that there is no significant additional data movement involved.
How is that supposed to be a good thing?
It shows that it can go further in the optimization for throughput-oriented workloads. And being more optimized for the workload means higher performance within the given power budget. And that is a good thing.
As I've mentioned before, a software renderer can skip inactive or redundant operations.
And that can't be done on a GPU?
That PCU doesn't replace anything, it adds. By the way, I think you can do very similar stuff as proposed there in the hull/domain shader stages.
 