22 nm Larrabee

If you wanted AVX2 to be a viable option, personally I wouldn't widen it to 1024-bit-over-4-cycles, but rather include a ridiculously beefy 256-bit AVX2+FMA pipeline (at least as fast as Haswell's) as an option for the 22nm Silvermont Atom core. And I wouldn't just clock gate it; I'd power gate it like ARM optionally does for their NEON SIMD (obviously this adds some scheduling complications if you care about maximizing performance, but I expect it to be manageable).
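For a sense of scale, here's a minimal sketch (plain C with AVX/FMA intrinsics, nothing Silvermont-specific assumed) of the kind of loop such a full-rate 256-bit FMA pipe would be fed: eight single-precision multiply-adds per instruction, and a Haswell-class pipe is expected to sustain two of these per cycle.

Code:
#include <immintrin.h>
#include <stddef.h>

/* Illustrative only: y[i] += a * x[i], eight floats per FMA instruction. */
void saxpy_avx2_fma(float *y, const float *x, float a, size_t n)
{
    __m256 va = _mm256_set1_ps(a);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(vx, va, vy);   /* 8 mul + 8 add, fused */
        _mm256_storeu_ps(y + i, vy);
    }
    /* scalar tail omitted for brevity */
}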
Sandy Bridge already power gates the AVX units. Agner Fog discovered that the latency of the instructions depends on whether any AVX instructions have been executed recently.

Clock gating the front-end is orthogonal to that. And the other benefit of executing 1024-bit instructions over 4 cycles is the latency hiding. That's something that would definitely benefit Atom too.
 
I haven't seen anything stating they power gate the vector unit, much less the AVX units that are part of it.
An implementation of power gates on that would be a physically difficult proposition.

The speculation I've seen is that the warmup period for AVX is a microcoded sequence of NOPs and FP instructions that gradually ramps up the FPU to full power so that the full-width vector units don't put too much of a load on the chip's power delivery network too fast for it to compensate.
 
Sandy Bridge already power gates the AVX units. Agner Fog discovered that the latency of the instructions depends on whether any AVX instructions have been executed recently.

Clock gating the front-end is orthogonal to that. And the other benefit of executing 1024-bit instructions over 4 cycles is the latency hiding. That's something that would definitely benefit Atom too.

David Kanter said some time ago on RWT that AVX is definitely NOT power gated.
 
Unfortunately, unless they resurrect their mass market GPU strategy for these, the average person won't be able to get their hands on them.

I am afraid that if they do not go the consumer GPU route, even HPC will not be able to afford them, crazy cross-subsidization / on-die integration aside.
 
AVX2 is obviously NOT power gated today. The warmup delay would be much longer (which poses all sorts of issues).

re Rattner saying future MIC will use Atom cores: here's hoping he means the 22nm OoOE Silvermont Atom cores and not the current ones. If so, that's certainly a good solution if they're willing to have both AVX2 and a MIC-only ISA, but it doesn't really solve anything in terms of mass market availability for developers.
 
AVX2 is obviously NOT power gated today. The warmup delay would be much longer (which poses all sorts of issues).

re Rattner saying future MIC will use Atom cores: here's hoping he means the 22nm OoOE Silvermont Atom cores and not the current ones. If so, that's certainly a good solution if they're willing to have both AVX2 and a MIC-only ISA, but it doesn't really solve anything in terms of mass market availability for developers.

It would be amusing to say the least to have something that does both AVX2 and vec16 ISA at the same time, even if it's two things on the same die.
 
I haven't seen anything stating they power gate the vector unit, much less the AVX units that are part of it.
AnandTech: "The upper 128-bits of the execution hardware and paths are power gated. Standard 128-bit SSE operations will not incur an additional power penalty as a result of Intel’s 256-bit expansion."
An implementation of power gates on that would be a physically difficult proposition.
What makes you think that?
The speculation I've seen is that the warmup period for AVX is a microcoded sequence of NOPs and FP instructions that gradually ramps up the FPU to full power so that the full-width vector units don't put too much of a load on the chip's power delivery network too fast for it to compensate.
If they're not power gated, why would a few more adders and multipliers require such a careful gradual ramp up? There are plenty of other non-AVX things you can do to create a local spike in power consumption, yet as far as I know these don't require microcoded sequences to ramp up the power delivery.

Anyhow, regardless of what they do right now it's of little relevance to where CPUs, MICs and GPUs are heading next.
 
What makes you think that?
The description of the power gates used indicates they are notably larger and have very long latency.
Since the other half of the 256-bit data paths linked to it is the data path for the integer units, which are still usable on demand, it didn't seem like a good idea.

If they're not power gated, why would a few more adders and multipliers require such a careful gradual ramp up?
It doubles the actively switching load of the vector unit. Clock gated units can draw much less power when they are quiescent. The transistors would still have static leakage to contend with, but this is far below peak draw.

There are plenty of other non-AVX things you can do to create a local spike in power consumption, yet as far as I know these don't require microcoded sequences to ramp up the power delivery.
Wakeup from the deeper sleep states has latency associated with it, just on a wider scale than waking up the AVX unit. There are actions involved in bringing the pipeline back online that don't need to be exposed to the programmer.

Anyhow, regardless of what they do right now it's of little relevance to where CPUs, MICs and GPUs are heading next.
Haswell may not be a significant shift as far as client platforms are concerned.
We will need to wait for a later successor architecture to see whether the direction has changed.
Intel's slides show a quad-core Haswell with GPU being able to max out at a 95W TDP, back up from the temporary drop to 77W with Ivy Bridge.
The next change may be the architecture after that.
 
re Rattner saying future MIC will use Atom cores: here's hoping he means the 22nm OoOE Silvermont Atom cores and not the current ones. If so, that's certainly a good solution if they're willing to have both AVX2 and a MIC-only ISA, but it doesn't really solve anything in terms of mass market availability for developers.

I'm curious if there have been any disclosures about Silvermont. For example, how many threads will it have? The case for four threads in the embedded/mobile space isn't too strong, but Larrabee really leaned on having the extra threads in waiting.

Rattner's hope of offering a future MIC as a socketed processor is a repeat of what was hoped for Larrabee, but I suspect the old headwinds of undercutting the Xeon group with a large but cheap chip are still in place. That could be remedied by charging Xeon prices for it, but that would hurt mass availability.
 
Since the other half of the 256-bit data paths linked to it is the data path for the integer units, which are still usable on demand, it didn't seem like a good idea.
Only the multiplier overlaps with the integer data path. There are five new 128-bit data paths which are only used by 256-bit AVX instructions, and therefore could potentially be power gated.
It doubles the actively switching load of the vector unit. Clock gated units can draw much less power when they are quiescent. The transistors would still have static leakage to contend with, but this is far below peak draw.
Is that significant at all? ALUs only account for a fraction of a CPU's total power consumption, and we're looking at only half of a portion of them, per core. It just seems to me that there must be things other than 256-bit AVX execution that are more critical to power consumption, yet to my knowledge don't require any ramp up sequences.

Power gating seems a more likely explanation of the transient delay.
 
Well with all due respect, David also said adding gather support to AVX would not be feasible, a few days before the AVX2 announcement...

Let's withhold judgement until we see the performance of gather.

Besides, I think David's sources and knowledge trumps Anand's anyway.
 
Let's withhold judgement until we see the performance of gather.
This forum is all about speculation, and you want to withhold judgement?

I'm convinced it will take three micro-instructions (one to extract the mask, one to perform the actual gather, and one to perform the final blend), with a maximum throughput of one instruction per cycle when all elements are from the same cache line. The micro-op breakdown is obvious considering that Haswell won't support FMA4 and the current vblendv instruction can take three 256-bit source registers thanks to a movmsk micro-instruction which extracts the mask and passes it as an immediate to a vblend micro-instruction.

I'm also expecting Haswell to feature one regular 256-bit load unit and one 256-bit gather unit per core. It needs the extra L1 bandwidth for the FMA instructions, and this setup would allow the gather unit to have a slightly higher latency and thus a reasonably power efficient implementation.

It would be irrational of Intel to aim for anything less. LRBni supports 512-bit gather, and they wouldn't add gather to AVX2 if it wasn't efficient. 2 x 256-bit execution begs for a high-performance gather instruction. Also, there's nothing reasonable in between sequential extract/insert and gathering from one cache line at a time, so it has to be the latter. And lastly, note that the coherency rules for the instruction are consistent with such an implementation.
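To make that three-micro-op breakdown concrete, here's a rough semantic emulation in C intrinsics of what a masked gather has to do (extract the mask, gather the selected elements, blend over the source). It's purely illustrative and says nothing about Haswell's actual micro-ops; with AVX2 the whole thing would be the single _mm256_mask_i32gather_ps instruction.

Code:
#include <immintrin.h>

/* Sketch of the speculated breakdown: movmsk -> gather -> blend.
   Unselected lanes keep the value from src, as the real instruction does. */
__m256 gather_emulated(__m256 src, const float *base, __m256i vindex, __m256 mask)
{
    int m = _mm256_movemask_ps(mask);              /* step 1: extract the mask bits */

    float tmp[8];
    int idx[8];
    _mm256_storeu_si256((__m256i *)idx, vindex);
    for (int i = 0; i < 8; i++)                    /* step 2: the actual gather */
        tmp[i] = ((m >> i) & 1) ? base[idx[i]] : 0.0f;
    __m256 gathered = _mm256_loadu_ps(tmp);

    return _mm256_blendv_ps(src, gathered, mask);  /* step 3: the final blend */
}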
Besides, I think David's sources and knowledge trumps Anand's anyway.
Again, neither his sources nor his knowledge helped him foresee gather support for AVX2...
 
Only the multiplier overlaps with the integer data path. There are five new 128-bit data paths which are only used by 256-bit AVX instructions, and therefore could potentially be power gated.
The data paths are the "stacks" the AVX blocks straddle in the diagram, not the individual units that are sometimes subsumed by the new blocks. If a blue unit is covered by a yellow block, I interpret that to mean that execution hardware is also shared.

Is that significant at all? ALUs only account for a fraction of a CPU's total power consumption, and we're looking at only half of a portion of them, per core. It just seems to me that there must be things other than 256-bit AVX execution that are more critical to power consumption, yet to my knowledge don't require any ramp up sequences.
My terminology was imprecise and misleading. The number of watts dissipated isn't the overriding factor. When I brought up power delivery, it had to do with the layers responsible for providing a proper voltage level to the FPU, but I muddled my description pretty severely. When a significant amount of logic becomes active in a short time while the voltage levels are adjusted to a reduced electrical load, it can compromise the correct functioning of the hardware.

The speculation on how this can apply to FP units was on realworldtech:
http://www.realworldtech.com/forums/index.cfm?action=detail&id=122180&threadid=122176&roomid=2
 
The speculation on how this can apply to FP units was on realworldtech:
http://www.realworldtech.com/forums/index.cfm?action=detail&id=122180&threadid=122176&roomid=2
That's about the entire FPU. Sure it makes no sense to power gate it, and it does make sense to clock gate it when running integer-only threads and perform a ramp-up sequence when activating it.

But that's not what I'm talking about. I'm talking about the extra ALUs that turn the 128-bit SSE units into 256-bit AVX units. AVX is still rarely used, so it makes sense to power gate this upper 128-bit part.

Agner Fog: "I found that this doubled throughput is obtained only after a warm-up period of several hundred floating point operations. In the "cold" state, the throughput is only half this value, and the latencies are one or two clocks longer."

In other words, the 128-bit SSE execution units can execute 256-bit AVX instructions, split over two cycles, while the upper halves take their own sweet time to power up. This also suggests Intel won't have much difficulty executing 1024-bit instructions on 256-bit units.
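The warm-up behaviour is easy enough to poke at from user code. Here's a minimal sketch (C intrinsics plus __rdtsc; the burst and iteration counts are arbitrary and the cycle figures only indicative) that times short bursts of dependent 256-bit adds, where the first bursts should come out slower if Agner's observation holds.

Code:
#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc */
#include <stdio.h>

int main(void)
{
    __m256 a = _mm256_set1_ps(1.0f), b = _mm256_set1_ps(2.0f);
    for (int burst = 0; burst < 10; burst++) {
        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < 100; i++)
            a = _mm256_add_ps(a, b);            /* dependent 256-bit adds */
        unsigned long long t1 = __rdtsc();
        printf("burst %d: %llu cycles\n", burst, t1 - t0);
    }
    volatile float sink = _mm256_cvtss_f32(a);  /* keep the result live */
    (void)sink;
    return 0;
}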
 
That's about the entire FPU. Sure it makes no sense to power gate it, and it does make sense to clock gate it when running integer-only threads and perform a ramp-up sequence when activating it.
It was an entire FPU back when designs were simpler and FPUs were smaller in transistor count. The extra data paths and hardware for the upper half of AVX 256 have more logic than a two-pipe FPU from a 15-year-old design.
In the case of old chip designs, it wasn't a question of clock gating. The difference between running FP instructions and not was enough to threaten the electrical stability of the chip back when clock gating was not pervasive.
Clock speeds these days are higher, voltage margins are thinner, and the disparity between active, idle, and completely dark silicon is wider.

But that's not what I'm talking about. I'm talking about the extra ALUs that turn the 128-bit SSE units into 256-bit AVX units. AVX is still rarely used, so it makes sense to power gate this upper 128-bit part.
It makes sense if the latency numbers make sense, and if it doesn't impact the active integer data paths the upper AVX units share.

Agner Fog: "I found that this doubled throughput is obtained only after a warm-up period of several hundred floating point operations. In the "cold" state, the throughput is only half this value, and the latencies are one or two clocks longer."
There were other measurements that put the warmup period around 70 cycles, which would translate to hundreds of instructions when there are several FP ports.

In other words, the 128-bit SSE execution units can execute 256-bit AVX instructions, split over two cycles, while the upper halves take their own sweet time to power up. This also suggests Intel won't have much difficulty executing 1024-bit instructions on 256-bit units.
A 128-bit unit can execute 256-bit instructions if they are cracked by the issue hardware. This has been done for multiple designs.
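As a software analogue of that cracking (not a statement about how the issue hardware actually does it), a 256-bit add can be expressed as two 128-bit SSE adds over the low and high halves:

Code:
#include <immintrin.h>

__m256 add256_as_two_128(__m256 x, __m256 y)
{
    /* low half on a 128-bit unit */
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(x),
                           _mm256_castps256_ps128(y));
    /* high half on the same (or another) 128-bit unit */
    __m128 hi = _mm_add_ps(_mm256_extractf128_ps(x, 1),
                           _mm256_extractf128_ps(y, 1));
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}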
 
The P4 was described as executing a 128-bit operation as two uops, which points to the decoder doing it.

That seems too funky for SB because of the uop cache and the decode restrictions for the simple decoders.
I did not see an indication in Agner's manual that AVX 256 experiences a drop in decode throughput, which a split would seem to cause.
Additionally, the move from slow mode to full mode would require that the decoders know to switch from a double op to a single uop, and the uop cache entry would need to be invalidated and replaced.

Reviewing Agner's optimization manual again, I see he mentions that 128-bit mode also experiences an increase in latency during the warmup phase, though its throughput remains the same.

What if SB is running 128-bit ops down alternate halves of the AVX unit during warmup?
Hopping from the low half to the high half is a 1-cycle penalty.
It wouldn't affect throughput for 128-bit ops, but 256-bit ops would have nowhere to hide the deficit, and then there is additional latency for recombining at the end until the AVX unit is fully on.

edit: An alternative explanation for the 128-bit latency increase may be that the unit's bypass capability is off initially, and is used more and more as the warmup progresses.
 
The P4 was described as executing a 128-bit operation as two uops, which points to the decoder doing it.

That seems too funky for SB because of the uop cache and the decode restrictions for the simple decoders.
I did not see an indication in Agner's manual that AVX 256 experiences a drop in decode throughput, which a split would seem to cause.
Additionally, the move from slow mode to full mode would require that the decoders know to switch from a double op to a single uop, and the uop cache entry would need to be invalidated and replaced.
Good points. In that case the 'sequencing' logic I talked about earlier already exists in Sandy Bridge, and AVX-1024 could build on that.
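Conceptually that sequencing is just multi-pumping. A hypothetical sketch (the avx1024_ps type is invented purely for illustration) of a 1024-bit add issued as four 256-bit passes over the same execution unit:

Code:
#include <immintrin.h>

typedef struct { __m256 part[4]; } avx1024_ps;   /* hypothetical 1024-bit register */

static inline avx1024_ps add1024(avx1024_ps x, avx1024_ps y)
{
    avx1024_ps r;
    for (int pass = 0; pass < 4; pass++)         /* one 256-bit pass per cycle */
        r.part[pass] = _mm256_add_ps(x.part[pass], y.part[pass]);
    return r;
}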
 