22 nm Larrabee

Multicycle ops could allow parts of the scheduler to be gated off some of the time, and they can reduce pressure on the uop cache and ROB. This can lead to savings, but I'm hesitant to ascribe a blanket 2x improvement when larger changes have led to more modest jumps.
Those larger changes aimed at reducing power consumption always had to cope with the same instruction rate. The new focus is instead on keeping throughput the same with fewer instructions...

Non-destructive instructions, FMA, and gather, all achieve that. Also note that future x86 processors might perform macro-op fusion on a pair of dependent copy and arithmetic instructions. And last but not least there's sequenced vector execution. So what I'm trying to say is that you can't really conclude anything from past architectural changes.

Granted, GPUs already feature all of this, but it shows just how much room for improvement CPUs have left, allowing them to converge close enough to make a homogeneous architecture superior.
 
Keeping 256-bit registers encourages an increase in the number of registers, since 1024-bit mode would consume rename space more quickly.
Keep in mind that since they take four cycles to execute, the latency hiding per logical register is increased. The same number of physical registers allows the same amount of latency hiding, regardless of whether you're using out-of-order execution or sequenced vector execution.

Sandy Bridge has 144 x 256-bit registers. Note that with two threads per core and 16 logical 1024-bit registers, you need at most 128 physical rename registers. So there's no strict need to increase the number of registers if you want to cover the same latency. And as long as there are independent instructions, those few extra registers will suffice to continue progress.
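A quick sanity check of that count (assuming each 1024-bit logical register is held as four 256-bit slices):

```python
# Architectural state needed for AVX-1024 on 256-bit physical registers
# (assumption: one 1024-bit logical register = four 256-bit slices).
threads = 2             # SMT threads per core
logical_regs = 16       # 1024-bit logical registers per thread
slices = 1024 // 256    # 256-bit physical registers per logical register

needed = threads * logical_regs * slices
sandy_bridge_prf = 144  # 256-bit physical registers in Sandy Bridge

print(needed)                     # 128
print(sandy_bridge_prf - needed)  # 16 registers left over for renaming
```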
 
Ok. So your AVX-1024 unit would need four cycles to finish its work. Except for the ability to process wider workloads and the power gating opportunities, I cannot see how this would improve compute density or throughput compared to an otherwise identical AVX-256 which would be able to retire one operation per cycle.
Yes, executing AVX-1024 on 256-bit units serves 'only' two purposes: reducing power consumption and implicitly increasing the amount of registers for latency hiding. It does not increase compute density.
 
Yes, executing AVX-1024 on 256-bit units serves 'only' two purposes: reducing power consumption and implicitly increasing the amount of registers for latency hiding. It does not increase compute density.

Ok thanks! I was desperately trying to find the improvements I thought you proposed also for compute density. But since you didn't propose that in the first place, this misunderstanding is cleared up now it seems. Thanks again.
 
It doesn't stall execution, which is all we really care about.
An OoO CPU tends to have instruction issue as a limiting factor rather than data latency. Halting the front end means a stall in further instruction fetches, branch predictions, and bringing in additional instructions for reordering. This can have an impact on other instruction types or other threads, since neither the OoO engine nor its multithreading can easily reorder around a stall in the front end.
If the supply of non-AVX ops in flight dwindles, the bubble can be exposed.

While executing AVX-1024 code, the reservation station will quickly become full. This means the logic to receive new uops from the front-end can be clock gated for a while (and as mentioned before, once the ROB becomes full the entire front-end can be clock gated).
I'm not sure if reservation station is the correct term for what Sandy Bridge uses. I think the physical register file means there isn't a collector of operands at the front of each issue port. Going with the reservation station idea, a full reservation station can cause the in-order front end to stall before the ROB or other ports are saturated.

Exception handling would essentially be no different than with cracked vector operations. Note that with the Pentium 4's cracked vector operations you got two identical uops, except for the physical register numbers. These can be fused together, and you only need a tiny bit of logic in a few places to sequence the register numbers.
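A rough sketch of that sequencing idea (the names and model are my own, not from any documented design): one fused 1024-bit uop expands into four 256-bit slice operations whose physical register numbers simply increment each cycle:

```python
def sequence_fused_uop(dst_base, src1_base, src2_base, slices=4):
    """Expand one fused 1024-bit uop into per-cycle 256-bit slice ops.

    Hypothetical model: consecutive physical register numbers hold
    consecutive 256-bit slices of a 1024-bit logical register, so the
    sequencer only needs a small counter added onto each base number.
    """
    for i in range(slices):
        yield (dst_base + i, src1_base + i, src2_base + i)

# One scheduler entry, four execution cycles:
for cycle, (dst, s1, s2) in enumerate(sequence_fused_uop(32, 8, 16)):
    print(f"cycle {cycle}: p{dst} <- p{s1} op p{s2}")
```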
I do not see how this would be compatible with the desire to gate the scheduler. The P4 cracked vector operations into separate uops that were actively issued by the scheduler as if they came from separate instructions. I was under the impression that an AVX-1024 op would not be cracked into multiple uops. Deferring the writeback of the exception status until the last cycle would allow the data path that updates the ROB status entries to be gated for several cycles.
Catching the exception status after every cycle would leave it active.
The uop would probably have a 4x as wide exception status section, which could mean that the path could be 4x as wide. However, the plus side to that is that this same path could be 3/4 gated off during normal mode AVX.


A potentially simpler alternative would be to drain the execution pipelines and put the entire core on a 1/4 clock regime, except for the ALU, register file and caches, once an AVX-1024 instruction reaches the scheduler.
That sounds like it would too seriously compromise the scalar performance of the core. Also, the L/S units and AGUs would need to be kept at speed if the cache is kept at speed.

Sandy Bridge has 144 x 256-bit registers. Note that with two threads per core and 16 logical 1024-bit registers, you need at most 128 physical rename registers. So there's no strict need to increase the number of registers if you want to cover the same latency. And as long as there are independent instructions, those few extra registers will suffice to continue progress.
144 256-bit registers is 36 1024-bit registers. The physical register file hosts both architectural and rename registers.
32 of them would be the architectural state registers for two threads.
That leaves 4 rename registers. For an AVX instruction with a 5+ cycle latency, the 4-cycle writeback phase would not occur immediately, so the pipeline would need independent instructions to hide the latency. Each new instruction takes a rename register, of which there are only 4 remaining.
Keeping a SB-sized register file would bring a GPU-like occupancy problem to the OoO engine, and bring up acute software scheduling pressures that the OoO engine used to handle transparently.
Also, once the 4 rename registers are taken, the front end will stall if it encounters another AVX instruction.
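The arithmetic behind that squeeze, under the same four-slices-per-logical-register assumption:

```python
prf_256 = 144                    # Sandy Bridge 256-bit physical registers
slices = 4                       # 256-bit slices per 1024-bit register
total_1024 = prf_256 // slices   # equivalent 1024-bit registers
arch_state = 2 * 16              # two threads x 16 logical registers
rename_1024 = total_1024 - arch_state  # left over for renaming

print(total_1024, arch_state, rename_1024)  # 36 32 4
```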

This could be avoided by expanding register capacity in some way, possibly with native 1024-bit registers or more of the smaller registers, to prevent a pretty significant regression in scheduling capability.

Other architectures may segregate the extra-wide register instructions onto a separate set of ports, or use a coprocessor model like BD or its APU descendants.
 
Each new instruction takes a rename register, of which there are only 4 remaining.
I'm not sure if that's the case. You only need a renamed register on a WAR-dependency of an instruction that has started execution but is not yet retired. With, say, 8 cycles between rename and retire, and two 256-bit AVX units, you need at most 16 extra 256-bit registers. Hence a total of 144 registers may suffice for AVX-1024.
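That bound is essentially Little's law applied to the register file: registers in flight = issue rate x lifetime. Under the stated 8-cycle assumption:

```python
issue_rate = 2   # two 256-bit AVX units, one uop per cycle each
lifetime = 8     # assumed cycles from rename (allocate) to retire (free)

in_flight = issue_rate * lifetime  # registers tied up at any moment
print(in_flight)  # 16
```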
 
I'm not sure if that's the case. You only need a renamed register on a WAR-dependency of an instruction that has started execution but is not yet retired.
A rename register is assigned as the destination of every new instruction. This avoids WAR and WAW hazards. In addition, this makes it easier to roll back in a speculative pipeline, since the instructions write to a location that can be safely ignored in the case of a mispredict.

In the case of a physical register file, a rename register also needs to be allocated because that is the only place results are stored.
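A minimal rename-table sketch illustrating why every instruction gets a fresh destination register (free-list recycling and retirement are omitted; the class and its interface are my own illustration, not a real design):

```python
class Renamer:
    def __init__(self, num_logical, num_physical):
        # Identity mapping for architectural state; the rest start free.
        self.table = list(range(num_logical))
        self.free = list(range(num_logical, num_physical))

    def rename(self, dst, srcs):
        # Sources read the current mapping; the destination always gets
        # a fresh physical register, killing WAR and WAW hazards and
        # leaving the old mapping intact for misprediction rollback.
        phys_srcs = [self.table[s] for s in srcs]
        new_dst = self.free.pop(0)
        self.table[dst] = new_dst
        return new_dst, phys_srcs

r = Renamer(num_logical=16, num_physical=24)
print(r.rename(dst=0, srcs=[1, 2]))  # (16, [1, 2])
print(r.rename(dst=0, srcs=[0]))     # (17, [16]) -- WAW on reg 0 resolved
```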

With, say, 8 cycles between rename and retire, and two 256-bit AVX units, you need at most 16 extra 256-bit registers. Hence a total of 144 registers may suffice for AVX-1024.

The remaining 16 physical registers amount to 4 1024-bit logical registers. Each group of 4 physical registers needs to retire as a unit.
How does the core make 16 256-bit physical registers pretend to be more than 4 1024-bit registers?
 
I'm not sure if that's the case. You only need a renamed register on a WAR-dependency of an instruction that has started execution but is not yet retired. With, say, 8 cycles between rename and retire, and two 256-bit AVX units, you need at most 16 extra 256-bit registers. Hence a total of 144 registers may suffice for AVX-1024.

8 cycles between rename and retire? What happens if you miss in cache? This is a terrible idea, which is precisely why Intel will do no such thing...

DK
 
I'm not sure if that's the case. You only need a renamed register on a WAR-dependency of an instruction that has started execution but is not yet retired. With, say, 8 cycles between rename and retire, and two 256-bit AVX units, you need at most 16 extra 256-bit registers. Hence a total of 144 registers may suffice for AVX-1024.

Eight cycles?!

Christ almighty, that is a severe penalty. That would kill the speed of many things...
 
A rename register is assigned as the destination of every new instruction. This avoids WAR and WAW hazards. In addition, this makes it easier to roll back in a speculative pipeline, since the instructions write to a location that can be safely ignored in the case of a mispredict.

In the case of a physical register file, a rename register also needs to be allocated because that is the only place results are stored.
I stand corrected. I imagined it was possible to use the logical register in case there were no WAR hazards. Completion and retirement would then coincide. But that probably doesn't allow sufficient time to resolve branches?
 
8 cycles between rename and retire? What happens if you miss in cache?
Hogging a physical register while waiting on a load operation seems like an unnecessary waste to me. Independent arithmetic instructions could use the register in the meantime. Since the load unit has its own storage, it seems to me that the register in the PRF doesn't have to be allocated till the load instruction is near completion.
This is a terrible idea, which is precisely why Intel will do no such thing...
Although I was wrong to assume that the existing 144 registers would suffice, that number is not set in stone. Since executing AVX-1024 on 256-bit units does not increase throughput, I believe only the extra storage of 1024-bit logical registers is required. That's 2x16x3 256-bit registers, or an increase from 144 to 240 registers. That doesn't seem unreasonable to me, especially since we can't realistically expect this before 2015 anyway.
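The register count behind that estimate:

```python
threads, logical = 2, 16
extra_slices = 3   # extra 256-bit slices per 1024-bit logical register

extra = threads * logical * extra_slices  # additional 256-bit registers
print(extra)        # 96
print(144 + extra)  # 240
```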
 
Eight cycles?!

Christ almighty, that is a severe penalty. That would kill the speed of many things...
I couldn't find any precise data for contemporary CPUs, but since a floating-point multiplication takes 5 cycles I thought 8 cycles from rename (allocation of a register) to retirement (freeing of a register) seemed realistic (and possibly even a bit optimistic). It made the math work out beautifully as well, although it's now clear to me that more than 16 spare registers are required...

Anyway, you seem to imply that it should be significantly less than 8 cycles? Do you have any more information on it?
 
64 cores at 2 GHz with half-rate DP should pull in about 2 TFLOPS DP.
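The arithmetic behind that estimate (8 DP lanes per 512-bit vector, with FMA counted as two flops):

```python
cores, ghz = 64, 2.0
dp_lanes = 512 // 64           # 8 double-precision lanes per 512-bit vector
flops_per_cycle = dp_lanes * 2  # FMA counts as two flops

gflops = cores * ghz * flops_per_cycle
print(gflops)  # 2048.0, i.e. ~2 TFLOPS DP
```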

The package seemed to be ~3in x 3in in size. The real surprise was that it wasn't in board form (are big test chips done that way?). I am hoping that this time Intel will learn its lesson and offer an in-socket cache-coherent accelerator.
 
Unfortunately, unless they resurrect their mass-market GPU strategy for these, the average person won't be able to get their hands on them.
Don't worry, Haswell will already bring us AVX2 with gather and FMA instructions, offering ~1 SP TFLOPS for an octa-core.
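The peak arithmetic for an octa-core with two 256-bit FMA pipes per core (the ~3 GHz clock is my assumption):

```python
cores = 8
fma_units = 2        # two 256-bit FMA pipes per core, as in Haswell
sp_lanes = 256 // 32  # 8 single-precision lanes per 256-bit vector

flops_per_cycle = cores * fma_units * sp_lanes * 2  # FMA = 2 flops
ghz = 3.0  # assumed clock
print(flops_per_cycle * ghz)  # 768.0 GFLOPS, i.e. roughly 1 SP TFLOPS
```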

And as I've said many times before in this thread, the next step is 1024-bit AVX instructions executed over 4 cycles, giving us great latency hiding, lower power consumption, and higher efficiency.
 
If Intel wanted to make the Larrabee ISA available to as many people as possible, my suggestion would be to include a small number of cores in some of their consumer GPUs and tout it as a physics accelerator (remember they control Havok). I'm pretty sure that won't happen though, and I don't really think even Intel thinks Larrabee is the future at this point.

If you wanted AVX2 to be a viable option, personally I wouldn't widen it to 1024-bit-over-4-cycles, but rather include a ridiculously beefy 256-bit AVX2 with FMA pipeline (at least as fast as Haswell) as an option to the 22nm Silvermont Atom core. And I wouldn't just clock gate it; I'd power gate it like ARM optionally does for their NEON SIMD (obviously this adds some scheduling complications if you care about maximizing performance but I expect it to be manageable).

Also, rpg, if KC clocks are only slightly higher than 1GHz on 22nm, then that means the pipelines are relatively short, and a shrink to 14nm won't change that - only a significant architectural revision would.
 