22 nm Larrabee

I guess you could work around it by pretending that the vector register is actually made up of several different registers internally. But that would probably also complicate a lot of other things. So it might not be a win overall, *shrugs*.
 
AMD should look beyond vector and come up with matrix instructions.
Implementing them on the GPU portion of the APU would be a piece of cake. :)
 
It has been confirmed that Haswell will also support FMA: http://origin-software.intel.com/en...l-new-instruction-descriptions-now-available/. Note that "our floating-point multiply accumulate significantly increases peak flops" also pretty much confirms Haswell will feature two 256-bit FMA units per core.

The 22 nm + FinFET advantage makes an 8-core mainstream CPU capable of 1 TFLOP quite feasible. Note that this is a tenfold increase over Nehalem! The era of heterogeneous computing is about to end. The only thing lacking is reducing out-of-order power consumption by executing 1024-bit instructions on 256-bit execution units.
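
As a back-of-the-envelope sanity check (a minimal sketch: the ~4 GHz clock and the two 256-bit FMA units per core are assumptions, not disclosures), the peak-FLOP arithmetic works out like this:

Code:
#include <cstdio>

int main() {
    // Hypothetical 8-core part with two 256-bit FMA units per core; the
    // clock speed and unit count are assumptions, not Intel disclosures.
    const double cores         = 8;
    const double fma_units     = 2;        // 256-bit FMA pipes per core (assumed)
    const double lanes         = 256 / 32; // 8 single-precision lanes per pipe
    const double flops_per_fma = 2;        // multiply + add count as two FLOPs
    const double clock_hz      = 4.0e9;    // ~4 GHz, assumed

    const double peak = cores * fma_units * lanes * flops_per_fma * clock_hz;
    std::printf("Peak: %.3f TFLOPS single precision\n", peak / 1e12); // ~1.024
    return 0;
}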
 
I do hope you're keeping this statement confined to the context of x86 as a remark that integrated GPUs will cease to grow.
Why? Intel can scale this to a high number of cores and regain dominance over the HPC market. Also, the IGP will vanish since the CPU cores are far more powerful. Once the majority of low-end graphics solutions consists of software rendering, and applications start taking advantage of its limitless capabilities, Intel can start selling many-core CPUs to gaming enthusiasts as well. It will take many years, but heterogeneous computing is doomed.

Note that the convergence toward homogeneous computing is happening from both ends. GPUs have to add ever more programmability, and as things scale up the only way to defeat Amdahl's Law is to start caring about single-threaded performance. Eventually the GPU has to adopt ever more CPU-like traits, up to the point where it makes no sense to have two similar architectures sit next to each other. The communication overhead makes heterogeneous computing inefficient at complex workloads.

Which brings us to the fact that software, including games, continues to diversify both inter-app and intra-app. A homogeneous architecture can deal with that more efficiently by keeping data more local, and it minimizes the complexity for developers.
 
It will take many years, but heterogeneous computing is doomed.
I think claiming that everything will be handled by OoOE CPUs (even with very wide vector units) is completely absurd. There is an argument to be made for a shared ISA between heterogeneous scalar-centric OoOE cores and vector-centric throughput cores that may (optionally) appear homogeneous to non-expert programmers. Both could even be x86/AVX-based CPUs. I don't think it's likely, but it does make some sense. But true homogeneous computing for everything? That sounds like crazy talk to me.

and as things scale up the only way to defeat Amdahl's Law is to start caring about single-threaded performance.
I must confess that I've always had some difficulty taking anyone seriously who believes Amdahl's Law is a *fundamental* limitation (within the next 25 years) rather than simply a bad choice of algorithms. In the real world, it is essentially an economic limitation - there simply isn't enough time and money to optimise or even get your basic architecture right. And this is true for all kinds of optimisation - not just parallelism.

The communication overhead makes heterogeneous computing inefficient at complex workloads.
That is true, but the communication overhead between two heterogeneous cores is no different than that between two homogeneous ones. It's not as if requesting data from a distant L3 hub was free. And as your core becomes bigger and bigger, the communication overhead *inside* the core increases as well! It's not just about lost instructions. If your architecture is properly designed, going to external memory will always be a few orders of magnitude more expensive.
 
Both could even be x86/AVX-based CPUs. I don't think it's likely, but it does make some sense. But true homogeneous computing for everything? That sounds like crazy talk to me.

There was a recent article in CACM, or ACM Queue, by some Intel engineers looking at how future micro-architecture is likely to evolve to deal with the dynamic power constraints, and issues such as wire limits (delay and energy), and circuit reliability.

The gist of what they proposed is that future CPUs will be more heterogeneous and will incorporate application-specific accelerators. The transistor budget will be huge, but power will constrain how much of a die can be active, so it will be desirable to super-optimise circuits for specific tasks, and the large transistor budget will make that affordable. I certainly wouldn't bet on the future of Intel CPUs being a homogeneous x86 design, no matter how programmer friendly that may sound.

Of course the software side of this issue is a massive cluster-f*ck. :)
 
It has been confirmed that Haswell will also support FMA: http://origin-software.intel.com/en...l-new-instruction-descriptions-now-available/. Note that "our floating-point multiply accumulate significantly increases peak flops" also pretty much confirms Haswell will feature two 256-bit FMA units per core.
My assumption would be that two FMAC units would allow for a consistent improvement over the previous generation, something that AMD relied upon for Bulldozer (though with FMA4).
Intel has not shied away from saying an FMAC is 2x FLOPs.
I will go by the assumption for now that it is an omission that the disclosure for FMA3 did not say "doubles peak".

The 22 nm + FinFET advantage makes an 8-core mainstream CPU capable of 1 TFLOP quite feasible.
Am I to take it that you've been informed of Haswell's die size and TDP targets?
For the mainstream in particular?

I would caution not to oversell FinFET in this instance. Intel claimed its density is comparable to a planar process at the same node, and its "two node" advantage was not claimed for the voltage and performance envelope we are discussing.
There are some disclosed items that would take non-trivial area increases to implement, and it is not yet disclosed that these could be replicated 8x on a mainstream die.

The only thing lacking is reducing out-of-order power consumption by executing 1024-bit instructions on 256-bit execution units.
This would save on decoder power consumption. Implementation details can significantly impact the rest of the claim. Merely cracking the operation would result in essentially no difference from the POV of the execution engine.

There was a recent article in CACM, or ACM Queue, by some Intel engineers looking at how future micro-architecture is likely to evolve to deal with the dynamic power constraints, and issues such as wire limits (delay and energy), and circuit reliability.

The gist of what they proposed is that future CPUs will be more heterogeneous and will incorporate application-specific accelerators. The transistor budget will be huge, but power will constrain how much of a die can be active, so it will be desirable to super-optimise circuits for specific tasks, and the large transistor budget will make that affordable. I certainly wouldn't bet on the future of Intel CPUs being a homogeneous x86 design, no matter how programmer friendly that may sound.
Psshh, what would they know...

More seriously, Intel is not a monolithic entity, and there are many points of view. There may be no small amount of competition internally over the eventual direction, although this prediction is consistent with multiple presentations over the last decade. One presentation had ISA consistency throughout the chip, but the individual computation units were very different in their design.
 
I think claiming that everything will be handled by OoOE CPUs (even with very wide vector units) is completely absurd.
If done right, executing 1024-bit instructions on 256-bit execution units would dramatically lower the power consumption of out-of-order execution. A 1024-bit register could be implemented as four 256-bit physical registers. These get fed into the pipelines sequentially, in four clock cycles. This sequencing doesn't have to involve any of the complex out-of-order execution. So it should get close to the power consumption efficiency of in-order execution.
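For what it's worth, here's a minimal C++ sketch of that sequencing idea (the register widths, chunk layout and names are purely illustrative, not a claim about how Intel would implement it):

Code:
#include <array>
#include <cstddef>

// Hypothetical model: one 1024-bit architectural vector register backed by
// four 256-bit physical registers. A single scheduled uop walks the four
// chunks on successive cycles, so the out-of-order machinery is exercised
// once instead of four times.
struct Vec256  { std::array<float, 8> lane; };
struct Vec1024 { std::array<Vec256, 4> chunk; };

// One 256-bit FMA "pipe": d = a * b + c, lane by lane.
static Vec256 fma256(const Vec256& a, const Vec256& b, const Vec256& c) {
    Vec256 d;
    for (std::size_t i = 0; i < 8; ++i)
        d.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return d;
}

// The "sequencer": one 1024-bit FMA instruction executed as four
// back-to-back 256-bit operations. The loop stands in for simple per-port
// sequencing logic; no re-scheduling happens between the four cycles.
Vec1024 fma1024(const Vec1024& a, const Vec1024& b, const Vec1024& c) {
    Vec1024 d;
    for (std::size_t cycle = 0; cycle < 4; ++cycle)
        d.chunk[cycle] = fma256(a.chunk[cycle], b.chunk[cycle], c.chunk[cycle]);
    return d;
}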
But true homogeneous computing for everything? That sounds like crazy talk to me.
Think outside the box.
I must confess that I've always had some difficulty taking anyone seriously who believes Amdahl's Law is a *fundamental* limitation (within the next 25 years) rather than simply a bad choice of algorithms. In the real world, it is essentially an economic limitation - there simply isn't enough time and money to optimise or even get your basic architecture right. And this is true for all kinds of optimisation - not just parallelism.
Indeed, Amdahl's Law is more of a practical limitation. But a very serious limitation nonetheless. It's why the transputer ended up in a museum. Providing a straightforward programming model is critical to an architecture's success. As the software becomes an ever more complex mix of workloads, a homogeneous architecture featuring both high serial performance and high throughput becomes preferable.
That is true, but the communication overhead between two heterogeneous cores is no different than that between two homogeneous ones.
Then don't communicate between them (or keep it to a bare minimum). The advantage of homogeneous cores is that you can switch between serial and parallel algorithms on the spot.
 
There was a recent article in CACM, or ACM Queue, by some Intel engineers looking at how future micro-architecture is likely to evolve to deal with the dynamic power constraints, and issues such as wire limits (delay and energy), and circuit reliability.

The gist of what they proposed is that future CPUs will be more heterogeneous and will incorporate application-specific accelerators. The transistor budget will be huge, but power will constrain how much of a die can be active, so it will be desirable to super-optimise circuits for specific tasks, and the large transistor budget will make that affordable. I certainly wouldn't bet on the future of Intel CPUs being a homogeneous x86 design, no matter how programmer friendly that may sound.

Of course the software side of this issue is a massive cluster-f*ck. :)
ARM agrees with this line of thinking.
 
Am I to take it that you've been informed of Haswell's die size and TDP targets?
For the mainstream in particular?
No. But a quad-core Sandy Bridge is actually quite small compared to for instance Lynnfield (fabbed at 45 nm), so especially without an IGP there should be room for up to 8 cores on a reasonably sized 22 nm die. It all depends on what Intel wants to sell at mainstream prices of course (which in turn largely depends on AMD). I'm just looking at the technical feasibility for now. If nothing else, the 16 nm shrink will definitely bring TFLOP performance to mortals.
This would save on decoder power consumption. Implementation details can significantly impact the rest of the claim. Merely cracking the operation would result in essentially no difference from the POV of the execution engine.
Indeed, splitting it into multiple independent uops would hardly help. But it should be quite possible to treat a single uop as a 'bundle' of four operations, no? Kind of like micro-op fusion...
One presentation had ISA consistency throughout the chip, but the individual computation units were very different in their design.
Such an architecture doesn't make sense, due to bandwidth. The transistor budget increases faster than bandwidth, so it's pointless to try to save die space at the expense of performance. Also, such an architecture is a major pain to develop for. Nobody wants to design software that has to run on an unknown configuration of cores (both from the same vendor and from other vendors). Also, how would an operating system schedule the threads? There really is no way back: every core should perform as well as (or better than) a core from the previous generation.
 
If done right, executing 1024-bit instructions on 256-bit execution units would dramatically lower the power consumption of out-of-order execution. A 1024-bit register could be implemented as four 256-bit physical registers. These get fed into the pipelines sequentially, in four clock cycles. This sequencing doesn't have to involve any of the complex out-of-order execution.

Would you want the scheduler to stop checking for instructions meant for other execution units? Would you want the memory speculation to shut down as well while this is happening?
 
Would you want the scheduler to stop checking for instructions meant for other execution units? Would you want the memory speculation to shut down as well while this is happening?
Wouldn't there automatically be less switching activity? Also, I imagine certain scheduler components can be clock gated as long as no new results arrive. Note that FinFET offers an advantage in static power consumption.
 
ARM agrees with this line of thinking.
Jem Davies: "Processing workloads will need to be moved to the particular area of the chip capable of executing that workload the most efficiently."

Moving data around also consumes power, and the latency overhead can kill performance. So effective performance/Watt may not be higher at all. In any case heterogeneous architectures are not a scalable solution, and they're hard to develop complex applications for.

Instead, we need each core to have a well balanced ISA, with possibly some task specific instructions. The most valuable instructions are those which are still fairly generic. Vector gather and FMA are great examples. x86 is an ugly duckling but the reality is that it's quite efficient at many workloads. AVX2 expands on this success.

The real challenge going forward is to find ways to lower the power consumption of control logic. That's why I'm suggesting to execute 1024-bit instructions as four sequenced 256-bit operations. The sequencing logic could be quite tiny, not unlike how GPUs process wide vectors on narrower ALUs...
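
To make the "fairly generic" point concrete, here's what gather plus FMA look like with the AVX2/FMA intrinsics from the new instruction descriptions (a minimal sketch; the array and parameter names are made up):

Code:
#include <immintrin.h>  // compile with -mavx2 -mfma

// Gather eight non-contiguous floats and fold a multiply-add into one
// instruction: vgatherdps plus one of the vfmadd forms. Names illustrative.
void gather_fma(const float* table, const int* idx, const float* scale,
                float* out)
{
    __m256i indices  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx));
    __m256  gathered = _mm256_i32gather_ps(table, indices, 4); // 4-byte elements
    __m256  s        = _mm256_loadu_ps(scale);
    __m256  acc      = _mm256_loadu_ps(out);
    acc = _mm256_fmadd_ps(gathered, s, acc);   // acc += gathered * s
    _mm256_storeu_ps(out, acc);
}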
 
Indeed, splitting it into multiple independent uops would hardly help. But it should be quite possible to treat a single uop as a 'bundle' of four operations, no? Kind of like micro-op fusion...
It would be treating a vector operation as a non-pipelined instruction, such as a divide.
It's certainly possible. The preference is for single-cycle or pipelined instructions, possibly for scheduling reasons.

Such an architecture doesn't make sense, due to bandwidth. The transistor budget increases faster than bandwidth, so it's pointless to try to save die space and sacrifice performance.
Intel put out that possibility in the presentation. Whatever reasons they had or have were not fully outlined in the slide. Possibly, it conserved power and provided higher performance in a way that provided higher utility to the customer.
 
Wouldn't there automatically be less switching activity?
Not necessarily. As each cycle lapses, there would be more instructions in the buffer, causing the scheduler to look for all possible multiple-issue opportunities. It also does nothing about the sub-scheduler responsible for serial instructions.

Besides, clock gating for >~3 cycles would be next to impossible in the sanctum sanctorum of the core for any reasonable workload, saving almost no power.
 
Moving data around also consumes power, and the latency overhead can kill performance.
Data will go via L3 - which is inclusive on Intel anyway - with any cache lines marked as M hopefully kicked downstairs. Even with an exclusive L3 like on AMD, you would want the data crossing cores to be out of L1 and L2 anyway, to avoid excessive coherency-induced stalls. Latency will be ~25 cycles, hardly anything to worry about for large kernels.

So effective performance/Watt may not be higher at all.
There's an order-of-magnitude difference in perf/W, and it's growing. It will be higher for well-designed systems.
In any case heterogeneous architectures are not a scalable solution, and they're hard to develop complex applications for.
That's because developers don't have much experience.

Instead, we need each core to have a well balanced ISA, with possibly some task specific instructions. The most valuable instructions are those which are still fairly generic. Vector gather and FMA are great examples. x86 is an ugly duckling but the reality is that it's quite efficient at many workloads. AVX2 expands on this success.
That way the entire core is powered on all the time, killing your perf/W.
The real challenge going forward is to find ways to lower the power consumption of control logic. That's why I'm suggesting to execute 1024-bit instructions as four sequenced 256-bit operations. The sequencing logic could be quite tiny, not unlike how GPU's process wide vectors on more narrow ALUs...
With realistic hw, there's an upper limit of 3-4 instructions per clock and we are already there. Serial perf is a dead end.
 
To go back to the implementation of gather, here are some speculative ideas on how it could be done.

*edit: wall o'text to follow*

One of the larger barriers to implementing a non-trivial gather is the necessary routing of elements from their position in a loaded datum to their position in the destination register. Some kind of arbitrary permute, or combination of operations, is necessary.
Such a unit is not without cost to area and power, and it would seem wasteful to have this permute capability only for gather.

Enter Intel's new permute instruction. I think this hardware could be used by both the permutes and gather. Implementation details are nonexistent at this time, but it would defray some of the costs of adding gather by also adding an instruction that fixes another weak spot for Intel's vector extensions.

Another bit of work that is needed, if we try a gather that is little more than N standard loads chained together, is a way to evaluate the index values for the gather to see whether they are neighbors on the same cache line, based on their values and on the value of the base register loaded from the scalar pipeline.
If we don't rely on a stream of microcoded instructions to do this, we need a logic block that can spit out a list of cache line neighbors in order to elide duplicate loads. There is currently no guarantee that Haswell will attempt this, but it would help reduce the latency and power cost of using the instruction.
The comparison would be an N-way check of the arithmetic difference of each index value plus the least significant bits of the base register needed to pick a cache line. The resulting values would be checked for where they lie on a cache line.
This could be done with a stream of instructions, though it would be faster if specialised hardware helped.

The next question would be the number and width of Haswell's memory ports, which could change some details as well.

The area-saving form of gather would be a microcoded instruction with possibly some hardware assist. Without some assist, it would save some area, but it would rapidly devolve into a dumb loop of loads. The exact process would be: load base, check proximity (generate load list, masks), issue load(s), permute(s). Each phase contributes some number of cycles: perhaps one or two for the base and proximity checks, four or so for each load, and some unknown number of cycles for the permute.
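
As a rough software model of that load base -> check proximity -> issue loads -> permute flow (purely illustrative; the 64-byte line size and the coalescing policy are my assumptions, not anything disclosed for Haswell):

Code:
#include <cstdint>

// Scalar model of the microcoded gather sequence sketched above: compute
// element addresses, group indices that hit the same 64-byte cache line so
// the duplicate line fetch can be elided, then route each element to its
// destination slot (the "permute" step).
void gather8_model(const float* base, const int32_t idx[8], float dst[8])
{
    constexpr std::uintptr_t kLineBytes = 64;   // assumed line size
    std::uintptr_t line_of[8];

    // "Check proximity": which cache line does each element land on?
    for (int i = 0; i < 8; ++i)
        line_of[i] = reinterpret_cast<std::uintptr_t>(base + idx[i]) / kLineBytes;

    bool done[8] = {};
    for (int i = 0; i < 8; ++i) {
        if (done[i]) continue;
        // One "load": in hardware this would fetch the whole line once and
        // satisfy every element that lives on it.
        for (int j = i; j < 8; ++j) {
            if (!done[j] && line_of[j] == line_of[i]) {
                dst[j] = base[idx[j]];   // "permute": route element to slot j
                done[j] = true;
            }
        }
    }
}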

As to the costs of implementing this and the claim of Intel implementing two 256-bit FMAs, along with the permute block, there are many things that could change the equation.

If we look at Bulldozer with 2 128-bit FMAs and a vector permute block, we see an FPU that is something shy of half the size of the 2 integer cores in a module.
Within the FPU, the XBAR block is between the two register file halves, and would be about 20-25% of the area of the rest of the FPU (and it is only 128 bits). There are other ops in that block of course (horizontal ops, etc), which Haswell will have as well, so growth is not on the permute block alone.
An FMA unit is somewhat smaller than having one FMUL and one FADD. It follows that two of them are somewhat smaller than doubling the ALU pipes in Haswell's FPU.

There will be expansion in the FPUs. First, the integer ops are promoted to 256 bits, which will expand the units.
The operand buses will be larger to handle the additional demand.
The permute block will be present, and the two FMA blocks would take somewhat less area than simply doubling everything.

Haswell is also a new core design, so the integer side would be growing as well.
I also think a gather instruction will heavily involve a significant portion of the FPU and some of the integer side. It would save a bit on instruction cache space. Power-wise, it skips the decoder, though on the downside the microcode engine typically blocks further issue until it is done.

A micrograph of Haswell's die would be interesting.
 
It would be treating a vector operation as a non-pipelined instruction, such as a divide.
It's certainly possible. The preference is for single-cycle or pipelined instructions, possibly for scheduling reasons.
I think a non-pipelined instruction is different because the entire register is fed into the ALU on the first cycle. Since 1024-bit physical registers and data paths are not realistic I imagine four 256-bit registers are needed instead, accessed sequentially. But this complicates register renaming and scheduling since a single instruction could involve 12 physical registers (16 for FMA4).

The fact that the scheduler is already capable of keeping track of sub-register dependencies makes me optimistic there might be a clever way to bundle four registers together, but it's clearly not without challenges. Intel's focus on performance/Watt probably means they're seriously exploring all possibilities though...
Intel put out that possibility in the presentation. Whatever reasons they had or have were not fully outlined in the slide. Possibly, it conserved power and provided higher performance in a way that provided higher utility to the customer.
Intel puts out a lot of presentations just to satisfy investor expectations. Perhaps 1 out of 10 ideas the academic research team comes up with ends up in a product on the shelves. Remember the plans to scale NetBurst beyond 10 GHz? The only certainty about such long-term plans is that nothing is certain.

I think Quick Sync shows more clearly where things might be going. Tasks that don't require any data to go back and forth can be implemented in dedicated fixed-function components. The vast majority of die space stays dedicated to programmable cores though. In my opinion that's the sort of semi-heterogeneous architectures developers can deal with. The exciting bit is the homogeneous cores.
 