AMD should look beyond vector instructions and come up with matrix instructions. Implementing them on the GPU portion of the APU would be a piece of cake.
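To make "matrix instructions" concrete: today even a small matrix-vector product has to be spelled out as a chain of vector FMAs. A hypothetical matrix instruction could fold a kernel like the one below into a single operation (a minimal sketch with existing AVX/FMA intrinsics; the function name is made up for illustration):

```cpp
#include <immintrin.h>

// y = M * x for a 4x4 column-major matrix, expressed as four vector FMAs.
// This is the kind of kernel a single "matrix instruction" could absorb.
// Compile with FMA enabled (e.g. g++ -O2 -mfma).
void mat4_mul_vec4(float y[4], const float M[16], const float x[4]) {
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(&M[0]), _mm_set1_ps(x[0]));
    acc = _mm_fmadd_ps(_mm_loadu_ps(&M[4]),  _mm_set1_ps(x[1]), acc);
    acc = _mm_fmadd_ps(_mm_loadu_ps(&M[8]),  _mm_set1_ps(x[2]), acc);
    acc = _mm_fmadd_ps(_mm_loadu_ps(&M[12]), _mm_set1_ps(x[3]), acc);
    _mm_storeu_ps(y, acc);
}
```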
The era of heterogeneous computing is about to end.
Quote: I do hope you're keeping this statement confined to the context of x86 as a remark that integrated GPUs will cease to grow.

Why? Intel can scale this to a high number of cores and regain dominance over the HPC market. Also, the IGP will vanish since the CPU cores are far more powerful. Once the majority of low-end graphics solutions consists of software rendering, and applications start taking advantage of its limitless capabilities, Intel can start selling many-core CPUs to gaming enthusiasts as well. It will take many years, but heterogeneous computing is doomed.
Quote: It will take many years, but heterogeneous computing is doomed.

I think claiming that everything will be handled by OoOE CPUs (even with very wide vector units) is completely absurd. There is an argument to be made for a shared ISA between heterogeneous scalar-centric OoOE cores and vector-centric throughput cores that may (optionally) appear homogeneous to non-expert programmers. Both could even be x86/AVX-based CPUs. I don't think it's likely, but it does make some sense. But true homogeneous computing for everything? That sounds like crazy talk to me.
Quote: ... and as things scale up the only way to defeat Amdahl's Law is to start caring about single-threaded performance.

I must confess that I've always had some difficulty taking anyone seriously who believes Amdahl's Law is a *fundamental* limitation (within the next 25 years) rather than simply a bad choice of algorithms. In the real world, it is essentially an economic limitation - there simply isn't enough time and money to optimise or even get your basic architecture right. And this is true for all kinds of optimisation - not just parallelism.
Quote: The communication overhead makes heterogeneous computing inefficient at complex workloads.

That is true, but the communication overhead between two heterogeneous cores is no different than that between two homogeneous ones. It's not as if requesting data from a distant L3 hub was free. And as your core becomes bigger and bigger, the communication overhead *inside* the core increases as well! It's not just about lost instructions. If your architecture is properly designed, going to external memory will always be a few orders of magnitude more expensive.
Quote: It has been confirmed that Haswell will also support FMA: http://origin-software.intel.com/en...l-new-instruction-descriptions-now-available/. Note that "our floating-point multiply accumulate significantly increases peak flops" also pretty much confirms Haswell will feature two 256-bit FMA units per core.

My assumption would be that two FMAC units would allow for a consistent improvement over the previous generation, something that AMD relied upon for Bulldozer (though with FMA4).
Quote: The 22 nm + FinFET advantage makes an 8-core mainstream CPU capable of 1 TFLOP quite feasible.

Am I to take it that you've been informed of Haswell's die size and TDP targets?
Quote: The only thing lacking is reducing out-of-order power consumption by executing 1024-bit instructions on 256-bit execution units.

This would save on decoder power consumption. Implementation details can significantly impact the rest of the claim. Merely cracking the operation would result in essentially no difference from the POV of the execution engine.
Quote: There was a recent article in CACM, or ACM Queue, by some Intel engineers looking at how future micro-architecture is likely to evolve to deal with the dynamic power constraints, and issues such as wire limits (delay and energy) and circuit reliability. The gist of what they proposed is that future CPUs will be more heterogeneous and will incorporate application-specific accelerators. The transistor budget will be huge, but power will constrain how much of a die can be active, so it will be desirable to super-optimise circuits for specific tasks, and the large transistor budget will make that affordable. I certainly wouldn't bet on the future of Intel CPUs being a homogeneous x86 design, no matter how programmer-friendly that may sound.

Psshh, what would they know...
Quote: I think claiming that everything will be handled by OoOE CPUs (even with very wide vector units) is completely absurd.

If done right, executing 1024-bit instructions on 256-bit execution units would dramatically lower the power consumption of out-of-order execution. A 1024-bit register could be implemented as four 256-bit physical registers. These get fed into the pipelines sequentially, over four clock cycles. This sequencing doesn't have to involve any of the complex out-of-order execution, so it should get close to the power efficiency of in-order execution.
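As a software analogy of that sequencing (the 1024-bit register type and the fma1024 helper below are hypothetical; this just shows the four-chunk decomposition using today's 256-bit AVX/FMA intrinsics):

```cpp
#include <immintrin.h>

// Hypothetical 1024-bit register, modeled as four 256-bit physical registers.
struct reg1024 { __m256 chunk[4]; };

// One "1024-bit FMA" carried out as four 256-bit FMAs over four steps.
// Only a tiny chunk counter advances; the out-of-order scheduler would see
// a single instruction rather than four independent uops.
// Compile with FMA enabled (e.g. g++ -O2 -mfma).
static inline reg1024 fma1024(reg1024 a, reg1024 b, reg1024 c) {
    reg1024 r;
    for (int i = 0; i < 4; ++i)            // sequenced, not re-scheduled
        r.chunk[i] = _mm256_fmadd_ps(a.chunk[i], b.chunk[i], c.chunk[i]);
    return r;
}
```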
Quote: But true homogeneous computing for everything? That sounds like crazy talk to me.

Think outside the box.
Quote: I must confess that I've always had some difficulty taking anyone seriously who believes Amdahl's Law is a *fundamental* limitation (within the next 25 years) rather than simply a bad choice of algorithms. In the real world, it is essentially an economic limitation - there simply isn't enough time and money to optimise or even get your basic architecture right. And this is true for all kinds of optimisation - not just parallelism.

Indeed, Amdahl's Law is more of a practical limitation, but a very serious limitation nonetheless. It's why the transputer ended up in a museum. Providing a straightforward programming model is critical to an architecture's success. As software becomes an ever more complex mix of workloads, a homogeneous architecture featuring both high serial performance and high throughput becomes preferable.
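For reference, the ceiling Amdahl's Law imposes is easy to quantify. With a parallelisable fraction p spread over N cores:

\[
S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
\]

So even a workload that is 95% parallel (p = 0.95) tops out at a 20x speedup no matter how many cores are thrown at it; the remaining serial 5% is what keeps single-threaded performance relevant.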
Quote: That is true, but the communication overhead between two heterogeneous cores is no different than that between two homogeneous ones.

Then don't communicate between them (or keep it to a bare minimum). The advantage of homogeneous cores is that you can switch between serial and parallel algorithms on the spot.
Quote: There was a recent article in CACM, or ACM Queue, by some Intel engineers looking at how future micro-architecture is likely to evolve to deal with the dynamic power constraints, and issues such as wire limits (delay and energy) and circuit reliability. The gist of what they proposed is that future CPUs will be more heterogeneous and will incorporate application-specific accelerators. The transistor budget will be huge, but power will constrain how much of a die can be active, so it will be desirable to super-optimise circuits for specific tasks, and the large transistor budget will make that affordable. I certainly wouldn't bet on the future of Intel CPUs being a homogeneous x86 design, no matter how programmer-friendly that may sound.

ARM agrees with this line of thinking.
Of course the software side of this issue is a massive cluster-f*ck.
Quote: Am I to take it that you've been informed of Haswell's die size and TDP targets? For the mainstream in particular?

No. But a quad-core Sandy Bridge is actually quite small compared to, for instance, Lynnfield (fabbed at 45 nm), so especially without an IGP there should be room for up to 8 cores on a reasonably sized 22 nm die. It all depends on what Intel wants to sell at mainstream prices, of course (which in turn largely depends on AMD). I'm just looking at the technical feasibility for now. If nothing else, the 16 nm shrink will definitely bring TFLOP performance to mortals.
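A back-of-envelope check of that 1 TFLOP figure (the ~4 GHz clock is an assumption for the sake of the arithmetic, not a known Haswell spec):

\[
8\ \text{cores} \times 2\ \text{FMA units} \times 8\ \text{SP lanes} \times 2\ \tfrac{\text{FLOPs}}{\text{FMA}} \times 4\ \text{GHz} \approx 1024\ \text{GFLOPS}
\]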
Quote: This would save on decoder power consumption. Implementation details can significantly impact the rest of the claim. Merely cracking the operation would result in essentially no difference from the POV of the execution engine.

Indeed, splitting it into multiple independent uops would hardly help. But it should be quite possible to treat a single uop as a 'bundle' of four operations, no? Kind of like micro-op fusion...
Quote: One presentation had ISA consistency throughout the chip, but the individual computation units were very different in their design.

Such an architecture doesn't make sense, due to bandwidth. The transistor budget increases faster than bandwidth, so it's pointless to try to save die space and sacrifice performance. Also, such an architecture is a major pain to develop for. Nobody wants to design software that can run on an unknown configuration of cores (both from the same vendor and from other vendors). Also, how would an operating system schedule the threads? There really is no way backward. Every core should perform the same as (or better than) a core from the previous generation.
Quote: Would you want the scheduler to stop checking for instructions meant for other execution units? Would you want the memory speculation to shut down as well while this is happening?

Wouldn't there automatically be less switching activity? Also, I imagine certain scheduler components can be clock gated as long as no new results arrive. Note that FinFET offers an advantage in static power consumption.
Quote: ARM agrees with this line of thinking.

Jem Davies: "Processing workloads will need to be moved to the particular area of the chip capable of executing that workload the most efficiently."
Quote: Indeed, splitting it into multiple independent uops would hardly help. But it should be quite possible to treat a single uop as a 'bundle' of four operations, no? Kind of like micro-op fusion...

It would be treating a vector operation as a non-pipelined instruction, such as a divide.
Quote: Such an architecture doesn't make sense, due to bandwidth. The transistor budget increases faster than bandwidth, so it's pointless to try to save die space and sacrifice performance.

Intel put out that possibility in the presentation. Whatever reasons they had or have were not fully outlined in the slide. Possibly it conserved power and delivered higher performance in a way that offered more utility to the customer.
Quote: Wouldn't there automatically be less switching activity?

Not necessarily. As each cycle lapses, there would be more instructions in the buffer, causing the scheduler to look for all possible multiple-issue opportunities. It does nothing about the sub-scheduler responsible for serial instructions.
Quote: Moving data around also consumes power, and the latency overhead can kill performance.

Data will go via the L3 - which is inclusive on Intel anyway - with any cache lines marked as M hopefully kicked downstairs. Even with an exclusive L3 like on AMD, you would want the data crossing cores to be out of L1 and L2 anyway, to save excessive coherency-induced stalls. Latency will be ~25 cycles, hardly anything to worry about for large kernels.
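To put that in perspective (the per-line work figure below is just an assumed example, not a measurement): handing a cache line from one core to another through the L3 costs on the order of the quoted ~25 cycles, so a kernel that spends a few thousand cycles of work per transferred line barely notices it:

\[
\frac{25\ \text{cycles handoff}}{5000\ \text{cycles of work per line}} = 0.5\%\ \text{overhead}
\]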
Quote: So effective performance/Watt may not be higher at all.

There's an order-of-magnitude difference in perf/W, and it's growing. It will be higher for well-designed systems.
Quote: In any case heterogeneous architectures are not a scalable solution, and they're hard to develop complex applications for.

That's because developers don't have much experience.
Quote: Instead, we need each core to have a well-balanced ISA, with possibly some task-specific instructions. The most valuable instructions are those which are still fairly generic. Vector gather and FMA are great examples. x86 is an ugly duckling, but the reality is that it's quite efficient at many workloads. AVX2 expands on this success.

That way the entire core is powered on all the time, killing your perf/W.
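For reference on the vector gather and FMA mentioned in the quote above, here is a minimal AVX2 sketch of the combination; the function name and the eight-element batch are illustrative, not anything from the thread:

```cpp
#include <immintrin.h>

// AVX2 gather + FMA: y[i] += w[i] * x[idx[i]] for eight lanes at once.
// Compile with AVX2 and FMA enabled (e.g. g++ -O2 -mavx2 -mfma).
void sparse_axpy8(float* y, const float* w, const float* x, const int* idx) {
    __m256  vw = _mm256_loadu_ps(w);
    __m256i vi = _mm256_loadu_si256((const __m256i*)idx);
    __m256  vx = _mm256_i32gather_ps(x, vi, 4);  // gather 8 floats by index
    __m256  vy = _mm256_loadu_ps(y);
    vy = _mm256_fmadd_ps(vw, vx, vy);            // fused multiply-add
    _mm256_storeu_ps(y, vy);
}
```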
Quote: With realistic hw, there's an upper limit of 3-4 instructions per clock and we are already there. Serial perf is a dead end.

The real challenge going forward is to find ways to lower the power consumption of control logic. That's why I'm suggesting to execute 1024-bit instructions as four sequenced 256-bit operations. The sequencing logic could be quite tiny, not unlike how GPUs process wide vectors on narrower ALUs...
Quote: It would be treating a vector operation as a non-pipelined instruction, such as a divide.

I think a non-pipelined instruction is different, because the entire register is fed into the ALU on the first cycle. Since 1024-bit physical registers and data paths are not realistic, I imagine four 256-bit registers are needed instead, accessed sequentially. But this complicates register renaming and scheduling, since a single instruction could involve 12 physical registers (three operands times four 256-bit slices), or 16 for FMA4 with its four operands.
It's certainly possible. The preference is for single-cycle or pipelined instructions, possibly for scheduling reasons.
Quote: Intel put out that possibility in the presentation. Whatever reasons they had or have were not fully outlined in the slide. Possibly it conserved power and delivered higher performance in a way that offered more utility to the customer.

Intel puts out a lot of presentations just to satisfy investor expectations. Perhaps 1 out of 10 ideas the academic research team comes up with ends up in a product on the shelves. Remember the plans to scale NetBurst beyond 10 GHz? The only certainty about such long-term plans is that nothing is certain.