22 nm Larrabee

The primary design target had its power budget cut in half. One of their biggest customers threatened to cut them out, and their executives acknowledged it as a real wake-up call.
Recognizing the potential for a new market with low power demands in no way implies they're going to sacrifice performance for every other market. It only means they should be even more focused on the performance / Watt metric. And that's what AVX eventually offers, so it fits the plan perfectly.

You should also realize that the Haswell design must have been near completion by the time Apple urged them to create lower power processors. And besides, they've got fast 17 Watt CPUs based on Sandy Bridge today. Since Tri-Gate offers substantial advantages in decreasing power consumption, and AVX can be heavily clock gated, investing in 2 x 256-bit FMA shouldn't be much of a problem, while still being able to hit the 15 Watt design goal for ultrabooks.
What changes have happened on the scheduling side in the last ~5 years?
First of all, 5 years is a relatively short time with only a couple major architectural changes, and GPU manufacturers are pretty secretive about these things. Still, over a period of merely 10 years GPUs have evolved from fixed-function to highly generic computing devices! That should tell you something, and it's pointless to discuss the individual changes that got us here. The relevant bit is that contemporary GPUs have complex scheduling not entirely unlike CPU schedulers. Sandy Bridge has a 54-entry scheduler, while Fermi deals with 48 potential warps each cycle. Also worth noting is that GF104 feeds a total of 7 execution ports, and uses superscalar scheduling. AMD will take a big leap from mostly static scheduling to dynamic scheduling with GCN. The incentive for adding such complexity is to avoid running out of thread context storage (registers) and improve cache hit rates to reduce bandwidth.

And I sincerely doubt that will be the last change to instruction scheduling for GPUs this decade. So it really doesn't make sense to say DLP is an afterthought for CPUs just because the scheduling is focusing on ILP. It's where GPUs are heading too, and it doesn't make them any less DLP-focused. It matters little from a complexity and power consumption perspective where the instructions come from. But as things continue to scale you do want to minimize the thread count, and for that you need a healthy amount of ILP.

Turning the CPU into a high throughput architecture is within reach. Any remaining advantage GPUs have over CPUs would be addressed by AVX-1024.
 
All the FMAs in the world won't make up for the lack of memory bandwidth. Until Intel starts selling a CPU soldered directly into a board, using small and expensive GDDR, throughput computations will be held up by lack of memory throughput on CPUs.
 
All the FMAs in the world won't make up for the lack of memory bandwidth.
Yeah, but big caches and decent prefetching kind of balance it out. Obviously you'll fare much better if you rework your algorithms not to stream through the same data over and over again, as is done on GPUs, but to take a small chunk, do everything you can with it, and then take another one.
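To make the chunked approach concrete, here's a minimal C++ sketch; pass_a and pass_b are just hypothetical stand-ins for real per-element kernels:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-element passes; stand-ins for real kernels.
static inline float pass_a(float x) { return x * 1.5f + 2.0f; }
static inline float pass_b(float x) { return x * x - 1.0f; }

// GPU-style streaming: each pass walks the whole array, so the data
// is pulled in from memory (or L3) twice.
void streaming(std::vector<float>& data) {
    for (float& x : data) x = pass_a(x);
    for (float& x : data) x = pass_b(x);
}

// Cache-blocked: take a chunk that fits in L1/L2, run every pass on it
// while it is still resident, then move on to the next chunk.
void blocked(std::vector<float>& data, std::size_t chunk = 8192) {
    for (std::size_t base = 0; base < data.size(); base += chunk) {
        std::size_t end = std::min(base + chunk, data.size());
        for (std::size_t i = base; i < end; ++i) data[i] = pass_a(data[i]);
        for (std::size_t i = base; i < end; ++i) data[i] = pass_b(data[i]);
    }
}
```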
 
Since Tri-Gate offers substantial advantages in decreasing power consumption, and AVX can be heavily clock gated, investing in 2 x 256-bit FMA shouldn't be much of a problem, while still being able to hit the 15 Watt design goal for ultrabooks.
I think it is quite possible that Haswell has 2xFMA. The promotion of INT to YMM, the rumored increase in port size, permute, and 2xFMA, in addition to some probable changes in the integer pipeline, are likely contributors to die size and power consumption that puts Haswell in the same power envelope as Sandy Bridge.
With roughly double the transistor budget and a reduction of almost half in per-transistor power, the desktop range yields a doubling of FP capability.

AMD will take a big leap from mostly static scheduling to dynamic scheduling with GCN. The incentive for adding such complexity is to avoid running out of thread context storage (registers) and improve cache hit rates to reduce bandwidth.
I wouldn't characterize it as being dynamic. Instruction issue is strictly scalar and in-order, outside of specific instances of compiler-directed runahead.
A more pressing concern for AMD was that its clause-based central scheduler was starting to come undone with more varied code mixes and branching. Additional features and quality of service changes such as interruptibility were also increasingly hostile to that approach.


Turning the CPU into a high throughput architecture is within reach. Any remaining advantage GPUs have over CPUs would be addressed by AVX-1024.
That's two vector widenings away, and I worry you are overselling the benefits. It has certain advantages, but I do not think it is as revolutionary as you proselytize.

This does not mean I necessarily care about GPGPU, which is more easily combated with SIMD extensions, to a point. My concern is the proper mixture of dedicated and generalized hardware to best meet performance and power needs.
Homogeneous solutions do not meet Haswell's desktop targets, where four cores will likely be sufficient to max out TDP in the same manner as they can for SB.
 
You should also realize that the Haswell design must have been near completion by the time Apple urged them to create lower power processors.

It won't take too long, I think, before Apple switches to ARM; once 64-bit has arrived, this is pretty obvious.
In the long run I don't even see much future in x86/x64; price and power consumption will make it uncompetitive. In 10 years even Intel will have to switch to ARM.
 
That's using some very optimistic thinking in favor of ARM.
The situation is less cut and dried at the performance levels of non-embedded x86 CPUs. There could be an advantage to having ARM, but there are many things Intel has that can counter the advantages of a more regular ISA.

We should wait and see how ARM does. It is currently not trying to take x86 on directly, and the ISA is not a dominating factor at the performance ranges in question. There are very, very strong advantages for Intel.
 
IMHO nobody cares about FP performance. It's not decisive for almost any benchmark (who really cares about Cinebench?), much less for any real non-niche application. Yet you argue as if FP is what is making billions for Intel. Bulldozer could otherwise have 4x the FP performance of SB and lower IPC than Phenom II, and nobody would buy it except for niche vertical applications.
It would be very noticeable if floating-point performance was crippled. For instance every game using the UT3 engine demands SSE2 support, and I'm sure there are plenty of other examples. Keep in mind that the CPU is a reliable source of low latency floating-point processing power, so it's used for things like physics, collision detection, audio, procedural effects, particles, animation, AI scripts, visibility determination, LOD calculations, illumination, etc. Also there are plenty of libraries and driver components which use FP SIMD behind the scenes.
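As a toy illustration of the kind of FP SIMD work listed above, here's a rough SSE sketch of a particle position update; the function and data layout are made up for the example:

```cpp
#include <immintrin.h>

// Toy particle integration of the kind games lean on SSE for:
// positions += velocities * dt, four floats per instruction.
void integrate(float* pos, const float* vel, float dt, int n) {
    __m128 vdt = _mm_set1_ps(dt);
    for (int i = 0; i + 4 <= n; i += 4) {
        __m128 p = _mm_loadu_ps(pos + i);
        __m128 v = _mm_loadu_ps(vel + i);
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));
        _mm_storeu_ps(pos + i, p);
    }
    for (int i = n & ~3; i < n; ++i)  // scalar tail
        pos[i] += vel[i] * dt;
}
```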

Business software, scientific computing, content creation, and many other fields/markets also expect competitive floating-point performance. So yes, Intel would easily lose billions if Haswell had just a single 256-bit FMA unit.

Also why would Intel double the width of floating-point operations with AVX, and why would AMD implement FMA4, if they didn't expect it to really matter? The GPGPU folks seem to think floating-point performance is paramount, so if Intel doesn't want to lose billions to NVIDIA and AMD it has to beef up its CPU's floating-point throughput.
 
I mentioned the cost of the associated structures that have to be beefed up as well, which will take a lot of area and power when you have already lost half of your power budget.
Why would it take "a lot" of area and power? Intel has doubled the SIMD execution width and cache bandwidth before while keeping power consumption in the same ballpark: T2700 -> T7800. Note the higher clock frequency, FSB speed, twice the cache, and x86-64 support, all on the same process node!

So they should have no problem whatsoever creating processors suited for ultrabooks, with 2 x 256-bit FMA and plenty of bandwidth, on a pretty revolutionary 22 nm Tri-Gate process.
At this point you launched an entirely pointless exercise of reaching <some arbitrary bogoflops number> by using <arbitrary vector size, clocks, issue width and cores>.
There's nothing arbitrary about it. We know for a fact that Haswell will support AVX2 + FMA3. So we know 1 x 256-bit FMA is the absolute minimum and 2 x 256-bit FMA is a reasonable maximum based on the Sandy Bridge architecture and looking at the competition and past enhancements.

If it's a pointless exercise to you to determine which of these is most likely, or something in between, then don't feel obliged to participate in the discussion.
The core argument is that there is no room for luxuries like extra FP units when power budgets are shrinking, and there are no applications out there that could use them in that timeframe, while the cost involved is more than just the encoding scheme to do all that.
Power budgets are not shrinking overall. They're just extending the range at the lower end. And judging by the Sandy Bridge low-power models, the superior 22 nm Tri-Gate process, the ability to clock gate the extra FP units (and adjust the Turbo Boost frequency when they're active), and Intel's past achievements, I don't see why you would think they have to cripple FP performance and performance / Watt instead of improving them.

Furthermore, any application using SSE or AVX would benefit from 2 x 256-bit FMA units, since they can also execute 2 x ADD or 2 x MUL each clock.
SB effectively has just one 256-bit MAD unit (as it can do one add and one mul per clock). I don't see any FP app suffering. Quite the contrary.
No, Sandy Bridge has independent MUL + ADD. Applications would suffer badly if Haswell supported only one FMA since it can only execute dependent MUL/ADD and only when using FMA3 instructions.
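A rough illustration of why a second FMA port helps legacy code too: with two 256-bit FMA units, the independent MUL and ADD below can both issue each cycle (an FMA unit can execute a plain MUL or ADD), whereas a single FMA port would serialize them. This is a sketch under that assumption, not a statement about Haswell's actual port layout:

```cpp
#include <immintrin.h>

// Independent multiply and add streams, as legacy SSE/AVX code often has.
// On SNB these go to separate MUL and ADD ports; with two FMA ports both
// can still issue every cycle, while a single FMA port would serialize them.
void mul_add_streams(float* a, const float* b, float* c, const float* d, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        __m256 vd = _mm256_loadu_ps(d + i);
        _mm256_storeu_ps(a + i, _mm256_mul_ps(va, vb));  // independent MUL
        _mm256_storeu_ps(c + i, _mm256_add_ps(vc, vd));  // independent ADD
    }
}
```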
 
According to Intel's PDF from IDF, Intel's AVX in SNB can also be seen as a combination of two SSE units.
That's their software development manual, and as far as I know it doesn't talk about hardware units at all. They merely try to explain that 256-bit AVX operations are split into two 128-bit lanes on which a legacy SSE operation is performed (except for a handful of cross-lane instructions).
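For anyone who wants to see the lane behaviour in code, here's a small illustrative example: the 256-bit shuffle applies the same SSE-style pattern to each 128-bit half, while VPERM2F128 is one of the handful of instructions that actually moves data across lanes:

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    __m256 v = _mm256_set_ps(7, 6, 5, 4, 3, 2, 1, 0);  // low lane {0..3}, high lane {4..7}

    // In-lane: the same 128-bit SSE-style shuffle is applied to each half;
    // elements never move between the lower and upper lane.
    __m256 in_lane = _mm256_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));

    // Cross-lane: VPERM2F128 swaps whole 128-bit lanes.
    __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);

    float a[8], b[8];
    _mm256_storeu_ps(a, in_lane);
    _mm256_storeu_ps(b, swapped);
    for (int i = 0; i < 8; ++i) std::printf("%g %g\n", a[i], b[i]);
    return 0;
}
```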
 
Recognizing the potential for a new market with low power demands in no way implies they're going to sacrifice performance for every other market. It only means they should be even more focused on the performance / Watt metric. And that's what AVX eventually offers, so it fits the plan perfectly.
Optimizing one thing means, by definition, not optimizing the rest.

Also, not increasing performance along one axis is not sacrificing it when there is no competition around.

You should also realize that the Haswell design must have been near completion by the time Apple urged them to create lower power processors. And besides, they've got fast 17 Watt CPUs based on Sandy Bridge today. Since Tri-Gate offers substantial advantages in decreasing power consumption, and AVX can be heavily clock gated, investing in 2 x 256-bit FMA shouldn't be much of a problem, while still being able to hit the 15 Watt design goal for ultrabooks.
Haswell has been designed around low power, through and through. You don't throw hardware at things which are not needed when the power budget is cut in half, your prized customers are screaming at you, and you face a mortal enemy in ARM+Win8.

First of all, 5 years is a relatively short time with only a couple major architectural changes, and GPU manufacturers are pretty secretive about these things. Still, over a period of merely 10 years GPUs have evolved from fixed-function to highly generic computing devices!
That should tell you something, and it's pointless to discuss the individual changes that got us here.
Then why claim that "ever more complicated changes" have happened?

The relevant bit is that contemporary GPUs have complex scheduling not entirely unlike CPU schedulers. Sandy Bridge has a 54-entry scheduler, while Fermi deals with 48 potential warps each cycle. Also worth noting is that GF104 feeds a total of 7 execution ports, and uses superscalar scheduling. AMD will take a big leap from mostly static scheduling to dynamic scheduling with GCN. The incentive for adding such complexity is to avoid running out of thread context storage (registers) and improve cache hit rates to reduce bandwidth.
GCN is still almost completely statically scheduled. It does not have to care about instruction latencies, and memory-induced switches are stamped by the compiler. The similarities between SB's and GCN's schedulers end at the number of entries. SB has to handle instruction latencies, instruction dependencies, L1/L2 misses...

And I sincerely doubt that will be the last change to instruction scheduling for GPUs this decade. So it really doesn't make sense to say DLP is an afterthought for CPUs just because the scheduling is focussing on ILP. It's where GPUs are heading too and it doesn't make them any less DLP focussed. It matters little from a complexity and power consumption perspective where the instructions come from. But as things continue to scale you do want to minimize the thread count, and for that you need a healthy amount of ILP.
Forget scheduling, pretty much everything apart from serial integer IPC has been done by taking the crumbs that the integer core does not eat.
 
Really.... I shouldn't need to, but... you can start by looking at the gross margin of the two companies followed by the net profit margin, and debt/equity ratio. Intel can lower their prices considerably and still make a profit; AMD can not. There is a reason AMD just fired 10% of their workforce.
All very true, and yet I don't see how any of that means AMD can't pose a threat to Intel. Again, just because Intel can afford to have a lower profit margin, doesn't mean that's what the investors want. The best guarantee for Intel to keep making billions, is to simultaneously increase performance and lower power consumption. Clock gated 2 x 256-bit FMA and 22 nm Tri-Gate can give them exactly that, so why would they make risky compromises instead? It's all about performance / Watt, so what better choice would they have?
 
Well, doesn't Fermi claim it has?
Just ignore RV770 and NI; they weren't really designed with an uncore for general computing.
The original claim was that "ever more complicated changes" have happened to increase cache hit rates.

I disputed this as there has been only one (or zero) generation with caches on the market, so it makes no sense to say that there have been changes to increase cache hits, changes which are pointless to discuss anyway. :rolleyes:
 
All very true, and yet I don't see how any of that means AMD can't pose a threat to Intel. Again, just because Intel can afford to have a lower profit margin, doesn't mean that's what the investors want. The best guarantee for Intel to keep making billions, is to simultaneously increase performance and lower power consumption. Clock gated 2 x 256-bit FMA and 22 nm Tri-Gate can give them exactly that, so why would they make risky compromises instead? It's all about performance / Watt, so what better choice would they have?

Because it's in a monopoly's shareholders' best interests to sit tight and squeeze its customers for every penny.
 
All the FMAs in the world won't make up for the lack of memory bandwidth. Until Intel starts selling a CPU soldered directly into a board, using small and expensive GDDR, throughput computations will be held up by lack of memory throughput on CPUs.
The A8-3850's GPU delivers 480 GFLOPS and uses ordinary DDR3-1866. A Haswell quad-core with two 256-bit FMA units per core would offer comparable floating-point performance. The L3 cache helps too, as would less aggressive prefetching for AVX-1024.
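As a rough sanity check (assuming single precision and a ~3.5 GHz clock, which is my own guess): 4 cores x 2 FMA units x 8 lanes x 2 FLOPs per FMA x 3.5 GHz ≈ 448 GFLOPS, which is indeed in the same ballpark as the A8-3850's 480 GFLOPS.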

And DDR4 will offer ample bandwidth scaling for years to come. Note that AMD aims to deliver APUs which achieve 10 TFLOPS, by 2020. To compete with that Intel has to keep a focus on the AVX roadmap.
 
Note that AMD aims to deliver APUs which achieve 10 TFLOPS, by 2020. To compete with that Intel has to keep a focus on the AVX roadmap.

This discussion has long since been derailed. If you think AMD is a threat, or poses a risk to Intel, then you should start a separate thread on how Intel might evolve AVX to compete with AMD, and keep this one for KC and its children.
 
That's two vector widenings away, and I worry you are overselling the benefits. It has certain advantages, but I do not think it is as revolutionary as you proselytize.
Executing AVX-1024 instructions on 256-bit units merely requires 96 extra registers (probably fewer); nothing needs widening. And I don't really consider it revolutionary. It doesn't offer higher peak throughput. What it does offer is the ability to keep the cache size reasonable by covering more miss latency. And it also provides new clock gating opportunities for the front-end, allowing scaling to more cores when the transistor budget allows it but the power consumption would otherwise not. Indirectly it may provide higher performance, but not a whole lot.
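A software analogy of what I mean, with a hypothetical fma1024() standing in for what the hardware would sequence internally over four cycles on the existing 256-bit unit (the front-end could be clock gated in the meantime):

```cpp
#include <immintrin.h>

// One "1024-bit" FMA, decoded once but executed as four 256-bit slices.
// The loop models the internal sequencing a 1024-bit instruction would get
// on a 256-bit execution unit; it is an analogy, not real hardware behaviour.
void fma1024(float* acc, const float* a, const float* b) {
    for (int slice = 0; slice < 4; ++slice) {            // 4 x 256-bit = 1024-bit
        __m256 va   = _mm256_loadu_ps(a + 8 * slice);
        __m256 vb   = _mm256_loadu_ps(b + 8 * slice);
        __m256 vacc = _mm256_loadu_ps(acc + 8 * slice);
        _mm256_storeu_ps(acc + 8 * slice, _mm256_fmadd_ps(va, vb, vacc));
    }
}
```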
This does not mean I necessarily care about GPGPU, which is more easily combated with SIMD extensions, to a point. My concern is the proper mixture of dedicated and generalized hardware to best meet performance and power needs.
What goes into the ISA is certainly of critical importance. A vectorized version of BMI2 might be worth considering.
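For reference, this is what scalar BMI2 PEXT does today, plus a note on what a vectorized variant might look like; the vector form is pure speculation on my part:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Scalar BMI2 today: PEXT gathers the bits selected by the mask into the
    // low bits of the result (compile with BMI2 enabled, e.g. -mbmi2).
    std::uint64_t src  = 0x12345678ull;
    std::uint64_t mask = 0x0F0F0F0Full;
    std::uint64_t packed = _pext_u64(src, mask);   // -> 0x2468
    std::printf("%llx\n", (unsigned long long)packed);

    // A hypothetical "vectorized BMI2" would do the same per 64-bit lane of a
    // YMM register -- useful for per-lane bit packing without leaving the
    // SIMD domain. No such instruction exists today; this is speculation.
    return 0;
}
```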
 
This does not mean I necessarily care about GPGPU, which is more easily combated with SIMD extensions, to a point. My concern is the proper mixture of dedicated and generalized hardware to best meet performance and power needs.
Homogeneous solutions do not meet Haswell's desktop targets, where four cores will likely be sufficient to max out TDP in the same manner as they can for SB.
Actually, shouldn't this thread be about 22nm LRB?
22nm LRB and 22nm Haswell may go on the market at the same time.
Fighting GPGPU is LRB's job, and it has 512-bit LNI.

That's their software development manual, and as far as I know it doesn't talk about hardware units at all. They merely try to explain that 256-bit AVX operations are split into two 128-bit lanes on which a legacy SSE operation is performed (except for a handful of cross-lane instructions).
If legacy software can execute like that, why should we pay attention to whether its hardware units are combined or not...

The original claim was that "ever more complicated changes" have happened to increase cache hit rates.

I disputed this as there has been only one (or zero) generation with caches on the market, so it makes no sense to say that there have been changes to increase cache hits, changes which are pointless to discuss anyway. :rolleyes:
Well, they haven't needed much until now, so they think a single generational change will be enough. But not for a fully programmable design.
 
Why would it take "a lot" of area and power? Intel has doubled the SIMD execution width and cache bandwidth before while keeping power consumption in the same ballpark: T2700 -> T7800. Note the higher clock frequency, FSB speed, twice the cache, and x86-64 support, all on the same process node!
They have a significant difference in voltage.
Also, 4 W is not really the same ballpark for a notebook computer.
Actually, I think the reason the 2600K with AVX doesn't seem out of line on power consumption is that it started using a physical register file, which saves a lot of space.
 
Optimizing one thing means, by definition, not optimizing the rest.
Sure, but your premise is incorrect. With 2 x 256-bit FMA, Haswell would not only optimize floating-point performance. We also know it will be manufactured on a vastly superior process, optimizing everything.
Also, not increasing performance along one axis is not sacrificing it when there is no competition around.
You still fail to see that it's impossible for Haswell to maintain the same performance level. We know for a fact that it supports FMA, and we also know that with a single FMA unit the practical performance would be lower than with independent MUL and ADD units. If it does that, it becomes very vulnerable to the competition.

FMA + ADD would be the closest thing to not increasing performance and not sacrificing legacy performance. But it's of little use due to the prevalence of multiplications. In other words, it would be a waste of transistors and power, exactly what they want least of all. 2 x FMA also costs transistors and power, but offers a substantial increase in performance in return.
Haswell has been designed around low power, through and through.
Not according to this: Shark Bay Platforms. There's a new ultrabook segment, but the desktop and laptop products target the same TDP levels as Sandy Bridge.
 