22 nm Larrabee

Nick · Nov 22, 2011

rpg.314 said:
Because it's in a monopoly's shareholders' best interest to sit tight and screw dollars on the penny out of its customers.

It's not a monopoly.

Nick · Nov 22, 2011

rpg.314 said:
This discussion has long since been derailed. If you think AMD is a threat, or poses a risk to Intel, then you should start a separate thread on how Intel might evolve AVX to compete with AMD and keep this one for KC and its children.

I consider AVX(2) to be a child of LRBni.

rpg.314 · Nov 22, 2011

Nick said:
Why would it take "a lot" of area and power? Intel has doubled the SIMD execution width and cache bandwidth before while keeping power consumption in the same ballpark: T2700 -> T7800. Note the higher clock frequency, FSB speed, twice the cache, and x86-64 support, all on the same process node!

The price is not mentioned, so we cannot make a fair comparison, but one of the two is rated at $621, so it's for the 1% anyway. There is no mention of a cache bandwidth increase, and the 64-bit support is just a price-gouging tactic.

Besides, they are both apparently from the Conroe family, so I am not sure why you are saying that the vector width is different. Is that a difference that is not listed here?

No, Sandy Bridge has independent MUL + ADD. Applications would suffer badly if Haswell supported only one FMA since it can only execute dependent MUL/ADD and only when using FMA3 instructions.

I am assuming that the FMA unit will also be able to use its multiplier and adder simultaneously. So that should sustain the status quo at a minimum. Besides, the FMA might just decode to 2 uops anyway. We won't know until we see it.

And for the last time, this is the wrong thread for speculation on future Intel CPUs.

denev2004 · Nov 22, 2011

Nick said:
Not according to this: Shark Bay Platforms. There's a new ultrabook segment, but the desktop and laptop products target the same TDP levels as Sandy Bridge.

Didn't Intel say that more power will be consumed by its own IGP, not the CPU?
I think the CPU part has nearly the same power as Ivy Bridge.

rpg.314 said:
I am assuming that the FMA unit will also be able to use its multiplier and adder simultaneously. So that should sustain the status quo at a minimum. Besides, the FMA might just decode to 2 uops anyway. We won't know until we see it.

And for the last time, this is the wrong thread for speculation on future Intel CPUs.

If so, that's good. Most common applications can't use both a 256-bit MUL and a 256-bit ADD simultaneously.

Nick · Nov 22, 2011

denev2004 said:
Also, 4 W is not really the same ballpark for a notebook computer.

You totally missed the point. It has twice the SIMD width, twice the cache bandwidth, higher clock frequency, higher FSB speed, twice the cache size, and x86-64 support. Yet all of that only increased TDP by a fraction. It's an indication that doubling the SIMD width and cache bandwidth, the only things relevant to the discussion, likely won't increase power consumption by a lot. And given that Haswell will use a new process with exceptional energy efficiency, there should actually be plenty of headroom for ultrabook products, despite vastly increased throughput.

Nick · Nov 22, 2011

rpg.314 said:
The price is not mentioned, so we cannot make a fair comparison, but one of the two is rated at $621, so it's for the 1% anyway.

That's because I selected the highest clocked models in their TDP class, to get as close as possible to a fair comparison. You're most welcome to compare other relevant models.

There is no mention of a cache bandwidth increase, and the 64-bit support is just a price-gouging tactic.

Besides, they are both apparently from the Conroe family, so I am not sure why you are saying that the vector width is different. Is that a difference that is not listed here?

T2700 is Yonah (Core Duo). T7800 is Merom (Core 2 Duo). Core 2 features Advanced Digital Media Boost and Advanced Smart Cache, or as I like to call it, twice the SIMD width and twice the cache bandwidth.

I am assuming that the FMA unit will also be able to use its multiplier and adder simultaneously.

Please clarify what you mean by "simultaneously".

Besides, the FMA might just decode to 2 uops anyway.

Please explain how you'd implement a fused multiply-add operation in two uops.
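To make the question concrete: a fused multiply-add rounds once, after the addition, whereas a MUL uop followed by an ADD uop rounds the product before adding. The sketch below emulates both behaviours in Python with exact rationals. The helper names `mul_then_add` and `fused_multiply_add` and the chosen inputs are illustrative assumptions, not a description of any actual hardware.

```python
from fractions import Fraction

def mul_then_add(a: float, b: float, c: float) -> float:
    """Two separate operations: the product is rounded, then the sum is rounded."""
    product = a * b        # first rounding to double precision
    return product + c     # second rounding

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)    # single rounding

# Inputs chosen so that the exact product, 1 - 2**-60, does not fit in a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mul_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

With these inputs the two results differ, so a two-uop split that keeps only a double-precision intermediate would behave like `mul_then_add` rather than like a fused operation.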

And for the last time, this is the wrong thread for speculation on future Intel CPUs.

You might want to check who started this thread. I also suggested early on that AVX could use LRBni type instructions, in particular gather. So when AVX2 was announced the discussion naturally started to focus on CPUs instead of MICs.

Nick · Nov 22, 2011

denev2004 said:
Most common applications can't use both a 256-bit MUL and a 256-bit ADD simultaneously.

Why would it be any different from 128-bit SSE?

denev2004 · Nov 22, 2011

Nick said:
Why would it be any different from 128-bit SSE?

I just mean they don't need it.

Nick said:
Please clarify what you mean by "simultaneously".
?

Is that the same as SSE?

Nick · Nov 22, 2011

denev2004 said:
I just mean they don't need it.

In what sense? Are you referring to the 256-bit, the separate ports for MUL and ADD, or just floating-point performance in general? Please motivate.

Keep in mind that AVX2 doubles integer SIMD width and adds gather support. It makes auto-vectorization a lot easier. So it has the potential to eventually speed up pretty much everything.
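As a rough illustration of the gather point: a loop with indexed loads such as `table[indices[i]]` cannot be expressed with plain contiguous vector loads, while a gather performs several such indexed loads as one operation. The sketch below models that in Python for eight 32-bit lanes (what a 256-bit register holds). The function names and the data are hypothetical, and this is a behavioural model only, not the instruction's actual encoding or timing.

```python
LANES = 8  # a 256-bit register holds eight 32-bit elements

def gather8(table, indices):
    """Model of a vector gather: lane i receives table[indices[i]]."""
    assert len(indices) == LANES
    return [table[i] for i in indices]

def lookup_scalar(table, indices):
    """The scalar loop a compiler starts from: one indexed load per iteration."""
    out = []
    for i in indices:
        out.append(table[i])
    return out

def lookup_vectorized(table, indices):
    """The same loop, eight elements per step (assumes len(indices) % LANES == 0)."""
    out = []
    for start in range(0, len(indices), LANES):
        out.extend(gather8(table, indices[start:start + LANES]))
    return out

table = [x * x for x in range(32)]
indices = [5, 0, 31, 7, 7, 12, 3, 20, 1, 2, 4, 8, 16, 30, 9, 11]
print(lookup_vectorized(table, indices) == lookup_scalar(table, indices))
```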

Is that the same as SSE?

All SSE implementations I know of have separate ports for MUL and ADD.

fellix · Nov 22, 2011

denev2004 · Nov 23, 2011

Nick said:
In what sense? Are you referring to the 256-bit, the separate ports for MUL and ADD, or just floating-point performance in general? Please motivate.

I'm just starting to wonder whether programs need both MUL and ADD at exactly the same time.

Nick · Nov 23, 2011

denev2004 said:
I'm just starting to wonder whether programs need both MUL and ADD at exactly the same time.

Of course they do. As you can see in the image fellix posted, Sandy Bridge considers up to 54 uops for execution each cycle. The chances of finding both an independent MUL and an independent ADD in floating-point intensive code are quite high. As I've mentioned earlier, Intel has had separate ports for MUL and ADD since the Pentium Pro. Also note that with Hyper-Threading the instructions come from two threads, further increasing the chances of finding independent instructions.

Having two FMA units would be vastly superior though. Not only can it execute an FMA & FMA combination, it can also execute MUL & FMA, ADD & FMA, ADD & MUL, ADD & ADD, and MUL & MUL. The latter two combinations also help with legacy code. AMD already has two FMA units in Bulldozer, and although they're only 128-bit each, it would clearly be a huge mistake for Intel to equip Haswell with only one 256-bit FMA unit. In particular for legacy code it would only be capable of executing either a MUL or an ADD each cycle, instead of all the above combinations.
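The throughput argument can be put into back-of-the-envelope numbers. The sketch below counts peak single-precision operations per cycle for the unit layouts mentioned in this thread, assuming every unit starts one vector operation per cycle and ignoring everything else (decode, memory, scheduling). The function names are hypothetical, and the one-FMA and two-FMA Haswell layouts are the speculation under discussion, not confirmed designs.

```python
def peak_flops_per_cycle(units, element_bits=32):
    """Peak operations per cycle for code that can use every unit fully.

    Each unit is (width_bits, fused): a fused unit counts two operations
    (multiply and add) per lane, a plain MUL or ADD unit counts one.
    """
    total = 0
    for width_bits, fused in units:
        lanes = width_bits // element_bits
        total += lanes * (2 if fused else 1)
    return total

def legacy_flops_per_cycle(units, element_bits=32):
    """Legacy code issues only separate MUL or ADD: one operation per lane per unit."""
    return sum(width_bits // element_bits for width_bits, _ in units)

sandy_bridge = [(256, False), (256, False)]  # one MUL port plus one ADD port
one_fma_256 = [(256, True)]                  # speculative: a single 256-bit FMA unit
two_fma_256 = [(256, True), (256, True)]     # speculative: two 256-bit FMA units
bulldozer = [(128, True), (128, True)]       # two 128-bit FMA units, as stated above

for name, units in [("Sandy Bridge", sandy_bridge), ("1 x 256-bit FMA", one_fma_256),
                    ("2 x 256-bit FMA", two_fma_256), ("Bulldozer", bulldozer)]:
    print(name, peak_flops_per_cycle(units), legacy_flops_per_cycle(units))
```

Under these assumptions a single 256-bit FMA unit only matches Sandy Bridge's peak when the code uses FMA, and halves it for legacy MUL/ADD code, which is the regression described above.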

It's also worth noting that with Sandy Bridge all three execution ports already feature a full scalar integer ALU for basic operations. So these won't become a bottleneck when you have mixed code. To further increase scalar integer execution rate, it has also been suggested that Haswell may feature more advanced macro-op fusion, enabling it to execute dependent MOV + ALU instructions as if they were one instruction with non-destructive operands. Note that this merely requires minor decoder changes.

Nick · Nov 23, 2011

Here's one more reason why 2 x 256-bit FMA for Haswell makes most sense:

Ivy Bridge will feature support for 16-bit floating-point values, which is very useful for software vertex processing. It would be a waste to leave the CPU cores' GFLOPS unused while the IGP is swamped. Since Haswell is expected to have an incrementally faster IGP, the CPU cores have to keep up and 2 x 256-bit FMA would offer that while also improving performance / Watt over Ivy Bridge.

It would provide a direct return on investment, on top of what the lower 128-bit FMA lanes offer for legacy SSE applications. Note that Intel really has no interest in creating something that sacrifices CPU performance like Fusion instead. That would be a risky bet on GPGPU and make them very vulnerable to the competition's offerings. Instead, Intel needs to keep doing what it does best: providing high performance generic processors that allow developers to create any sort of application.

Software vertex processing, if you really want to call it that, would be part of a gentle transition toward a homogeneous architecture. And as GPUs continue to become more programmable, Intel can afford transitioning more work to the CPU side. For instance when the ROP's tasks are moved to the shader cores the GPU's performance / Watt would become worse (compensated by new process technology), allowing Intel to also move more work to software while remaining competitive. It avoids the problems Larrabee faced, while still extending the use of x86 and benefiting from its software ecosystem.

3dilettante · Nov 23, 2011

Nick said:
Ivy Bridge will feature support for 16-bit floating-point values, which is very useful for software vertex processing. It would be a waste to leave the CPU cores' GFLOPS unused while the IGP is swamped.

That's only a loss if the CPUs can't be power gated off due to inactivity.

Software vertex processing, if you really want to call it that, would be part of a gentle transition toward a homogeneous architecture.

Intel did this for the previous generations of GMA hardware before transitioning away from doing so.
It could be due for a pendulum swing back, if taking work from specialized low-clocking hardware and firing up one or more high-speed OoO engines is supposed to be a power efficiency gain, which is not the impression I'm getting so far.

Gipsel · Nov 23, 2011

Nick said:
Here's one more reason why 2 x 256-bit FMA for Haswell makes most sense:

Ivy Bridge will feature support for 16-bit floating-point values, which is very useful for software vertex processing.

Vertex processing?!? IIRC the vertex shaders were first to support 32-bit floats before the pixel shaders could do it (back in the times before unified shaders got common).

Nick · Nov 23, 2011

3dilettante said:
That's only a loss if the CPUs can't be power gated off due to inactivity.

Games are becoming increasingly multi-threaded, as are the drivers, so you can largely forget about gating off CPU cores anyhow.

Intel did this for the previous generations of GMA hardware before transitioning away from doing so.

The main problem was that they relied on Microsoft's software vertex processing. Unfortunately some games downright refuse to run on Direct3D implementations that don't report hardware vertex processing support, for no good reason.

It could be due for a pendulum swing back, if taking work from specialized low-clocking hardware and firing up one or more high-speed OoO engines is supposed to be a power efficiency gain, which is not the impression I'm getting so far.

Power efficiency is not the all-determining factor to (not) make a move like this. If it were, we'd still have separate vertex and pixel shader cores. Heck, I'm absolutely certain that the vast majority of people wouldn't care if speeding up Intel's graphics required increased power consumption. Note also that there was a clear pendulum swing for both dedicated sound and physics processing, and there's no sign of it ever coming back.

At a certain point a more generic core is fully adequate and since it serves more than one role and offers future potential it's a better value, regardless of a certain power consumption discrepancy. And there's still AVX-1024 to have things swing back in favor of the CPU even more.

Nick · Nov 23, 2011

Gipsel said:
Vertex processing?!? IIRC the vertex shaders were first to support 32-bit floats before the pixel shaders could do it (back in the times before unified shaders got common).

These instructions I'm talking about are for converting between 16-bit and 32-bit floating-point formats. Actual arithmetic operations still use at least 32-bit. But 16-bit floating-point formats have a major use case for vertex processing, since it's a popular compact data type for vertex buffers. Gather support is also quite useful for vertex processing since it avoids having to explicitly transpose the data.
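A small sketch of that storage-versus-arithmetic split, using Python's `struct` module (its `e` format code is an IEEE 754 16-bit float): components live in the vertex buffer as 16-bit values and are widened before any math is done. The interleaved x, y, z layout and the helper names are assumptions made for illustration.

```python
import struct

BYTES_PER_VERTEX = 6  # three 16-bit components

def pack_half_vertices(vertices):
    """Store (x, y, z) tuples as interleaved 16-bit floats."""
    return b"".join(struct.pack("<3e", *v) for v in vertices)

def unpack_half_vertices(buffer):
    """Widen every 16-bit component back to a full float before doing arithmetic."""
    count = len(buffer) // BYTES_PER_VERTEX
    return [struct.unpack_from("<3e", buffer, i * BYTES_PER_VERTEX)
            for i in range(count)]

vertices = [(0.5, -1.25, 2.0), (0.1, 100.0, -0.75)]
buffer = pack_half_vertices(vertices)
decoded = unpack_half_vertices(buffer)
print(len(buffer), decoded)
```

The buffer is half the size of a 32-bit layout, at the cost of precision: values such as 0.1 come back slightly changed, while values exactly representable in 16 bits survive the round trip.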

Gipsel · Nov 23, 2011

And I thought it was common because of the 16-bit (per component) formats for textures and the framebuffer.

nAo · Nov 23, 2011

Half-precision coordinates for geometry are often not such a good choice, as they make it quite hard to avoid cracks between meshes (think about environment geometry split into multiple objects). Better to use fixed-point coordinates (shorts).
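One way to see the cracking problem, sketched under assumed numbers: two meshes share an edge vertex at the same world position but store it relative to different local origins. Half-precision spacing depends on magnitude, so the two stored values decode to different world positions. Fixed-point values on a common grid do not have that problem as long as the origins sit on the grid. The grid step, origins, and coordinate below are hypothetical.

```python
import struct

def to_half(value):
    """Round a value to the nearest 16-bit float and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

GRID = 1.0 / 16.0  # hypothetical fixed-point step shared by all meshes

def to_short(value):
    """Quantize to a signed 16-bit count of GRID steps."""
    steps = round(value / GRID)
    assert -32768 <= steps <= 32767
    return steps

shared_x = 1000.3                 # world-space coordinate on the shared edge
origin_a, origin_b = 0.0, 1000.0  # local origins of the two meshes (grid-aligned)

# Half precision: each mesh stores the vertex relative to its own origin.
half_a = origin_a + to_half(shared_x - origin_a)
half_b = origin_b + to_half(shared_x - origin_b)

# Fixed point: both meshes land on the same grid step in world space.
fixed_a = (round(origin_a / GRID) + to_short(shared_x - origin_a)) * GRID
fixed_b = (round(origin_b / GRID) + to_short(shared_x - origin_b)) * GRID

print(half_a, half_b, fixed_a, fixed_b)
```

Here the half-precision copies disagree by roughly 0.2 units, which would show up as a gap or overlap along the seam, while the fixed-point copies coincide exactly.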

cal_guy · Nov 28, 2011

Nick said:
Having two FMA units would be vastly superior though. Not only can it execute an FMA & FMA combination, it can also execute MUL & FMA, ADD & FMA, ADD & MUL, ADD & ADD, and MUL & MUL. The latter two combinations also help with legacy code. AMD already has two FMA units in Bulldozer, and although they're only 128-bit each, it would clearly be a huge mistake for Intel to equip Haswell with only one 256-bit FMA unit. In particular for legacy code it would only be capable of executing either a MUL or an ADD each cycle, instead of all the above combinations.

What about just one 256-bit ADD unit and one 256-bit FMA/MUL unit? Bulldozer does something similar on the vector integer side.
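To compare this layout with the others discussed in the thread, the sketch below enumerates which pairs of operations each layout could start in the same cycle. It assumes each unit accepts one operation per cycle, that a full FMA unit can also execute a plain MUL or ADD, and that the FMA/MUL unit in the asymmetric layout does not take plain ADDs. All names and the layout table are hypothetical.

```python
from itertools import combinations_with_replacement

OPS = ("ADD", "MUL", "FMA")

# For each layout: one set per unit, listing the operations that unit accepts.
CONFIGS = {
    "MUL port + ADD port": [{"MUL"}, {"ADD"}],
    "one FMA unit": [{"ADD", "MUL", "FMA"}],
    "two FMA units": [{"ADD", "MUL", "FMA"}, {"ADD", "MUL", "FMA"}],
    "ADD unit + FMA/MUL unit": [{"ADD"}, {"MUL", "FMA"}],
}

def can_dual_issue(units, first, second):
    """True if the two operations can go to two different units in one cycle."""
    for i, unit_a in enumerate(units):
        for j, unit_b in enumerate(units):
            if i != j and first in unit_a and second in unit_b:
                return True
    return False

def dual_issue_pairs(units):
    """All unordered operation pairs the layout can start in the same cycle."""
    return [pair for pair in combinations_with_replacement(OPS, 2)
            if can_dual_issue(units, *pair)]

for name, units in CONFIGS.items():
    print(name, dual_issue_pairs(units))
```

Under these assumptions the asymmetric layout keeps ADD & MUL and adds ADD & FMA, but loses FMA & FMA, MUL & FMA, ADD & ADD, and MUL & MUL relative to two full FMA units.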
