22 nm Larrabee

rpg.314 · Nov 18, 2011

Nick said:
I'm convinced it will take three micro-instructions (one to extract the mask, one to perform the actual gather, and one to perform the final blend), with a maximum throughput of one instruction per cycle when all elements are from the same cache line. The micro-op breakdown is obvious considering that Haswell won't support FMA4 and the current vblendv instruction can take three 256-bit source registers thanks to a movmsk micro-instruction which extracts the mask and passes it as an immediate to a vblend micro-instruction.

I'm also expecting Haswell to feature one regular 256-bit load unit and one 256-bit gather unit per core. It needs the extra L1 bandwidth for the FMA instructions, and this setup would allow the gather unit to have a slightly higher latency and thus a reasonably power efficient implementation.

It would be irrational of Intel to aim for anything less. LRBni supports 512-bit gather, and they wouldn't add gather to AVX2 if it wasn't efficient. 2 x 256-bit execution begs for a high performance gather instruction. Also, there's nothing reasonable in between sequential extract/insert, and gathering from one cache line at a time, so it has to be the latter. And lastly note that the coherency rules for the instruction are consistent with such an implementation.

I think the decoding granularity for gather would be somewhere between 1 uop/cacheline and 1uop/page for the first implementation. Intel has a track record of exposing ISA and gradually speeding up the instruction.

Also, there is no reason to believe whatsoever that Haswell will have 2x256b FP hw/core. Quite the contrary.

Nick · Nov 18, 2011

rpg.314 said:
I think the decoding granularity for gather would be somewhere between 1 uop/cacheline and 1uop/page for the first implementation.

What do you mean? The decoder has no knowledge of the cachelines or pages involved.

Also, there is no reason to believe whatsoever that Haswell will have 2x256b FP hw/core. Quite the contrary.

Sandy Bridge already has 2 x 256-bit FP per core. And they're not going to risk letting AMD take the lead by doubling the width of the FlexFPU unit. Furthermore, Larrabee / Knight's Corner has 512-bit units. So why would you think Haswell will take a step back? What configuration do you have in mind instead?

rpg.314 · Nov 19, 2011

Nick said:
What do you mean? The decoder has no knowledge of the cachelines or pages involved.

I mean decoding to 8 load uops and then fusing them before TLB lookup and/or fusing the nearby entries in load queue. Should have explained it better.

Sandy Bridge already has 2 x 256-bit FP per core. And they're not going to risk letting AMD take the lead by doubling the width of the FlexFPU unit. Furthermore, Larrabee / Knight's Corner has 512-bit units. So why would you think Haswell will take a step back? What configuration do you have in mind instead?

I am counting SB as 1x256 as it does one mul/add per clock.

Haswell was designed with low power in mind above all else. There's now way they are throwing lots of FP hw at it.

AMD is in no position to compete. Intel has known this for ~5 years now.

mczak · Nov 19, 2011

rpg.314 said:
I am counting SB as 1x256 as it does one mul/add per clock.

So are you suggesting Haswell will only do one fma per clock?

rpg.314 · Nov 19, 2011

mczak said:
So are you suggesting Haswell will only do one fma per clock?

I'd be surprised if it did more.

Nick · Nov 19, 2011

rpg.314 said:
I mean decoding to 8 load uops and then fusing them before TLB lookup and/or fusing the nearby entries in load queue. Should have explained it better.

It can't be just 8 load uops. You need to extract the index values and insert the loaded elements, so you're looking at 24 uops. But that's no better than what we get today with a sequence of extractps/insertps instructions. Also, vgather takes a mask operand, and writes it back when the instruction gets interrupted, so where exactly would you take that into account?

Instead of splitting it at the decoder and fusing in the load unit, it's much simpler and more efficient to just split in the load unit, when required. First a vmovmskps uop extracts the mask, which is sent to the load unit together with the full VSIB data by the actual vgather uop. The load unit computes the virtual address of the first element in the mask, and checks which other elements access the same cache line. It queues this cache line fetch and the offsets for each of the elements it wants, and updates the mask. If there are remaining elements, in the next clock cycle the load unit computes the virtual address of the next element in the mask, and again check which other elements access the same cache line and queues it. This is repeated till all the elements are, well, gathered. The third uop writes back the result with a vblend operation.

I am counting SB as 1x256 as it does one mul/add per clock.

Having one execution port that can take a 256-bit MAD operation is nowhere near the same as having two ports for a 256-bit ADD and a 256-bit MUL. The latter can be turned into two 256-bit MAD ports by adding a third source operand, while the former would require an additional execution port, scheduler, bypass network, higher decoding rate, etc. Sandy Bridge already has the two 256-bit execution ports, so it's not a huge step to make it capable of two 256-bit MAD operations.

Haswell was designed with low power in mind above all else. There's now way they are throwing lots of FP hw at it.

Haswell chips will most likely range from 2 to 8+ cores, and 15 to 130 Watt TDP. Just because 22 nm Tri-Gate enables much lower power consumption at low voltage and frequency for the ultrabook market, doesn't mean it's not intended to outperform Sandy Bridge for the desktop and laptop market!

It's really all about performance/Watt. And if there's one thing that GPU architectures teach us, it's that wide vectors offer cheap and power efficient performance. So I don't see a single reason to doubt that Haswell will have a measly two 256-bit MAD units per core.

Also keep in mind that AVX-1024 would lower power consumption, but we need high performance AVX2 first. Gather also saves power compared to the alternative.

AMD is in no position to compete. Intel has known this for ~5 years now.

Bulldozer has two 128-bit MAD units per module. Turning this into two 256-bit units at the next process shrink is even simpler than what Intel has to do design wise to get there. Also last time I checked Intel doesn't have any 16-core CPUs for 523 dollar, nor any 8-core for 266 dollar, nor any 6-core for 130 dollar. So it would be a bad mistake to underestimate AMD at this point.

Nick · Nov 19, 2011

rpg.314 said:
I'd be surprised if it did more.

That would mean Bulldozer outperforms it at legacy 128-bit SSE code. Not gonna happen.

rpg.314 · Nov 19, 2011

Nick said:
It can't be just 8 load uops. You need to extract the index values and insert the loaded elements, so you're looking at 24 uops. But that's no better than what we get today with a sequence of extractps/insertps instructions. Also, vgather takes a mask operand, and writes it back when the instruction gets interrupted, so where exactly would you take that into account?

Instead of splitting it at the decoder and fusing in the load unit, it's much simpler and more efficient to just split in the load unit, when required. First a vmovmskps uop extracts the mask, which is sent to the load unit together with the full VSIB data by the actual vgather uop. The load unit computes the virtual address of the first element in the mask, and checks which other elements access the same cache line. It queues this cache line fetch and the offsets for each of the elements it wants, and updates the mask. If there are remaining elements, in the next clock cycle the load unit computes the virtual address of the next element in the mask, and again check which other elements access the same cache line and queues it. This is repeated till all the elements are, well, gathered. The third uop writes back the result with a vblend operation.

That much seems fair. But what about the TLB lookups?

Having one execution port that can take a 256-bit MAD operation is nowhere near the same as having two ports for a 256-bit ADD and a 256-bit MUL. The latter can be turned into two 256-bit MAD ports by adding a third source operand, while the former would require an additional execution port, scheduler, bypass network, higher decoding rate, etc. Sandy Bridge already has the two 256-bit execution ports, so it's not a huge step to make it capable of two 256-bit MAD operations.

What about the actual area/power of the multipliers and adders? What about the 2x throughput of L1 and L2? The effort involved is more than the encoding space for the third operand.

Haswell chips will most likely range from 2 to 8+ cores, and 15 to 130 Watt TDP. Just because 22 nm Tri-Gate enables much lower power consumption at low voltage and frequency for the ultrabook market, doesn't mean it's not intended to outperform Sandy Bridge for the desktop and laptop market!

It's 2 to 4 cores, from 15 to 95W.

I am ignoring the equivalent of Haswell-E.

rpg.314 · Nov 19, 2011

Nick said:
That would mean Bulldozer outperforms it at legacy 128-bit SSE code. Not gonna happen.

Doesn't BD lose to SB for SSE codes today? Will leaving FP alone somehow slowdown Haswell?

Nick · Nov 19, 2011

rpg.314 said:
That much seems fair. But what about the TLB lookups?

No difference there. The load unit issues one cache line fetch per cycle.

What about the actual area/power of the multipliers and adders? What about the 2x throughput of L1 and L2? The effort involved is more than the encoding space for the third operand.

Again, GPUs sustain very high throughput per Watt. So why would it be all that hard to equip a CPU with similar wide SIMD units and keep power consumption in check? Also don't forget the impressive characteristics of 22 nm Tri-Gate, which will have fully matured by the time Haswell hits the streets.

It's 2 to 4 cores, from 15 to 95W.

I am ignoring the equivalent of Haswell-E.

I'm not.

Nick · Nov 19, 2011

rpg.314 said:
Doesn't BD lose to SB for SSE codes today?

As far as I've gathered it's a wash. Sometimes having two MAD units helps because you can also execute two multiplies or two additions simultaneously, instead of just one multiplication and one addition. But Bulldozer suffers from high latencies.

Will leaving FP alone somehow slowdown Haswell?

Haswell adds FMA support. The way to do that and not make it slower than Bulldozer or Sandy Bridge, is by having two 256-bit MAD units.

rpg.314 · Nov 19, 2011

Nick said:
Again, GPUs sustain very high throughput per Watt. So why would it be all that hard to equip a CPU with similar wide SIMD units and keep power consumption in check? Also don't forget the impressive characteristics of 22 nm Tri-Gate, which will have fully matured by the time Haswell hits the streets.

That is just dodging the question.

I'm not.

If the 99% is not able to purchase it, then the 99% is going to care for it. And neither am I.

Haswell adds FMA support. The way to do that and not make it slower than Bulldozer or Sandy Bridge, is by having two 256-bit MAD units.

What's wrong with 1x256b MAD?

fellix · Nov 19, 2011

Two 256-bit FMA pipes per core could be an overkill, but we still don't know anything about the Haswell's load/store pipeline capabilities. Probably Intel could settle for asymmetric ALU design with FMA + MUL "co-issue" organization, or FMA + ADD.

ninelven · Nov 19, 2011

Nick said:
Also last time I checked Intel doesn't have any 16-core CPUs for 523 dollar, nor any 8-core for 266 dollar, nor any 6-core for 130 dollar. So it would be a bad mistake to underestimate AMD at this point.

Intel also tends to make $$$, not lose it.

mczak · Nov 20, 2011

fellix said:
Two 256-bit FMA pipes per core could be an overkill, but we still don't know anything about the Haswell's load/store pipeline capabilities. Probably Intel could settle for asymmetric ALU design with FMA + MUL "co-issue" organization, or FMA + ADD.

MUL + FMA doesn't make sense imho, if anything I'd suspect it would be ADD + FMA.
It makes the pipes asymmetric wrt operand fetch though which would be a somewhat odd design (and for pure MAD code the throughput would be the same as with some weird fma implementation using both pipes simultaneously, that is for throughput it would be exactly the same as now).

denev2004 · Nov 20, 2011

rpg.314 said:
That is just dodging the question.

I think what he means if GPU can deal with it with an even worse cache system, there should be a way for CPU.
Actually Intel's approach in desktop CPU is too conservative.
The die size of Itanium/Xeon MP is twice even three times as much as that in desktop CPU.
Intel might should sacrifice the rate of good products to some extened
Also, redesigning the cache system, I think a lager L2 might help.

rpg.314 said:
What's wrong with 1x256b MAD?

I just wondering wether it is enough...
Intel's idea is using vector system in order to reach a high level of DLP to gain high performance.
1x256b FMAD doesn't seen enough....Both NVIDIA's GPUs and AMD's SIMD units (I mean, both GPU and APU) will be highly programmable in 2013.

rpg.314 · Nov 20, 2011

denev2004 said:
I think what he means if GPU can deal with it with an even worse cache system, there should be a way for CPU.

No. Read that bit of conversation again.

When I pointed out that the cost of 2nd 256b FMA unit was more the encoding space for third operand, he responded with some thing to do with vectors and GPUs.

That does not in any way address the cost of adding the 2nd FP unit and the business advantage/engineering constraints trade-offs involved in a CPU whose primary design goal is low power.

I just wondering wether it is enough...
Intel's idea is using vector system in order to reach a high level of DLP to gain high performance.
1x256b FMAD doesn't seen enough....Both NVIDIA's GPUs and AMD's SIMD units (I mean, both GPU and APU) will be highly programmable in 2013.

A CPU can address DLP, at best, as an afterthought when it is designed with serial integer IPC/W above all else in the first place. It needs to delegate that bit to ... ahem ... specialists.

3dilettante · Nov 20, 2011

Having FMA pipes may actually lead to power savings with integer-only code.
Sandy Bridge already shows evidence of extensive clock gating that inserts warmup periods when the FPU has been idle for too long.

SB has to keep two AVX units online, one for the MUL and one for the ADD in case one of those ops comes by.
With an FMA, it only needs one pipe on to service both.
The warmup period may have extra steps from idle to full 256 bit FMA throughput.

denev2004 · Nov 20, 2011

rpg.314 said:
No. Read that bit of conversation again.

When I pointed out that the cost of 2nd 256b FMA unit was more the encoding space for third operand, he responded with some thing to do with vectors and GPUs.

That does not in any way address the cost of adding the 2nd FP unit and the business advantage/engineering constraints trade-offs involved in a CPU whose primary design goal is low power.

A CPU can address DLP, at best, as an afterthought when it is designed with serial integer IPC/W above all else in the first place. It needs to delegate that bit to ... ahem ... specialists.

Oh. I do forgot about INT IPC/W indeed. That does sounds reasonable in many application.

Nick · Nov 20, 2011

rpg.314 said:
That is just dodging the question.

How is that dodging the question? The GF116-400 has 192 FMA ALUs clocked at 1800 Mhz, on 40 nm. So why would it be such a big deal to have a quad-core Haswell with a mere 64 FMA ALUs clocked at 4 Ghz, on 22 nm Tri-Gate? And since Sandy Bridge has two 256-bit ports, we're over half way there already!

If the 99% is not able to purchase it, then the 99% is going to care for it. And neither am I.

For the desktop market there will indeed be few 8-core sales (though more than 1%). But that's not even the point. Haswell is -not- exclusively designed for low power. Tri-Gate enables a massive decrease in power consumption with a modest decrease in clock frequency. But it just as much enables higher performance at the same power consumption. So just because Haswell will be suited for ultrabooks doesn't mean they had to make any compromises for other markets.

What's wrong with 1x256b MAD?

Everything! All applications using floating-point calculations would suffer. x86 has had dual FP ports since the Pentium Pro, 16 years ago. And even if magically all applications would suddenly use FMA operations, it would still be slower than having a separate adder and multiplier.

22 nm Larrabee

rpg.314

Nick

rpg.314

mczak

rpg.314

Nick

Nick

rpg.314

rpg.314

Nick

Nick

rpg.314

fellix

ninelven

PM

mczak

denev2004

rpg.314

3dilettante

denev2004

Nick

Similar threads