22 nm Larrabee

Has anyone had a look at the latest ARM 64-bit ISA? At least twice as good as x64, I would say.
Meh, ARM has way too much deadweight for truly lightweight cores ala Cell SPUs ... and once you hang some massive width SIMD unit off the side to compensate the differences are rather academic.
 
I don't think the ISA has much to do with the register file, as long as both take the same approach through the decode unit.

Anyway, decode could be an area where ARM has a strength.
 
Meh, ARM has way too much deadweight for truly lightweight cores ala Cell SPUs ... and once you hang some massive width SIMD unit off the side to compensate the differences are rather academic.

Yes indeed, Cell is a dead end.
 
I don't think the ISA has much to do with the register file, as long as both take the same approach through the decode unit.

Anyway, decode could be an area where ARM has a strength.

Don't you have to encode an instruction's register numbers somewhere in a 32-bit word?
 
Yes indeed, Cell is a dead end.
If you think narrow (i.e. 4-way SIMD or, better, 4-way VLIW) cores are a dead end, then what use is there for ARM? A slight improvement in area efficiency in what is only a small part of the core doesn't do you a lot of good ... and that is what ARM and x86 are in a Larrabee-type architecture.
 
Having more than 16 architectural registers is a waste of encoding space. The only reason to desire more would be to perform unrolling for latency hiding... But there's a more efficient way to achieve that, and it's likely going to be supported by AVX: It can be extended to support 1024-bit operations, which can be executed on 256-bit units in four cycles. This only takes one more bit of encoding space, but offers four times more register space for implicit 'unrolling'.
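To put rough numbers on that argument, here is a minimal Python sketch. It assumes 16 architectural vector registers, 32-bit lanes, and a hypothetical 1024-bit operation that is simply cut into 256-bit passes. The function names are made up for illustration and do not model any real AVX implementation.

```python
def register_file_bytes(num_regs, vector_bits):
    """Total architectural vector register storage, in bytes."""
    return num_regs * vector_bits // 8


def passes_needed(vector_bits, unit_bits):
    """Passes (cycles, in this toy model) to run one logical operation
    through a narrower execution unit; ceiling division."""
    return -(-vector_bits // unit_bits)


def wide_add(a, b, lanes_per_pass):
    """Element-wise add of two wide vectors, executed one unit-width
    slice at a time. Returns the result and the number of passes."""
    out = []
    passes = 0
    for start in range(0, len(a), lanes_per_pass):
        stop = start + lanes_per_pass
        out.extend(x + y for x, y in zip(a[start:stop], b[start:stop]))
        passes += 1
    return out, passes


# 16 registers: 256-bit vs. hypothetical 1024-bit width
small = register_file_bytes(16, 256)    # 512 bytes
large = register_file_bytes(16, 1024)   # 2048 bytes, four times as much

# 1024 bits of 32-bit lanes = 32 lanes; a 256-bit unit handles 8 per pass
result, passes = wide_add(list(range(32)), [1] * 32, lanes_per_pass=8)
```

In this model the register count, and so the register field width in the encoding, stays the same while the storage available for implicit unrolling grows fourfold.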
 
Having more than 16 architectural registers is a waste of encoding space. The only reason to desire more would be to perform unrolling for latency hiding... But there's a more efficient way to achieve that, and it's likely going to be supported by AVX: It can be extended to support 1024-bit operations, which can be executed on 256-bit units in four cycles. This only takes one more bit of encoding space, but offers four times more register space for implicit 'unrolling'.

You are smart, Nick, but don't think you are smarter than the ARM engineers. This new 64-bit ISA is exactly what I hoped for, and I have quite a bit of experience in ARM coding.
There are 32 scalar registers and 32 SIMD registers, and even FMA with 4 registers in one fixed 32-bit instruction...
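The encoding cost under discussion is simple arithmetic, sketched below in Python. It assumes each register operand gets its own fixed-width field in a 32-bit instruction word. The helper names are hypothetical.

```python
def register_field_bits(num_regs, operands):
    """Bits spent on register specifiers when every operand gets its own
    field wide enough to name any of num_regs registers."""
    bits_per_operand = (num_regs - 1).bit_length()
    return operands * bits_per_operand


def bits_left_for_opcode(num_regs, operands, word_bits=32):
    """Bits remaining in the instruction word for opcode, type, and
    immediate fields after the register specifiers."""
    return word_bits - register_field_bits(num_regs, operands)


# Four-operand FMA (destination plus three sources)
fma_32_regs = bits_left_for_opcode(32, 4)  # 4 * 5 = 20 bits of registers, 12 left
fma_16_regs = bits_left_for_opcode(16, 4)  # 4 * 4 = 16 bits of registers, 16 left
```

Going from 16 to 32 registers costs one extra bit per operand, so a four-operand instruction gives up four bits of opcode space but still fits in a fixed 32-bit word.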
 
Could anyone help me interpret the following: http://www.indeed.com/r/b1b4922f298d8c33

"Graphics and Media cluster's micro-architecture validation for iLRB (Larabee-3 slice for Haswell Client product and Discrete Larabee-3)"

It makes no sense to me that Haswell would get a Larrabee-based IGP, considering the gather and FMA support in AVX2 and its future extensibility to AVX-1024, which would lower the power consumption.
 
Interesting. That schedule suggests a lot of confidence in LRB3.

I am hoping that Intel has added low level hooks for binning, rasterization and other hw and opened them up to compute.
 
I got news for everyone. ARM has decoders, and they have ugly variable length instructions.

It's obviously cleaner than x86, but it's all a matter of degrees.

DK
 
Outside of old-timey VLIW architectures with instructions that were the control signals for the ALUs, some amount of decode hardware is necessary.

The iLRB3 entry in that profile is interesting. I have not seen much detail on what is planned for Haswell's graphics slice. The terminology for the various graphical grades is similar to Sandy Bridge and Ivy Bridge, but would not be indicative of what hardware falls beneath the labels.
Earlier rumors hinted that there was some back and forth between initiatives based on GMA or LRB for what would go on-die, and iLRB3 may have been the contender from the LRB side.

One possibly tempting reason to include it would be if LRB3 is aligned with Haswell's new instructions, or is part of the alignment process.
It might also lead to questions on where the most aggressive gather implementation would be on the Haswell die, if LRB found its way there.
 
I told you so.
Told me what, precisely? Hats off if you got it right but let's not sell the skin before the bear is caught. Larrabee got cancelled and the plans to integrate it into Haswell may have gone down with it. Or it could just be the codename for the next evolutionary step in the GMA product line, without x86 compatibility.

Even if it's truly Larrabee, possibly with a hardware rasterizer and such, it could still be an intermediate step toward homogeneous computing. Then I would go "I told you so" several years later... ;-)
 
One possibly tempting reason to include it would be if LRB3 is aligned with Haswell's new instructions, or is part of the alignment process.
AVX2 isn't power efficient enough yet. We need AVX-1024 for that. But yes it could be part of a longer term convergence.
It might also lead to questions on where the most aggressive gather implementation would be on the Haswell die, if LRB found its way there.
According to Tom Piazza, the theoretical scatter/gather performance of Ivy Bridge's IGP is 32 times higher than Sandy Bridge's. He also reveals that in practice it's lower due to bank conflicts.

AVX2 doesn't support scatter, only gather. A one cacheline per clock implementation makes the most sense. It doesn't require any changes to the cache itself and doesn't complicate coherency.
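A toy cost model for the one-cacheline-per-clock gather might look like the Python sketch below. It assumes 64-byte cache lines, 4-byte elements, and that every distinct line touched costs exactly one cycle. It ignores misses, address generation, and everything else a real implementation has to deal with, and `gather_cycles` is a made-up name.

```python
def gather_cycles(base, indices, elem_bytes=4, line_bytes=64):
    """Cycles for a gather in a model where each distinct cache line
    touched by the element addresses is served in one clock."""
    lines = {(base + i * elem_bytes) // line_bytes for i in indices}
    return len(lines)


# Eight contiguous 4-byte elements sit in a single 64-byte line
contiguous = gather_cycles(0, range(8))

# A stride of 16 elements (64 bytes) puts every element in its own line
strided = gather_cycles(0, [i * 16 for i in range(8)])

# Pairs of neighbours spread over four lines
paired = gather_cycles(0, [0, 1, 16, 17, 32, 33, 48, 49])
```

Under these assumptions the cost tracks locality: a gather with coherent indices approaches the speed of a plain vector load, while a fully scattered one degrades to one element per clock.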
 