Larrabee delayed to 2011?

I don't think they can. 1024 bits divided between a maximum of 16 cores would leave only 64 bits per core per cycle. That's only 2 Zs per core, so 1/8th of the full vector width. Maybe the next incarnation of Larrabee will be different, but this was what was discussed previously.

Edit: I think I see what you mean now. You're implying that each vector unit can use its local memory to write out 16-wide vectors per clock. That would be 256 Zs per clock, assuming nothing else interfered.
Yes, each core has a local subset of L2 cache memory, which I assume it can access at full speed. The 1024 bit ring bus connects the L2 subsets, texture samplers, and memory interface.
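
To spell out the arithmetic behind both figures (the core count, bus width, and vector width here are just the numbers assumed in this thread, not confirmed specs), a quick sketch:

```c
#include <stdio.h>

int main(void)
{
    /* Figures quoted in this thread, not official specs. */
    const int ring_bits    = 1024; /* ring bus width      */
    const int cores        = 16;   /* assumed core count  */
    const int z_bits       = 32;   /* one Z value         */
    const int vector_lanes = 16;   /* LRB vector width    */

    /* Sharing the ring evenly: 1024 / 16 = 64 bits, i.e. 2 Zs per core per clock. */
    printf("ring: %d Zs per core per clock\n", (ring_bits / cores) / z_bits);

    /* Each core writing a full vector into its local L2 subset instead:
       16 Zs per core, so 256 Zs per clock across 16 cores. */
    printf("local L2: %d Zs per clock total\n", vector_lanes * cores);
    return 0;
}
```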

I agree that's not phenomenal though. R600 had a 1024 bit ring bus as well and apparently that didn't turn out to be a success...
 
I mean, you start with C code or something equivalent, it gets compiled to x86 code, it gets decompiled into the original C operations, you perform whatever optimization applies to your processor, and output RISC code.

It's not so easy. You compile your C code, which has nothing to do with all of these instruction flags. You decompile your assembly to C, and all those flag changes are now in your new C code. Then you spend a lot of time figuring out which ones are useless. And all of this happens while you are spending CPU cycles that could otherwise have been spent executing code.
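
To make the "figure out which ones are useless" step concrete: a translator typically runs a backward liveness pass over the condition-code writes, so a flag computation can only be dropped once it is proven that nothing reads it before it is overwritten. A minimal sketch of that idea (the instruction record and field names are made up for illustration):

```c
#include <stdint.h>

/* Hypothetical decoded-instruction record: which flag bits it writes and
 * which it reads. A flag write is dead if nothing reads that flag before
 * the next instruction that overwrites it. */
typedef struct {
    uint32_t flags_written;
    uint32_t flags_read;
    int      flags_are_dead;   /* filled in by the pass below */
} insn_t;

static void mark_dead_flag_writes(insn_t *code, int n)
{
    uint32_t live = ~0u;               /* be conservative at block exit */
    for (int i = n - 1; i >= 0; i--) {
        /* The write is useless if none of the flag bits it produces are live. */
        code[i].flags_are_dead = (code[i].flags_written & live) == 0;
        /* Above this point, flags it writes are no longer live; flags it reads are. */
        live &= ~code[i].flags_written;
        live |=  code[i].flags_read;
    }
}
```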

IMHO, for a real ISA comparison, we should compare perf/W/mm of CPUs running code in a non-JIT language like Python, Ruby, or Lua to cancel out the effects of compiler/ISA-specific optimizations.
 
But why? If RISC is so superior, why not translate it into a sequence of superior RISC instructions?
A minor nitpick is that Transmeta's design was VLIW.

It is also a bit disingenuous to state that emulating a different architecture in software does not impose a penalty.
x86 has more numerous addressing modes, its page table handling is emulated, its full condition code behavior is supposed to be emulated, and the floating point pipeline is wider to support 80-bit registers. Or do you suppose an ideal RISC could somehow emulate 80-bit FP with 64-bit registers without penalty?
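
As a rough illustration of that last point: extending precision in software is a well-understood trick (the error-free transformations used in double-double arithmetic, for example), but even its cheapest building block turns one hardware add into several dependent operations. This is not how x87's 80-bit format would actually be emulated, just a sketch of the kind of overhead software-extended precision carries:

```c
/* Knuth's TwoSum: computes a + b as a rounded sum plus an exact error term,
 * the basic building block of software extended precision. One native add
 * becomes six dependent floating-point operations. */
static void two_sum(double a, double b, double *s, double *err)
{
    double sum = a + b;
    double bb  = sum - a;
    *err = (a - (sum - bb)) + (b - bb);
    *s   = sum;
}
```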

These are things where tweaks to the architecture to help give the software a leg up can be a significant win.

Atom is also a significantly smaller chip with reduced power consumption.
Efficeon topped out at 7 watts and was about 68 mm², with 1 MiB of L2 cache, an on-die memory controller, and an HT link.
Atom is 25 mm² with half the L2, no memory controller, and a 3 watt TDP, and it sports 2 threads.

Efficeon was at 90nm.
I suppose we could imagine what could happen if it were on a process 4 times as dense.
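
Taking those figures at face value (and ignoring that Atom drops half the L2 and the memory controller while adding SMT), a straight area scaling would be:

$$ \frac{68\ \mathrm{mm}^2}{4} \approx 17\ \mathrm{mm}^2 \quad \text{vs. Atom's } 25\ \mathrm{mm}^2 $$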

I'm not sure if this is a feather in Transmeta's cap or an unfortunate reality of Atom.

That's indeed likely. The transistor density difference might be due to the dynamic logic versus CMOS. Either way it's hard to conclude that the Pentium was inferior due to x86.
Pentium was of an earlier age, and an earlier Alpha beat it by 1/3 in size, with better integer and FP performance.
 
Really? Tell me that when you see Atom beating ARM at perf/mm/W. Come to think of it, we'll see how much the ISA matters when the ARM Cortex-A9 faces off with Intel's Atom next year.
I'm curious about that. Do you have some extensive benchmark results comparing Atom and an ARM processor? I'm not implying that Atom would beat ARM at any metric, but I wonder how big the difference really is, and how that would impact something like Larrabee.
LRB's readiness for the HPC market is a matter of speculation at the moment.
Sure, but at least Intel didn't blow off that launch (yet) so it must (still) be confident that the hardware is competitive. Even if it doesn't make a big impression, it seems to imply that the hardware is less off-course than the software that would turn it into a GPU.
x86 might make sense for Intel from a time-to-market point of view, but it is clearly a luxury for developers in the near term, and it is not at all obvious whether LRB can afford it. Also, full cache coherency in hardware is of limited utility when you have O(100) hardware threads. Shared mutable state is the bane of parallel computing, so why should there be support for it in hardware?
The fact that it supports full cache coherence is not an invitation to go and trash the caches of neighboring cores. It's the software's responsibility to keep a high locality of reference. However, coherency is there to make the developer's life easier for when data does have to implicitly cross from one core to another, or when threads occasionally migrate to another core. It also again means that it's straightforward to get a prototype running, and only later worry about performance, where it matters.
 
On the sw development (by external devs) side, most of the code you speak of will need a recompile. And you never know when simple recompiles make porting tricky. My knowledge of CISC vs RISC wars is limited, but if recompiles were so simple, I imagine RISC vendors would not have had so much of a lack-of-sw problem.

Recompiles are easy. Always have been. And that has never been the issue. The issue has always been supporting the software, redoing all the validation, etc.

Let's put it this way: 64-bit Windows was not developed on x86, yet MS never released a 64-bit version of Windows for the platform it was developed on. This was not because of any compile or development issue. It was simply all the additional costs and obligations that would have occurred if they had actually shipped on the platform it was developed on.
 
Why? There are x86 decompilers that reconstruct the C code perfectly (except formatting of course). They had every bit of opportunity to run this faster on a VLIW processor.

IIRC Transmeta did not cache recompiled binaries (I'm having a hard time seeing how they would do this without OS support). So they don't have the luxury of taking 10 minutes to recompile (say) a firefox binary. I also have a hard time believing that there are x86 decompilers that reproduce C code perfectly (including type and associated aliasing info). I'd be happy to be proved wrong though.
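
On the caching point: a translation cache is conceptually just a map from guest (x86) code addresses to already-translated native code, and without OS or disk support that map only lives as long as the process that owns it. A minimal in-memory sketch, with made-up names:

```c
#include <stdint.h>
#include <stddef.h>

#define TCACHE_BUCKETS 4096

/* One translated block: guest (x86) entry address -> host (native) code. */
typedef struct tblock {
    uint64_t       guest_pc;
    void          *host_code;
    struct tblock *next;
} tblock_t;

static tblock_t *tcache[TCACHE_BUCKETS];

static void *tcache_lookup(uint64_t guest_pc)
{
    tblock_t *b = tcache[guest_pc % TCACHE_BUCKETS];
    for (; b != NULL; b = b->next)
        if (b->guest_pc == guest_pc)
            return b->host_code;
    return NULL;   /* miss: fall back to the (slow) translator */
}
```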

You mean to perform additional optimizations not performed on the x86 code? I'm sure Transmeta was able to run unoptimized x86 code faster than an x86 processor. Big deal. The real challenge is to run optimized x86 code faster. When that fails, let's blame the recompiler for not optimizing things the x86 compiler didn't spot either?

Not additional optimizations, just different optimization decisions. I'm thinking things like making different inlining decisions based on better knowledge of the code size/run-time of small functions and available register space. It's also not at all clear to me that things like loop unrolling decisions would be identical between architectures. Also, replacing trig functions and other compiler built-ins with instruction sequences specifically optimized for the target ISA might provide some benefit...

It makes no real difference. With or without source code you could rewrite x86 assembly for any other ISA and the number of operations (not instructions) and their dependencies remain the same.

The FX!32 emulator (runs x86 binaries on Alpha) provides a data point here, see: http://en.wikipedia.org/wiki/FX!32

They claim 40-50% of native x86 performance, with 70% possible using improved optimizations. Unfortunately, the wikipedia article does not give details on the HW used for the performance comparisons. There's a link to http://www.computer.org/portal/web/csdl/abs/mags/mi/1998/02/m2056abs.htm which probably provides more details, but I can't find a free copy online. But given the time-frame (1997-1998), I would think the Alpha HW used would compare favorably with the x86 HW of the day, which points to a pretty large penalty for this particular binary translator.

Now, it's true that Transmeta had the advantage of being able to design an ISA that lent itself to efficient x86 emulation and of intervening years of research, so the penalty they suffer should be significantly less. Still, it's not obvious to me at all that comparing ISAs using native vs. translated binaries is a valid thing to do.
 
I really don't understand your logic about ISA acceptance - ISAs today are mostly judged by how well they play with their compilers (from the coders' perspective) and how well they meet their price/performance/wattage envelopes (from the EE/business perspectives) - not by their proximity to their 40th anniversaries. Look around you - chances are you'll find more devices hosting 'young' ISAs than ones hosting x86 or older. So what reception, and by whom, are you concerned about?

Doubtful. The two most popular ISAs in the world were both designed/developed in the early 80s. Collectively they make up more than 50% of all processors sold each year.
 
Doubtful. The two most popular ISAs in the world were both designed/developed in the early 80s. Collectively they make up more than 50% of all processors sold each year.
Amazing what a paramount difference a decade (mid-70s to mid-80s) can do for CPU designs, eh?

Also, I would not call ARM as found today (mainly in the forms of v4 & v5, less so in v6 - the ISA iterations that power the consumer-segment telecom industries (cell phones, home routers, etc)) a superset of v2 - in contrast to Intel, ARM actually deprecate elements of their ISA. They also add totally optional extensions (in their reserved 'extension spaces' - I wonder when Intel will invent that), which makes the ARM ISA a living, mutable beast.
 
I'm curious about that. Do you have some extensive benchmark results comparing Atom and an ARM processor? I'm not implying that Atom would beat ARM at any metric, but I wonder how big the difference really is, and how that would impact something like Larrabee.

Like I said, we'll know the results next year.

The fact that it supports full cache coherence is not an invitation to go and trash the caches of neighboring cores. It's the software's responsibility to keep a high locality of reference. However, coherency is there to make the developer's life easier for when data does have to implicitly cross from one core to another, or when threads occasionally migrate to another core. It also again means that it's straightforward to get a prototype running, and only later worry about performance, where it matters.
Fine. But the incremental costs involved in changing your code to stick to 192 KB of data per core are going to be large as well, even if you have the benefit of getting the full code running at horrible speeds at the beginning. (It'll probably end up as a full rewrite.)

But yes, running old code right away helps.
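
For what it's worth, the restructuring in question is mostly classic cache blocking: walk the data in tiles sized for the local L2 subset instead of streaming over the whole working set. A generic sketch, using the 192 KB per-core figure mentioned above as the assumed budget:

```c
#include <stddef.h>

/* Process a large array in tiles that fit comfortably in one core's local
 * L2 subset (budget taken from the discussion above, split between input
 * and output). */
#define L2_BUDGET_BYTES (192 * 1024)
#define TILE_ELEMS      (L2_BUDGET_BYTES / (2 * sizeof(float)))

static void process_tile(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;   /* stand-in for real per-element work */
}

static void process_blocked(const float *in, float *out, size_t n)
{
    for (size_t base = 0; base < n; base += TILE_ELEMS) {
        size_t len = (n - base < TILE_ELEMS) ? (n - base) : TILE_ELEMS;
        process_tile(in + base, out + base, len);
    }
}
```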
 
Not additional optimizations, just different optimization decisions. I'm thinking things like making different inlining decisions based on better knowledge of the code size/run-time of small functions and available register space. It's also not at all clear to me that things like loop unrolling decisions would be identical between architectures. Also, replacing trig functions and other compiler built-ins with instruction sequences specifically optimized for the target ISA might provide some benefit...

I worked with 164 and 264 systems in the late '90s and FX!32 performance was very comparable to a high-end Pentium machine of the era. The high water mark was probably the 164 machines in 1997 because the competition was not as significant before the PII/P3 and K7. I really liked the Alpha platform and I acquired a large suite of native tools from the resources of the time (chiefly Aaron's AlphaNT Archive).
 
Amazing what a paramount difference a decade (mid-70s to mid-80s) can do for CPU designs, eh?

Not really.

Also, I would not call ARM as found today (mainly in the forms of v4 & v5, less so in v6 - the ISA iterations that power the consumer-segment telecom industries (cell phones, home routers, etc)) a superset of v2 - in contrast to Intel, ARM actually deprecate elements of their ISA. They also add totally optional extensions (in their reserved 'extension spaces' - I wonder when Intel will invent that), which makes the ARM ISA a living, mutable beast.

You mean like SSE/MMX/ETC? You obviously have no idea what you are talking about. Please do some research.
 
With 3dmark2001 I'm getting 8 Mtris/s
I'm getting 24.3 Mtris/s here. During this test only 20% of time is spent in primitive setup and rasterization. Scatter/gather support and other LRBni features would also help tremendously.
 
I'm getting 24.3 Mtris/s here. During this test only 20% of time is spent in primitive setup and rasterization. Scatter/gather support and other LRBni features would also help tremendously.

You must be using another version of your renderer. Anyway, having thought in more detail about rasterizing small triangles in software, it might be possible to do it quite fast after all.

Organizing the color and Z buffers as 2x2 pixel quads, and calculating 2x2 coverage masks pixel-planes style, should make it quite fast even with SSE.
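
Something along these lines, as a rough sketch of the pixel-planes idea with plain SSE (the per-edge A, B, C coefficients are assumed to come from triangle setup, and the pixel layout is my own choice):

```c
#include <xmmintrin.h>   /* SSE */

/* Evaluate one triangle edge E(x,y) = A*x + B*y + C at the four pixels of a
 * 2x2 quad anchored at (x, y). Pixel order: (x,y), (x+1,y), (x,y+1), (x+1,y+1).
 * Returns a 4-bit mask of pixels on the inside of the edge. */
static int edge_mask_2x2(float A, float B, float C, float x, float y)
{
    __m128 px = _mm_setr_ps(x, x + 1.0f, x,        x + 1.0f);
    __m128 py = _mm_setr_ps(y, y,        y + 1.0f, y + 1.0f);
    __m128 e  = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(A), px),
                                      _mm_mul_ps(_mm_set1_ps(B), py)),
                           _mm_set1_ps(C));
    /* One bit per lane: inside when E >= 0. */
    return _mm_movemask_ps(_mm_cmpge_ps(e, _mm_setzero_ps()));
}

/* Coverage for the quad is the AND of the three edge masks. */
static int coverage_mask_2x2(const float A[3], const float B[3],
                             const float C[3], float x, float y)
{
    return edge_mask_2x2(A[0], B[0], C[0], x, y)
         & edge_mask_2x2(A[1], B[1], C[1], x, y)
         & edge_mask_2x2(A[2], B[2], C[2], x, y);
}
```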
 
You mean like SSE/MMX/ETC? You obviously have no idea what you are talking about. Please do some research.
Easy there, cowboy. The idea I was trying to convey was that of the extensions being optional* (though I did word it poorly, perhaps with the wrong brackets, but you still don't have to assume the other party are idiots who haven't seen the opcode space of the dominant desktop ISA). The point is, ARM's extension spaces are actually optional - and not just for architecture licensees, but for ARM themselves - you could have a contemporary ARM design with arbitrary extensions omitted, based on the targeted market**.

Intel's x86 extensions, just like their whole holy cow of an ISA, are an example of the opposite - deep entrenchment. IA32's whole philosophy is pretty much that of the tarball - everything it touches gets stuck on it indefinitely (because, gasp, once an opcode gets into Windows there's no turning back!). Of course, feel free to disprove that and quote the number of Intel x86 designs released after the P3 era (or even at the era's end, if you like) that don't feature SSE, for instance. Once you do that, you can repeat the same with the number of Intel's x86 opcodes that have been deprecated (by Intel themselves, of course) for a stronger impact. And you surely must be aware how this always-accumulate-never-reduce approach affects the decoding logic (how many escape codes are Intel up to these days in x86?).

Heck, with all the critique people (me included) have been giving LRB, it's Intel's strongest attempt to do something off the beaten path with x86. Unfortunately, that has not amounted to anything yet.


* For instance, ARM have a reserved space for coprocessors (FP, etc.), but that does not guarantee there's an actual implementation sitting in a given design. Moreover, the reserved space itself does not bother to strictly define the op set - it just broadly categorizes the opcode layout there (e.g. MCR, MRC, processing), and stops at that. The strict, down-to-individual-opcode definition is left to the actual implementations of said reserved space. Please note the subtle difference between a reserved space and an implementation of said space.

** And even though it's not impossible that some extensions may go into the core specs with time, that does not mean a lot in ARM land, where 'reduced core spec' designs are not uncommon.
 
A fair number of x86 instructions were deprecated with x86-64, including all the x87 instructions. SSE was promoted to required instructions.

Deprecation can occur in x86, but in general it's not wanted because backward compatibility is x86's big, big advantage. It's an old, extended instruction set, and that is why it's still here. Why would they bother implementing things like real mode and the flat real mode hack in modern CPUs if backward compatibility weren't considered a major, major factor in why people buy x86 CPUs? Mostly, deprecation in the instruction set becomes guidelines of what not to do in order to maintain good performance.
 
Perhaps the vendors have reserved the right to deprecate the x87 instructions from the hardware at some point, but current statements are that it is still present, and that it is more a question of software and OS support. A compiler flag can be set and the code recompiled to use x87 in 64-bit mode.
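
As one concrete example of such a flag (assuming a GCC-style toolchain; other compilers have their own equivalents): GCC's -mfpmath=387 routes scalar float/double math through x87 even in 64-bit mode, and long double already maps to the 80-bit extended format there.

```c
/* Build e.g.:  gcc -O2 -mfpmath=387 x87_demo.c
 * On GCC/x86-64, long double is the 80-bit x87 extended format. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("long double mantissa bits: %d\n", LDBL_MANT_DIG); /* 64 with x87 extended */
    printf("double mantissa bits:      %d\n", DBL_MANT_DIG);  /* 53                   */
    return 0;
}
```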

The hardware and opcodes are still devoted to it. Larrabee's early slides indicated it would still have non-SSE FP capability. That may have applied to an incarnation that predated what was recently cancelled.
 
From what people know about Larrabee's architecture, are there any limitations that would prevent it from being DirectX 11 compliant?

For some reason I was under the impression there were some limitations to Larrabee's much-touted flexibility that resulted in it not being 100% compliant.
 
I'm not aware of any such limitation.
The only part of the architecture that really can be pinned as being compliant to a particular version of DX would be the texture formats supported by the TEX blocks. The rest is handled by software that could be updated.

If there were some texture format not supported by the texture units, it could be construed that those parts are not compliant.

However, Intel would probably have had enough time to know the specification and designed the blocks to match. Barring that, any corner cases could be handled in software.
 
Thanks for the response.
There was talk of the hardware tessellator not being present, but then that too can be implemented in software and is a "simple" problem to overcome. I wonder what a hardware or fixed-function tessellator actually looks like.
Do we know enough about DX 12 and its texture formats, and whether Larrabee 1 could in fact support it too? I guess now we are at the mercy of those academics and institutes that will be able to actually play with Larrabee and develop it as a GPGPU.
 