Larrabee delayed to 2011 ?

For a sphere, "procedural" basically means intersecting a ray with x² + y² + z² = r². It's just a buzzword used to indicate that the scene has non-polygonal objects. Spheres have always been the simplest objects to ray-trace.
Surely an infinite ground plane will beat a sphere any day :)
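To make the point concrete, here is a minimal sketch of both intersections being joked about above: the sphere reduces to solving a quadratic in t, and the infinite ground plane really does beat it with a single divide. (Function names and the axis-aligned plane are my own illustration, not anything from Larrabee's actual renderer.)

```python
import math

def intersect_sphere(origin, direction, radius):
    """Nearest positive t where origin + t*direction hits x^2 + y^2 + z^2 = r^2.

    Assumes `direction` is normalized; returns None on a miss.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    # Substituting the ray into the implicit equation gives a quadratic in t:
    # t^2 + 2*(o.d)*t + (o.o - r^2) = 0   (since |d| = 1)
    b = 2.0 * (ox * dx + oy * dy + oz * dz)
    c = ox * ox + oy * oy + oz * oz - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0.0:
        return None  # ray misses the sphere
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 0.0 else None

def intersect_ground_plane(origin, direction, height=0.0):
    """Ray vs. the infinite plane y = height: one divide, no discriminant."""
    if abs(direction[1]) < 1e-12:
        return None  # ray parallel to the plane
    t = (height - origin[1]) / direction[1]
    return t if t > 0.0 else None
```

A ray from (0, 0, -5) down the +z axis hits a unit sphere at t = 4; the plane test is visibly cheaper, which is the joke.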
 
I know for a fact that they are also targeting the HPC sector with Larrabee, and in those applications people sometimes optimize the assembly code by hand. Especially with a new chip and infant compiler optimizations.
Sure, but also remember that Larrabee's vector ISA is designed to be an easy target for compilers, even more than being easy to use directly. Also, like it or not, part of the advantage of being x86 is that there are compilers that have been tuned over many years that spit out pretty ideal code for it.

Thanks for the links about x86 hardware costs - I'll dig in later today as I'm quite curious :)
 
And would you agree that in the case of Larrabee, using a RISC arch. would let them pack in more cores? That's why I believe it is even more relevant for this chip than when comparing the Core i7 vs other big RISC CPUs.

I've tried guesstimating this in other threads.
The die shots and some other information suggest that maybe 30-40% of the core+L2 area is neither L2 nor vector-related.
Without a real labeled micrograph, this is highly speculative.
RISC/x86 comparisons of cores of the P54 era put the penalty at about 1/3 extra transistors.

Let's just say it might fall to about 15% penalty for the core area.
This leaves out the third of Larrabee that isn't core at all.

The chip in general could be 10% smaller, or the 2/3 of the chip that are cores could have 15% more cores.
At 32, that's about an extra 5 cores.
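The back-of-the-envelope above can be reproduced directly; all the inputs are the post's own guesses (2/3 of the die is cores, ~15% of core area spent on x86 support), not measured figures:

```python
# Guesstimate from the post: 2/3 of the die is cores, and x86 support
# costs ~15% of each core's area. Both numbers are speculative.
total_cores = 32
core_fraction_of_die = 2.0 / 3.0
x86_area_penalty = 0.15

# Option A: keep the core count and shrink the whole chip.
chip_shrink = core_fraction_of_die * x86_area_penalty
# Option B: keep the die size and spend the saved area on more cores.
extra_cores = total_cores * x86_area_penalty

print(f"chip could be ~{chip_shrink:.0%} smaller")   # ~10%
print(f"or fit ~{extra_cores:.1f} extra cores")      # ~4.8, i.e. about 5
```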
This assumes that the x86 portion could be switched with an otherwise identical non-x86 core.

So x86 has a measurable impact, but it may be dwarfed by the choice of going for a fully featured core as a base unit.
There are some very tiny embedded cores and possibly some other smaller cores that could give similar results with even less space, but I didn't have numbers for those.

edit:
It is also quite possible that the x86 portion of Larrabee is significantly more pared down than the P54, in that it may only be about half as wide. If this is the case, the x86 penalty may be worse than assumed.
 
I haven't found anything conclusive, but is Larrabee going to support the even more horrible x87 FPU as well?
If so, then that's even more wasted die space (times the number of cores), as no one would use it in favor of SSE2. And if not, then this would ensure that you can't just run old software on those cores (which doesn't make much sense to me, but I thought this would be the strongest reason to actually use x86).

@3dilettante: thanks for the links and explanations, way better than I could've done.
 
Not so easy. If you add more cores, the uncore has to scale as well (perhaps in a non-trivial way). A 15% reduction in core size is probably not going to give you 15% more cores.
 
Slides from a while back indicated that Larrabee could perform 2 non-SSE DP FLOPs a cycle.
That would seem to indicate x87, though the slides are pretty old at this point.

SSE wouldn't be an option anyway, as it appears Larrabee does not support it.
 
Not so easy. If you add more cores, the uncore has to scale as well (perhaps in a non-trivial way). A 15% reduction in core size is probably not going to give you 15% more cores.

It would be like adding 2 cores to each 16-core ring, plus an extra one on one of them.
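As a sanity check on that arithmetic (assuming the 32 cores sit on two 16-core rings, which is my reading of the post, not a confirmed Larrabee detail):

```python
# Assumption: 32 cores arranged as two 16-core rings (my reading of the
# post above, not a confirmed implementation detail).
rings = 2
cores_per_ring = 16
base_total = rings * cores_per_ring            # 32

# Add 2 cores to each ring, plus one extra on one of them.
new_total = rings * (cores_per_ring + 2) + 1   # 37
print(new_total - base_total)                  # the ~5 extra cores
```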
Larrabee's high-level uncore architecture would appear to be the closest to being able to trivially add cores, and Intel has made a point of how easy it is to add them.
Barring some undisclosed implementation detail of the ring bus, which might add complications, the rest of the chip needn't care.

It is also the case that Larrabee's x86 section may be less capable than the version I used to calculate my initial figures, and the percentage may be higher.
 
I see your point, but I wasn't thinking about the ring. If you throw in more cores you probably want to adjust the cores<->texture units ratio. And what about memory bandwidth? :)
So it's definitely not so simple.
 
The memory bandwidth is going to go up and down in rather coarse increments, unless Intel puts in a fraction of a controller or pushes clocks to a slightly higher range.
The ratio of cores to texture blocks in the die shot appeared to be 32:8.
Add 4 cores and one texture block, and nothing changes in that regard either.
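The ratio claim checks out with trivial arithmetic, taking the 32:8 figure read off the die shot at face value:

```python
from fractions import Fraction

# Core : texture-block ratio as read off the die shot in the post above.
before = Fraction(32, 8)
# Hypothetical stretch: add 4 cores and one texture block.
after = Fraction(32 + 4, 8 + 1)
print(before, after, before == after)  # both 4:1, so the ratio is preserved
```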

It can always go in the other direction where Larrabee as a whole is 10% smaller.
 
Sure, but also remember that Larrabee's vector ISA is designed to be an easy target for compilers, even more than being easy to use directly. Also, like it or not, part of the advantage of being x86 is that there are compilers that have been tuned over many years that spit out pretty ideal code for it.
In general, the complexity of x86 makes it difficult to optimize for. So the accumulated compiler experience does not make x86 an advantage, but at best a neutralized disadvantage. A simple, sane ISA, without the tons of asymmetries and anachronisms of x86, can only be better from the compiler's point of view.

There is another thing: x86 isn't the ISA of the metal. In fact, the output of the x86 compiler is consumed by another "compiler", the hairy decode+OoOE engine on the die. And that layer is variable. Historical example: P6 was initially clock-for-clock slower than P5 under practical conditions. Compilers had to adapt to a P6-friendly x86 subset and rules to make it win. Of course, this does not apply to Larrabee, because the compiler and HW are so close to each other. But for exactly the same reason, a large part of the accumulated knowledge of fine-tuning x86 compilers does not apply either.
 
For the most part the x86 is just there for branching ... all the real code is running on the vector engine. Vector and easy target don't belong together though. It's up to the developer to structure the algorithms to run well as fibers with coherent branching paths.

I still say 5 wide VLIW would be the ideal core (with vertical and maybe simultaneous multithreading). Screw vectors, just one huge array of 5 wide VLIW cores, hierarchical caching with independent banked ports with 32 bit granularity and simple-COMA for coherency, some dedicated synchronous message passing support and a mesh network for transporting data. That would be pretty close to my ideal architecture.

No more worrying about coherent branch paths unless you need to run scalar code. Multithreaded programming is a big enough headache without vectorization.
 
A simple, sane ISA, without tons of asymetries and anachronisms of x86 can only be better, from the point of view of compiler.
Sure, but I don't think you can find too many problems with the vector ISA on Larrabee... coming from a GPGPU guy, it's easily as straightforward as any other instruction set, be it virtual, semi-virtual (x86 uop style) or real.

And agreed that at best it's "not a disadvantage", but there's still some utility to being able to run legacy applications, so there's still a question of whether or not the additional hardware complexity (compared to some "from scratch" ISA that served similar needs) is really that bad compared to the advantages.

In any case, the majority of the power is in the vector units on Larrabee, so I think that's the more interesting ISA to be looking at as far as programming and compilers goes.
 
And agreed that at best it's "not a disadvantage", but there's still some utility to being able to run legacy applications, so there's still a question of whether or not the additional hardware complexity (compared to some "from scratch" ISA that served similar needs) is really that bad compared to the advantages.

I don't get this point: as long as I have the C/C++ sources for my legacy application, I can recompile it for _any_ architecture. If it's written in assembly, then it depends on x86, but it's likely using SSE (why bother writing assembly otherwise?), which won't run anyway. Plus you surely have a C version of the part you optimised in assembly.

For me, it's still a mystery why they use x86, when they still require a full recompile to run on LRB -- I guess it's the Intel heritage, and maybe they hope to easily adopt VTune/Debuggers compared to YetAnotherISA, but the legacy argument is rather weak.
 
If your code has any MMX/SSEx intrinsics, then sure as hell it won't run on LRB even with a recompile. Unless Intel comes up with something that generates LRBni code out of those intrinsics. It is not hard; it has been done with gcc to emit AVX instructions only, for example. But Intel will have to do it.
 
Why would you necessarily need a full recompile to run stuff on LRB?

As far as I understand it, it's not possible (yet?) to upload existing binaries directly to LRB; the code has to be compiled again using the LRB C/C++ compiler. Maybe the calling convention is different, or there are some ABI issues; I don't know. I'm not sure whether they have relaxed this already, but this is my latest information on the issue. Of course, you can reasonably expect everything that you can compile on the host to run on LRB, which is not yet the case for CUDA (no virtual functions).

And well, even if you can use existing binaries, all code with SSE instructions won't run anyway, so legacy means really plain-old-x86 without SSE, and practically no x64 code for that matter (unless they provide something which translates SSE, as RPG pointed out.)
 
To me it seems that switching away from x86 only makes sense for LRB if you're 100% convinced that these cores won't some day end up running your operating system kernel. That's not where LRB is right now, of course, nor will it be for a few years yet, but in the longer term they may well end up doing so. Sticking with x86 makes that sort of transition much simpler for end-users, and hence simpler to get into the market.
 