This is why I said a few pages ago that I see x86 as a wise choice for LRB, so long as it doesn't have a meaningful impact on current-day performance.
> Larrabee is a revolutionary new design, and the overhead of x86 is negligible compared to the advantages. Fast to market, an abundance of existing tools, workload migration, extendability, etc.

At least the fast-to-market part is demonstrably false for Larrabee.
> The reason for Larrabee's delay is definitely not x86. On the contrary, any other ISA choice would take far longer to achieve competitive performance. Any theoretical performance advantage would be totally nullified by an initial lack of software optimization.

I forget how many compiler back ends have been optimized for a half-width P54 core with a strap-on 512-bit masked vector ISA.
> We're only talking about a few percent of x86 decoder overhead anyway.

No. This has been discussed already, and the penalty is much more significant, particularly if the core has no additional OoO hardware to hide the penalty.
> The reason Larrabee is delayed is because it's still a revolutionary new approach to use a fully generic device for rasterization.

I thought your work was an example of something that already does this.
> Ideally Larrabee should be programmed directly by the application developer. The potential is huge (as proven by FQuake). The problem is it will take many years to go that route.

From a commercial perspective, that is likely far from ideal, and it is not the direction the development world appears to be taking.
> We still need a lot of progress in development tools (such as explicitly concurrent programming languages, inspired by hardware description languages). Until the day this becomes as obvious and advantageous as object-oriented programming, application developers expect APIs to handle the hardest tasks.

So you're saying that until the tools exist to abstract away concurrent concepts, developers will have to settle for the tools we already have that abstract away concurrent concepts.
> Of course GPUs are also evolving toward greater programmability, and APIs are getting thinner to allow more direct access to the hardware. But Intel is attempting to skip ahead. Even though we won't see a Larrabee GPU in 2010, x86 is enabling Intel to get to the convergence point much faster than anyone else.

As mentioned before, unless you have an interest in integrating a Larrabee core into a chip with an x86 CPU socket, the x86 is of no real import.
> If lrb is meant to remain hidden behind the pci-e bus, then x86 makes no sense.
Why? I think you might be stuck in the idea that a discrete device has to be controlled by an API, so you can pick an ISA that suits the API(s).
For Larrabee, x86 is the API, and everything else is a layer on top of that. That's possible without x86 as well, but no equally generic ISA offers any substantial benefit over it. And none of them has such a massive existing software base.
Also, why can't it be meant for both a CPU socket and a PCI-E slot? It makes no sense for Intel to write different (mediocre) drivers for each generation of IGPs, HPC devices, and discrete GPUs. With x86 they can focus the effort, and the application developers will follow...
> Except that the focus is on the vpu and not the associated x86 crap.

Are you honestly arguing that it would be easy to associate the "VPU" with a non-x86 ISA on a PCI-E part and then switch to x86 with trivial software changes for a CPU-embedded part? And furthermore, is anyone really *not* interested in the integration of throughput-computing devices/cores into CPU sockets in the long term? I can tell you with certainty that all three of the big IHVs and tons of ISVs in this space are...
Well, the original Athlon 64 was a single core, and the x86 decoder used up about 10% of the die. When you shrink away the OoO bloat, that overhead is going to shoot right up. I don't know how much the VPU can amortize that fat.
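As a rough back-of-envelope sketch of that area argument (the ~10% decoder figure is the one from the post above; the other splits are invented assumptions for illustration, not actual Larrabee numbers):

```c
/* Hypothetical area split: if the decoder is ~10% of a full OoO core and the
 * shrink removes most of the OoO machinery, the decoder's share of the
 * remaining scalar logic grows, and only a big VPU dilutes it again. */
#include <stdio.h>

int main(void) {
    double core_area   = 1.00; /* normalized single-core die area             */
    double decoder     = 0.10; /* ~10% of die, per the Athlon 64 figure above */
    double ooo_removed = 0.50; /* assumed share of area that is OoO bloat     */
    double vpu         = 0.60; /* assumed area of a wide 512-bit VPU          */

    double scalar_left = core_area - ooo_removed;

    printf("decoder share of the slimmed-down scalar core: %.0f%%\n",
           100.0 * decoder / scalar_left);
    printf("decoder share once the wide VPU is attached:  %.0f%%\n",
           100.0 * decoder / (scalar_left + vpu));
    return 0;
}
```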
Umm, wasn't the PPro the first x86 to use a decoder/internal ISA?
-Charlie
If lrb is meant to go on a cpu socket, then x86 is the only ISA that makes sense.
If lrb is meant to remain hidden behind the pci-e bus, then x86 makes no sense.
> Optimizations are ISA-specific. I guess some people don't think about extracting the most performance from their code though.

If you have millions of lines of code, do you expect people to go through and hand-tune everything? Of course not, you rely on your compiler to get most of the performance. Things that are used frequently may get hand-tuned in assembly, but it's pretty rare and a small portion of the total code.
A large body of optimizations is implementation-specific, and Larrabee is much more anemic on the x86 side than even the originating P54 core.
Which ISA-specific optimizations are particularly relevant to a single-issue x86 running a workload that should be dominated by vector throughput? It's not as though any classic compiler knows anything about the vector component needed for dual-issue anyway.
Just how bad would a compiler have to be to fail to get decent utilization out of a single-issue chip in a short time period, and which other ISA doesn't already have a couple dozen compiler back ends targeting single-issue variants?
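To make that concrete, here is a minimal sketch of the kind of loop such a workload boils down to. It is not Larrabee code: AVX-512 intrinsics stand in for LRBni's 512-bit masked vectors, and the function and array names are made up for illustration. The hot path lives almost entirely in the vector extension; the scalar x86 side is just loop plumbing, so classic scalar-ISA optimizations barely matter.

```c
/* Sketch only: AVX-512 used as a stand-in for a 512-bit masked vector ISA. */
#include <immintrin.h>
#include <stddef.h>

/* out[i] = min(a[i] + b[i], limit), processing 16 floats per iteration. */
void saturating_add(const float *a, const float *b, float *out,
                    size_t n, float limit)
{
    __m512 vlimit = _mm512_set1_ps(limit);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 sum = _mm512_add_ps(_mm512_loadu_ps(a + i),
                                   _mm512_loadu_ps(b + i));
        /* Per-lane mask of elements over the limit, then clamp those lanes. */
        __mmask16 over = _mm512_cmp_ps_mask(sum, vlimit, _CMP_GT_OQ);
        sum = _mm512_mask_mov_ps(sum, over, vlimit);
        _mm512_storeu_ps(out + i, sum);
    }
    for (; i < n; ++i) {   /* scalar tail handled by the plain x86 side */
        float s = a[i] + b[i];
        out[i] = s > limit ? limit : s;
    }
}
```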
> Binary compatibility to run sw outside of the niche it was written for, is over-rated.

Compatibility of legacy software aside, there's no need for heterogeneous processors on the same die/socket to use a different ISA. Most of the people I've spoken to consider a more unified ISA to be the end goal here, whether it be x86 or something else entirely, and consider the current state of having to target a pile of different ISAs far from ideal. Even with JIT (which is great of course) it's still a problem and definitely non-ideal. Sure you can make do on Cell-like models and such, but there's no question that it's harder and less flexible.
NexGen's Nx586 is probably the first commercial x86 CPU to do so (it was released about a year earlier than the Pentium Pro).
Depends on what the software dev wants to do with their app. Isn't that the whole point of Larrabee? Programmability above all else (with decent performance). I agree that the easiest way to get any sort of usefulness out of Larrabee would be to target its vector extensions rather than to attempt to write a 3d engine in x86 from the ground-up, but the option is there.