Larrabee delayed to 2011?

We'll see with future LRB variants whether x86 hw decoding was a wise decision after all.
 
This is why I said a few pages ago that I see x86 as a wise choice for LRB, so long as it doesn't have a meaningful impact on current-day performance.

Well, the original Athlon 64 was a single core, and the x86 decoder used up 10% of the die. When you shrink down the OoO bloat, that overhead is going to shoot right up. Dunno how much the vpu can amortize that fat.

Assuming LRBn (the thing that was canned) was a 600 mm2 chip with 32 cores, it comes to ~18 mm2 per core. The vpu and its registers consume 1/3 of the area, so you have ~12 mm2 per core of x86/<your favorite crib> overhead there.
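
In script form, for anyone who wants to tweak the assumptions (nothing here is an official figure):

[code]
# Rough per-core area split for the hypothetical 600 mm^2, 32-core LRBn.
# All numbers are assumptions from the post above, not official figures.
die_area_mm2 = 600.0
cores = 32
vpu_fraction = 1.0 / 3.0                      # vpu + its registers

area_per_core = die_area_mm2 / cores          # ~18.8 mm^2
vpu_area      = area_per_core * vpu_fraction  # ~6.3 mm^2
other_area    = area_per_core - vpu_area      # ~12.5 mm^2 of x86/other overhead per core

print(f"per core: {area_per_core:.1f} mm^2, "
      f"vpu: {vpu_area:.1f} mm^2, rest: {other_area:.1f} mm^2")
[/code]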

To be fair, gpus have instruction schedulers which take up area as well.

http://forum.beyond3d.com/showpost.php?p=1342384&postcount=15

Let's make some rough calculations to see if lrb has perf/mm2 numbers in the same ballpark as fermi, its nearest (and rather close) competitor, programmability-wise.

Fermi area = (3 B transistors) / (2.1 B transistors) * (338 mm2 for cypress) = ~482 mm2 (extrapolating from cypress).

At 1.7 GHz, fermi has 1.7 (clock) * 512 (alus) * 0.5 (half-rate dp) * 2 (dp fma) = ~870 Gflops.

That is ~1.8 Gflop/mm2.

At 2 GHz, lrb has 2 (clock) * (32 * 16 alus) * 0.5 (half-rate dp) * 2 (fma) = 1024 Gflops.

That is ~1.7 Gflop/mm2.
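
Same math as a quick script, in case anyone wants to plug in different clocks or areas (every input is one of the assumptions above, not a measured number):

[code]
# Back-of-envelope DP Gflop/mm^2 comparison; all inputs are the post's assumptions.
def dp_gflops(clock_ghz, alus, dp_rate=0.5, flops_per_fma=2):
    return clock_ghz * alus * dp_rate * flops_per_fma

fermi_area = 3.0 / 2.1 * 338           # ~482 mm^2, scaled from cypress by transistor count
fermi_dp   = dp_gflops(1.7, 512)       # ~870 Gflops at an assumed 1.7 GHz hot clock
lrb_area   = 600.0                     # assumed LRBn die size
lrb_dp     = dp_gflops(2.0, 32 * 16)   # 1024 Gflops at an assumed 2 GHz

print(f"fermi: {fermi_dp / fermi_area:.1f} Gflop/mm^2")  # ~1.8
print(f"lrb:   {lrb_dp / lrb_area:.1f} Gflop/mm^2")      # ~1.7
[/code]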

The numbers look awfully close, but don't forget that LRB is maxed out. LRB uses its alus for virtually everything, while gpus have dedicated hw (albeit ff) to give them a leg up. And gpus have tons of ff hw to shed. In the fermi die shot, it is clear that >33% of the non-pad area is taken up by things that are not alus. Some of it is L2 cache, but 768 KB on 40 nm is going to be tiny compared to that giant bloat in the middle.

So at present, it is likely the overhead of doing everything in sw is holding it back, as it is roughly as efficient on paper. Going forward, losing the ff hw will bring out the x86 overhead much more starkly. If it can take cover in the cpu socket fast enough, then maybe it won't matter.
 
Larrabee is a revolutionary new design, and the overhead of x86 is negligible compared to the advantages: fast to market, an abundance of existing tools, workload migration, extensibility, etc.
At least the fast to market part is demonstrably false for Larrabee.

The reason for Larrabee's delay is definitely not x86. On the contrary, any other ISA choice would take far longer to achieve competitive performance. Any theoretical performance advantage would be totally nullified by an initial lack of software optimization.
I forget how many compiler back ends have been optimized for a half-width P54 core with a strap-on 512-bit masked vector ISA.
The x86 stuff that would have a tool base is vastly simpler to optimize for because it has so little in peak capability. The part that has massive peak capabilities is utterly alien to x86 tools.

We're only talking about a few percent of x86 decoder overhead anyway.
No. This has been discussed already, and the penalty is much more significant, particularly if the core has no additional OoO hardware to hide the penalty.
Contemporaneous RISC cores to the original Pentium tended to have at least a 1/3 die size and power advantage for the performance offered. There is a lot of stuff associated with x86 that contributes to bloat throughout the pipeline.

The reason Larrabee is delayed is that it's still a revolutionary new approach to use a fully generic device for rasterization.
I thought your work was an example of something that already does this.

Ideally Larrabee should be programmed directly by the application developer. The potential is huge (as proven by FQuake). The problem is it will take many years to go that route.
From a commercial perspective, this is likely far from ideal and not the apparent direction the development world is taking.

We still need a lot of progress in development tools (such as explicitly concurrent programming languages, inspired by hardware description languages). Until the day this becomes as obvious and advantageous as object-oriented programming, application developers will expect APIs to handle the hardest tasks.
So you're saying that until the tools exist to abstract away concurrent concepts, the devs will have to settle on using tools that abstract away concurrent concepts that we already have.

Of course GPUs are also evolving toward greater programmability, and APIs are getting thinner to allow more direct access to the hardware. But Intel is attempting to skip ahead. Even though we won't see a Larrabee GPU in 2010, x86 is enabling Intel to get to the convergence point much faster than anyone else.
As mentioned before, unless you have an interest in integrating a Larrabee core into a chip with an x86 CPU socket, the x86 is of no real import.
Any other CPU ISA could be substituted and it would change absolutely nothing, although the implementation may be ~10% smaller and possibly tens of percent cooler/faster.
 
If lrb is meant to remain hidden behind the pci-e bus, then x86 makes no sense.
Why? I think you might be stuck in the idea that a discrete device has to be controlled by an API, so you can pick an ISA that suits the API(s). For Larrabee, x86 is the API, and everything else is a layer on top of that. That's possible without x86 as well, but no equally generic ISA offers any substantial benefit over it. And none of them have such a massive existing software base.

Also, why can't it be meant for both a CPU socket and a PCI-e slot? It makes no sense for Intel to write different (mediocre) drivers for each generation of IGPs, HPC devices, and discrete GPUs. With x86 they can focus the effort, and the application developers will follow...
 
Why? I think you might be stuck in the idea that a discrete device has to be controlled by an API, so you can pick an ISA that suits the API(s).

APIs like ogl/dx/ocl are device-agnostic.

For Larrabee, x86 is the API, and everything else is a layer on top of that. That's possible without x86 as well, but no equally generic ISA offers any substantial benefit over it. And none of them have such a massive existing software base.

I wonder how much area and power an ARM Cortex-A8 would save over a Pentium.

There is very little usable software outside of Intel which needs x86 to run and fits discrete larrabee use cases.

Also, why can't it be meant for both a CPU socket and a PCI-e slot? It makes no sense for Intel to write different (mediocre) drivers for each generation of IGPs, HPC devices, and discrete GPUs. With x86 they can focus the effort, and the application developers will follow...

Except that the focus is on the vpu and not the associated x86 crap.
 
Except that the focus is on the vpu and not the associated x86 crap.
Are you honestly arguing that it would be easy to associate the "VPU" with a non-x86 ISA on a PCI-E part and then switch to x86 with trivial software changes for a CPU-embedded part?

And furthermore, is anyone really *not* interested in the integration of throughput-computing devices/cores into CPU sockets in the long term? I can tell you with certainty that all three of the big IHVs and tons of ISVs in this space are...
 
Well, the original Athlon 64 was a single core, and the x86 decoder used up 10% of the die. When you shrink down the OoO bloat, that overhead is going to shoot right up. Dunno how much the vpu can amortize that fat.

Umm, wasn't the PPro the first x86 to use a decoder/internal ISA?

-Charlie
 
The x86 obsession is quite puzzling. Existing software? If you write ANSI C/C++ and you have a conforming compiler, the code is portable to any CPU. By existing software, do people mean Microsoft software or what?
 
Optimizations are ISA-specific. I guess some people don't think about extracting the most performance from their code though.
 
Optimizations are ISA-specific. I guess some people don't think about extracting the most performance from their code though.
If you have millions of lines of code, do you expect people to go through and hand-tune everything? Of course not, you rely on your compiler to get most of the performance. Things that are used frequently may get hand-tuned in assembly, but it's pretty rare and a small portion of the total code.
 
If lrb is meant to go on a cpu socket, then x86 is the only ISA that makes sense.

If lrb is meant to remain hidden behind the pci-e bus, then x86 makes no sense.

This is the real tough nut to crack. :???: I am puzzled why Intel is selling lrb as an accelerator, when it can very well go on the cpu socket. It could be a stop-gap marketing hack, but we don't know Intel's plans for sure right now.

Are you honestly arguing that it would be easy to associate the "VPU" with a non-x86 ISA on a PCI-E part and then switch to x86 with trivial software changes for a CPU-embedded part?

To be sure, it's not trivial. But I don't see the need for x86 in the first place.

And furthermore, is anyone really *not* interested in the integration of throughput-computing devices/cores into CPU sockets in the long term? I can tell you with certainty that all three of the big IHVs and tons of ISVs in this space are...

I'd very much like integration of the two on the same die. However, it is not at all obvious that the individual cores of these throughput-optimized processors have to be powered by the x86 ISA to get off the ground.

The only reason for x86 is binary compatibility. And as I have said before, and as you yourselves will notice if you look around, there is very little usable software outside of Intel which needs x86 to run and fits larrabee use cases.

My gpu can run Word 95. OMG!!!! Such wet dreams make for nice demos but don't help in the real world, any more than the possibility of running Windows 98 + Office 97 on a mobile phone is of any utility.

FWIW, to the best of my knowledge, any code that uses MMX (or anything newer) won't run on larrabee. :LOL:

x86 is needed if the only way to write code is to compile it to metal and then ship it. That was the model ~15 (20?) years ago, when we had the RISC vs. CISC wars. The throughput software model of today is to write code in a constrained (e.g. ocl/dxcs) / implicitly functional (e.g. glsl) language, ship it in bytecode/raw string form, and JIT it at runtime. Hell, even CUDA doesn't compile to a real-machine ISA by default.
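
To make that model concrete, here's a minimal sketch assuming PyOpenCL and any OpenCL-capable device (kernel and names invented for illustration). The device code ships as a plain string and gets JIT-compiled for whatever hardware is present; nothing in it assumes x86 or any other host ISA:

[code]
import numpy as np
import pyopencl as cl

# Device code ships as a raw string; the driver JIT-compiles it at runtime
# for whatever device is actually present (GPU, CPU, accelerator).
KERNEL_SRC = """
__kernel void scale(__global float *x, const float a) {
    int i = get_global_id(0);
    x[i] *= a;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, KERNEL_SRC).build()   # the JIT step

x = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=x)

prg.scale(queue, x.shape, None, buf, np.float32(2.0))
cl.enqueue_copy(queue, x, buf)
print(x)   # scaled on whatever device the runtime picked
[/code]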

Which bit of it needs x86 exactly?

The old model is dead/irrelevant today. Nobody is following that model anymore.

Consider a hypothetical dual-core Bobcat chip with a 7870 on die. Now put 4 of these in a 4P server for the HPC space. Which bits of the latency-optimized + throughput-optimized nirvana we all dream of does this combination fall short of?

Consider a hypothetical quad-core Bulldozer chip with a 7770 on die. Now put 1 of these in a desktop. Which bits of the latency-optimized + throughput-optimized nirvana we all dream of does this combination fall short of?

Consider a hypothetical dual-core Bulldozer chip with a 7750 on die. Now put 1 of these in a notebook. Which bits of the latency-optimized + throughput-optimized nirvana we all dream of does this combination fall short of?

Binary compatibility to run sw outside of the niche it was written for is overrated. Get over it. :cool:
 
If you have millions of lines of code, do you expect people to go through and hand-tune everything? Of course not, you rely on your compiler to get most of the performance. Things that are used frequently may get hand-tuned in assembly, but it's pretty rare and a small portion of the total code.

Well, of course you don't need to tune every line of code; you profile to see what's eating CPU cycles and then go from there.
 
Optimizations are ISA-specific. I guess some people don't think about extracting the most performance from their code though.

A large body of optimizations are implementation-specific, and Larrabee is more anemic on the x86 side than even the originating P54 core, enough so that potentially any Pentium-type optimizations may be irrelevant to it.

Which ISA-specific optimizations are particularly relevant to a single-issue x86 that runs a workload that should be dominated by the vector throughput? We wouldn't have any classic compilers that have any reference to the vector component needed for dual-issue.
Just how bad would a compiler have to be to not get decent utilization out of a single-issue chip in a very short time period, and why is it that no other ISA has a couple dozen compiler back-ends that targeted single-issue variants in the decades they've been around?
 
A large body of optimizations are implementation-specific, and Larrabee is much more anemic on the x86 side than even the originating P54 core.

Which ISA-specific optimizations are particularly relevant to a single-issue x86 that runs a workload that should be dominated by the vector throughput? We wouldn't have any classic compilers that have any reference to the vector component needed for dual-issue.
Just how bad would a compiler have to be to not get decent utilization out of a single-issue chip in a very short time period, and why is it that no other ISA has a couple dozen compiler back-ends that targeted single-issue variants?

Depends on what the software dev wants to do with their app. Isn't that the whole point of Larrabee? Programmability above all else (with decent performance). I agree that the easiest way to get any sort of usefulness out of Larrabee would be to target its vector extensions rather than to attempt to write a 3d engine in x86 from the ground-up, but the option is there.
 
Binary compatibility to run sw outside of the niche it was written for is overrated.
Compatibility of legacy software aside, there's no need for heterogeneous processors on the same die/socket to use a different ISA. Most of the people I've spoken to consider a more unified ISA to be the end goal here, whether it be x86 or something else entirely, and consider the current state of having to target a pile of different ISAs far from ideal. Even with JIT (which is great of course) it's still a problem and definitely non-ideal. Sure you can make do on Cell-like models and such, but there's no question that it's harder and less flexible.

Having a standardized ISA obviously also doesn't preclude the existence of virtual ISAs, and indeed they still provide utility. That said, the existence of these virtual ISAs does not invalidate all of the benefits of having a standard hardware ISA.

I think your hostility towards this concept is a little unwarranted. While there obviously is a continuum of cost vs. benefit, you're dismissing all benefits out of hand and assuming a huge cost. Without knowing the solid numbers here (or maybe you do have inside info?) I think it's a bit naive for you to assume that the people involved haven't run the real numbers themselves. Of course you're welcome to your own opinions, but I'd ask for some moderation unless you have real numbers to back up your claims.

For my part as a software developer, it would be great if I had some throughput cores that were directly accessible just like any other core in the OS. It would be awesome if they shared a memory space and consumed the same binaries as the bigger cores, since that would allow my schedulers to do on-the-fly load balancing and keep the system fed all the time. Obviously there's a hardware cost to this, and the cost vs. benefit trade-off is what ultimately decides, but I really don't think you can make the argument that there's no benefit when there clearly is.
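
As a hypothetical sketch of what that would buy (assuming the throughput cores really did show up as ordinary OS-scheduled cores running the same binaries; the function names here are made up), the dispatch side collapses to a plain work pool with no device-specific path:

[code]
import os
from concurrent.futures import ThreadPoolExecutor

def do_work(chunk):
    # Stand-in for real work; the same compiled code would run on either core type.
    return sum(v * v for v in chunk)

chunks = [list(range(i, i + 1000)) for i in range(0, 100_000, 1000)]

# One pool over every core the OS exposes; the scheduler load-balances on the fly,
# with no separate offload API or second binary for the throughput cores.
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    total = sum(pool.map(do_work, chunks))

print(total)
[/code]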
 
NexGen's Nx586 is probably the first commercial x86 CPU to do so (it was released about a year earlier than the Pentium Pro).

Yeah, a RISC pretending to be a CISC!

I even had an old pic of a NexT PC with that CPU in a magazine from the '90s :smile:
 
Depends on what the software dev wants to do with their app. Isn't that the whole point of Larrabee? Programmability above all else (with decent performance). I agree that the easiest way to get any sort of usefulness out of Larrabee would be to target its vector extensions rather than to attempt to write a 3d engine in x86 from the ground-up, but the option is there.

I'm questioning the idea that the x86 side of Larrabee puts it at any advantage with regards to compilers and optimizations than any other CPU ISA when the target is a single-issue in-order processor.

That part of the chip is what would benefit from the long line of x86 compilers and the years of research.
That part is so limited in Larrabee that any other chip on any other ISA using a similar arrangement would not need significant effort to get equivalent results in a very short time frame. There's just not enough potential to waste, and the parts that have potential are too new to benefit.
 
Optimizations are ISA-specific.

Most optimizations are algorithmic and completely ISA-agnostic. Algorithmic improvements almost always boost performance way more than ISA-specific optimizations do. Sometimes ISA-specific is the most reasonable way forward, for instance if that matrix multiply that's used everywhere can be SSE-optimized. But when you're profiling and finding that some for-loop somewhere is consuming a surprisingly large percentage of the CPU time because it's looping over thousands of objects every frame, you don't begin by adding SSE code, inserting prefetch calls, etc. The first step is perhaps to separate active and inactive objects into separate lists so you only have to loop over the handful of active objects instead of checking the enable flag of them all. ISA-specific optimizations might have halved the CPU time, but the algorithmic optimization could perhaps shave off 90% of the CPU time.
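
A toy sketch of that last point (class and method names are made up for illustration):

[code]
# Before: every frame walks all objects just to check a flag.
def update_naive(objects, dt):
    for obj in objects:
        if obj.enabled:
            obj.update(dt)

# After: keep active and inactive objects in separate lists, so the per-frame
# loop only touches the handful of active ones. Same result, far less work,
# and completely ISA-agnostic.
class Scene:
    def __init__(self):
        self.active = []
        self.inactive = []

    def enable(self, obj):
        self.inactive.remove(obj)
        self.active.append(obj)

    def disable(self, obj):
        self.active.remove(obj)
        self.inactive.append(obj)

    def update(self, dt):
        for obj in self.active:
            obj.update(dt)
[/code]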
 