Larrabee delayed to 2011 ?

Something like x264 is very coarsely multithreaded though (also purely task based, so not a shining example of the virtues of coherency). Fiberizing it would require huge structural changes.
 
And before you ask what you could do with 3% more transistors, ask yourselves why LRB's bogo-flops/mm and bogo-flops/W are not very impressive wrt its competitors of today.
NVIDIA's flops/W and flops/mm2 aren't impressive compared to ATI either, but obviously that doesn't tell the whole story. When discussing things that are fundamentally about the *efficiency* of code running on a specific architecture, you need to compare the ideal algorithms for solving a *problem* on each architecture, not the same code on both. Already you have to write different code for ATI and NVIDIA DX11 parts if you want to get near the best performance on either.

In 2000, intrinsics made sense.

In 2010, OCL/DXCS/CUDA make sense.
I don't think anyone is disagreeing with you on this. Doesn't mean the stupid stuff in OCL/DX/CUDA has to stay that way though, and I think it's fair to say that a lot of the memory and execution model of these languages is borderline broken.
 
I am not hearing numbers.

Unless you work in the field, you likely won't hear numbers. You can easily calculate the overheads for the caches themselves (which are QUITE small). The communication overheads largely depend on which coherence protocol you are using, and by and large the good ones aren't public.
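For what it's worth, the per-line storage side of that overhead is easy to sketch. Here's a rough back-of-the-envelope calculation, assuming an illustrative 32 KB, 8-way cache with 64-byte lines, 48-bit physical addresses and 2 MESI state bits per line (all of these parameters are assumptions, not Larrabee's actual configuration):

```python
# Rough sketch of per-line storage overhead for a coherent cache.
# Assumed (illustrative) parameters: 32 KB, 8-way, 64-byte lines,
# 48-bit physical addresses, 2 MESI state bits per line.

line_bytes = 64
ways       = 8
size_bytes = 32 * 1024

data_bits   = line_bytes * 8                     # 512 payload bits per line
sets        = size_bytes // (line_bytes * ways)  # 64 sets
offset_bits = 6                                  # log2(64-byte line)
index_bits  = 6                                  # log2(64 sets)
tag_bits    = 48 - offset_bits - index_bits      # 36 tag bits
state_bits  = 2                                  # MESI fits in 2 bits

overhead = (tag_bits + state_bits) / data_bits
print(f"tag + state per line: {overhead:.1%}")                 # ~7.4% of the data array
print(f"coherence state alone: {state_bits / data_bits:.1%}")  # ~0.4%
```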


Why is software encode better than hardware encode? Because fundamentally, good encode is a decision-making process. There is a basic set of operations that can be accelerated, but they primarily just generate data that needs to be weighed and offset.
 
Most estimates point towards 35-40M transistors for x86 decode. While that may not sound like a lot in today's billion-transistor era, when you are talking about many-core (16+) designs, you will need quite a lot of x86 decoders to serve them. So that is a few hundred million transistors wasted on a Larrabee.
 
Most estimates point towards 35-40M transistors for x86 decode.

Err, I have trouble believing any estimate made by someone with even a moderate idea of the topic would point towards those figures. They're nonsense, sorry. Think about it historically. Hell, think about how many transistors are in a Nehalem core (excluding caches). Hating on x86 is cool and trendy and all, but this is a bit too much.
 
Most estimates point towards 35-40M transistors for x86 decode. While that may not sound like a lot in today's billion-transistor era, when you are talking about many-core (16+) designs, you will need quite a lot of x86 decoders to serve them. So that is a few hundred million transistors wasted on a Larrabee.

I think that is more transistors than shipped in any x86 core EVER. In fact, I'm pretty sure that excluding L1 caches, no one has eclipsed the 10 mil mark yet for a core.
 
I think that is more transistors than shipped in any x86 core EVER. In fact, I'm pretty sure that excluding L1 caches, no one has eclipsed the 10 mil mark yet for a core.

What?! Haha wow. I didn't know that. So what exactly is filling up the ~500M-transistor difference in a modern quad-core x86? Is that all just cache and other supporting circuitry?
 
What?! Haha wow. I didn't know that. So what exactly is filling up the ~500M-transistor difference in a modern quad-core x86? Is that all just cache and other supporting circuitry?

Take an Intel i7 die. The L3 cache is around 1/3 of the area. Another 10-20% goes to the uncore, and you are left with less than 15% for each core. And that includes the L1 and L2 caches, the latter taking a significant portion.
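Just as a sanity check on those fractions (using the rough shares quoted above, not measured die shots):

```python
# Area split sketch: L3 ~1/3 of the die, uncore taken at 15% (midpoint of
# the 10-20% range above), the rest split across 4 cores including L1/L2.
l3_share     = 1 / 3
uncore_share = 0.15
cores        = 4

per_core = (1 - l3_share - uncore_share) / cores
print(f"area per core (incl. L1/L2): {per_core:.1%}")  # ~12.9%, i.e. under 15%
```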
 
Caches use a lot of transistors but require comparatively little area: it's about 54 transistors per byte just for the cells, plus a few more for control, tags and so on. The 12MB of cache in a Core 2 Quad already accounts for most of its transistor count. Taking out the caches, a Phenom II core has less than 40 mil transistors, a classic Athlon less than 15 mil, and the original 8086 about 20k.
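To put a number on that, here's the arithmetic behind the Core 2 Quad point, taking the ~54 transistors/byte figure above and the commonly quoted ~820M total transistor count for that chip (both are ballpark figures):

```python
# Cache cell transistors vs. total die budget for a Core 2 Quad.
transistors_per_byte = 54                 # 6T SRAM cells plus some overhead
l2_bytes             = 12 * 1024 * 1024   # 2 x 6 MB of L2
die_transistors      = 820e6              # commonly quoted total (ballpark)

cache_transistors = transistors_per_byte * l2_bytes
print(f"L2 cells alone: ~{cache_transistors / 1e6:.0f}M transistors")  # ~680M
print(f"share of the die: {cache_transistors / die_transistors:.0%}")  # ~83%
```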

Also, not all decoders are created equal. On general-purpose CPUs (like Cores and Phenoms) the decoders are optimized to reduce latency, that is, to decode instructions as fast as possible after they are fetched. In CPUs like Larrabee they could instead be optimized for space, taking longer to decode instructions (more pipeline stages/lower clock) but occupying much less area.
 
Most estimates point towards 35-40M transistors for x86 decode. While that may not sound like a lot in today's billion-transistor era, when you are talking about many-core (16+) designs, you will need quite a lot of x86 decoders to serve them. So that is a few hundred million transistors wasted on a Larrabee.

:LOL:

It might even be 35-40 million transistors for the entire Larrabee die.

In Atom, which Larrabee is closest to, the core without the L2 caches and the I/O has less than 14 million transistors. Without the L1 cache we are talking about 10-11 million.

Do you realize how RIDICULOUS your comment sounds?
 
What do you think about this new piece by Charlie?
A clean-up of the x86 ISA? Something new? Is he wrong altogether?
 
The article didn't have much meat to it, given the number of words expended.
It's rather vague on the things I would have liked to know, such as what issues there were and what is to be changed, or what stepping Intel used in its live demo (given the numbers, it seems it was about half of what Intel hoped).

The "converged pipeline" scheme seems nebulous to me. Is this converged in the sense of a convergence with mainline x86 that makes it incompatible with Larrabee I, or a converged pipeline in the sense that there is no scalar and vector demarcation, and that this would be incompatible with both Larrabee I and mainline x86.

The latter interpretation would make the "it benefits from being x86" argument even less relevant than it was for the P54C chimera that was Larrabee I. The "Intel controls the compilers so breaking the ISA is cool" rationalization is just another point of how irrelevant the x86 ISA is for anyone but Intel.

If it turns a Larrabee core into basically a core that runs an AVX/LRBni hybrid, its execution units could someday be transplanted into a vector block, something akin to what AMD might be trying with its shared and separately scheduled FPU blocks in Bulldozer.

A core or computing cluster whose base granularity is that of a vector unit or units would be oddly familiar for the GPU folks.
 
The article didn't have much meat to it, given the number of words expended.
It's rather vague on the things I would have liked to know, such as what issues there were and what is to be changed, or what stepping Intel used in its live demo (given the numbers, it seems it was about half of what Intel hoped).

The "converged pipeline" scheme seems nebulous to me. Is this converged in the sense of a convergence with mainline x86 that makes it incompatible with Larrabee I, or a converged pipeline in the sense that there is no scalar and vector demarcation, and that this would be incompatible with both Larrabee I and mainline x86.

The latter interpretation would make the "it benefits from being x86" argument even less relevant than it was for the P54C chimera that was Larrabee I. The "Intel controls the compilers so breaking the ISA is cool" rationalization is just another point of how irrelevant the x86 ISA is for anyone but Intel.

If it turns a Larrabee core into basically a core that runs an AVX/LRBni hybrid, its execution units could someday be transplanted into a vector block, something akin to what AMD might be trying with its shared and separately scheduled FPU blocks in Bulldozer.

A core or computing cluster whose base granularity is that of a vector unit or units would be oddly familiar for the GPU folks.

It would make it very SPU-like too... well, the kind of SPU some people here want (multi-threaded, cache-based, etc.)... *cough* nAo *cough*...
 