Larrabee delayed to 2011 ?

Something like x264 is very coarsely multithreaded though (also purely task based, so not a shining example of the virtues of coherency). Fiberizing it would require huge structural changes.
 
And before you ask what you could do with 3% more transistors, ask yourselves why LRB's bogo-flops/mm and bogo-flops/W are not very impressive wrt its competitors of today.
NVIDIA's flops/W and flops/mm2 aren't impressive compared to ATI either, but obviously that doesn't tell the whole story. When discussing things that are fundamentally about the *efficiency* of code running on a specific architecture, you need to compare the ideal algorithms for solving a *problem* on each architecture, not the same code on both. Already you have to write different code for ATI and NVIDIA DX11 parts if you want to get near the best performance on either.

In 2000, intrinsics made sense.

In 2010, OCL/DXCS/CUDA make sense.
I don't think anyone is disagreeing with you on this. Doesn't mean the stupid stuff in OCL/DX/CUDA has to stay that way though, and I think it's fair to say that a lot of the memory and execution model of these languages is borderline broken.
 
I am not hearing numbers.

Unless you work in the field, you likely won't hear numbers. You can easily calculate the overheads for the caches themselves (which are QUITE small). The communication overheads largely depend on which coherence protocol you are using, and by and large the good ones aren't public.
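For what it's worth, the per-line storage side of that overhead is easy to sketch. Here's a rough back-of-the-envelope calculation, assuming an illustrative 32 KB, 8-way cache with 64-byte lines, 48-bit physical addresses and 2 MESI state bits per line (all of these parameters are assumptions, not Larrabee's actual configuration):

```python
# Rough sketch of per-line storage overhead for a coherent cache.
# Assumed (illustrative) parameters: 32 KB, 8-way, 64-byte lines,
# 48-bit physical addresses, 2 MESI state bits per line.

line_bytes = 64
ways       = 8
size_bytes = 32 * 1024

data_bits   = line_bytes * 8                     # 512 payload bits per line
sets        = size_bytes // (line_bytes * ways)  # 64 sets
offset_bits = 6                                  # log2(64-byte line)
index_bits  = 6                                  # log2(64 sets)
tag_bits    = 48 - offset_bits - index_bits      # 36 tag bits
state_bits  = 2                                  # MESI fits in 2 bits

overhead = (tag_bits + state_bits) / data_bits
print(f"tag + state per line: {overhead:.1%}")                 # ~7.4% of the data array
print(f"coherence state alone: {state_bits / data_bits:.1%}")  # ~0.4%
```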


Why is software encode better than hardware encode? Because fundamentally, good encode is a decision-making process. There is a basic set of operations that can be accelerated, but they primarily just generate data that needs to be weighed and offset.
 
Most estimates point towards 35-40M transistors for x86 decode. While that may not sound like a lot in today's billion-transistor era, when you are talking about many-core (16+) designs, you will need quite a lot of x86 decoders to serve them. So that is a few hundred million transistors wasted on a Larrabee.
 
Most estimates point towards 35-40M transistors for x86 decode.

Err, I have trouble believing any estimate made by someone with even a moderate idea of the topic would point towards those figures. They're nonsense, sorry. Think about it historically. Hell, think about how many transistors are in a Nehalem core (excluding caches). Hating on x86 is cool and trendy and all, but this is a bit too much.
 
Most estimates point towards 35-40M transistors for x86 decode. While that may not sound like a lot in today's billion-transistor era, when you are talking about many-core (16+) designs, you will need quite a lot of x86 decoders to serve them. So that is a few hundred million transistors wasted on a Larrabee.

I think that is more transistors than shipped in any x86 core EVER. In fact, I'm pretty sure that excluding L1 caches, no one has eclipsed the 10 mil mark yet for a core.
 
I think that is more transistors than shipped in any x86 core EVER. In fact, I'm pretty sure that excluding L1 caches, no one has eclipsed the 10 mil mark yet for a core.

What?! Haha wow. I didn't know that. So what exactly is filling up the ~500M-transistor difference in a modern quad-core x86? Is that all just cache and other supporting circuitry?
 
What?! Haha wow. I didn't know that. So what exactly is filling up the ~500M-transistor difference in a modern quad-core x86? Is that all just cache and other supporting circuitry?

Take an Intel i7 die. The L3 cache is around 1/3 of the area. Another 10-20% goes to the uncore, and you are left with less than 15% for each core. And that includes the L1 and L2 caches, the latter taking a significant portion.
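Just as a sanity check on those fractions (using the rough shares quoted above, not measured die shots):

```python
# Area split sketch: L3 ~1/3 of the die, uncore taken at 15% (midpoint of
# the 10-20% range above), the rest split across 4 cores including L1/L2.
l3_share     = 1 / 3
uncore_share = 0.15
cores        = 4

per_core = (1 - l3_share - uncore_share) / cores
print(f"area per core (incl. L1/L2): {per_core:.1%}")  # ~12.9%, i.e. under 15%
```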
 
Caches use a lot of transistors but require comparatively little area: it's about 54 transistors per byte just for the cells, plus a few more for control, tags and so on. The 12MB of cache in a Core 2 Quad already accounts for most of its transistor count. Taking out the caches, a Phenom II core has less than 40 mil transistors, a classic Athlon less than 15 mil, and the original 8086 about 20k.
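To put a number on that, here's the arithmetic behind the Core 2 Quad point, taking the ~54 transistors/byte figure above and the commonly quoted ~820M total transistor count for that chip (both are ballpark figures):

```python
# Cache cell transistors vs. total die budget for a Core 2 Quad.
transistors_per_byte = 54                 # 6T SRAM cells plus some overhead
l2_bytes             = 12 * 1024 * 1024   # 2 x 6 MB of L2
die_transistors      = 820e6              # commonly quoted total (ballpark)

cache_transistors = transistors_per_byte * l2_bytes
print(f"L2 cells alone: ~{cache_transistors / 1e6:.0f}M transistors")  # ~680M
print(f"share of the die: {cache_transistors / die_transistors:.0%}")  # ~83%
```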

Also, not all decoders are created equal. On general-purpose CPUs (like Cores and Phenoms) the decoders are optimized to reduce latency, that is, to decode instructions as fast as possible after they are fetched. In CPUs like Larrabee they could instead be optimized for space, taking longer to decode instructions (more pipeline stages/lower clock) but occupying much less area.
 
Most estimates point towards 35-40M transistors for x86 decode. While that may not sound like a lot in today's billion-transistor era, when you are talking about many-core (16+) designs, you will need quite a lot of x86 decoders to serve them. So that is a few hundred million transistors wasted on a Larrabee.

:LOL:

It might even be 35-40 million transistors for the entire Larrabee die.

In Atom, which Larrabee is closest to, the core without the L2 caches and the I/O has less than 14 million transistors. Without the L1 cache we are talking about 10-11 million.

Do you realize how RIDICULOUS your comment sounds?
 
What do you think about this new piece by Charlie?
A clean-up of the x86 ISA? Something new? Is he wrong altogether?
 
The article didn't have much meat to it, given the number of words expended.
It's rather vague on the things I would have liked to know, such as what issues there were and what is to be changed, or what stepping Intel used in its live demo (given the numbers, it seems it was about half of what Intel hoped).

The "converged pipeline" scheme seems nebulous to me. Is this converged in the sense of a convergence with mainline x86 that makes it incompatible with Larrabee I, or a converged pipeline in the sense that there is no scalar and vector demarcation, and that this would be incompatible with both Larrabee I and mainline x86.

The latter interpretation would make the "it benefits from being x86" argument even less relevant than it was for the P54C chimera that was Larrabee I. The "Intel controls the compilers so breaking the ISA is cool" rationalization is just another point of how irrelevant the x86 ISA is for anyone but Intel.

If it turns a Larrabee core into basically a core that runs an AVX/LRBni hybrid, its execution units could someday be transplanted into a vector block, something akin to what AMD might be trying with its shared and separately scheduled FPU blocks in Bulldozer.

A core or computing cluster whose base granularity is that of a vector unit or units would be oddly familiar for the GPU folks.
 
The article didn't have much meat to it, given the number of words expended.
It's rather vague on the things I would have liked to know, such as what issues there were and what is to be changed, or what stepping Intel used in its live demo (given the numbers, it seems it was about half of what Intel hoped).

The "converged pipeline" scheme seems nebulous to me. Is this converged in the sense of a convergence with mainline x86 that makes it incompatible with Larrabee I, or a converged pipeline in the sense that there is no scalar and vector demarcation, and that this would be incompatible with both Larrabee I and mainline x86.

The latter interpretation would make the "it benefits from being x86" argument even less relevant than it was for the P54C chimera that was Larrabee I. The "Intel controls the compilers so breaking the ISA is cool" rationalization is just another point of how irrelevant the x86 ISA is for anyone but Intel.

If it turns a Larrabee core into basically a core that runs an AVX/LRBni hybrid, its execution units could someday be transplanted into a vector block, something akin to what AMD might be trying with its shared and separately scheduled FPU blocks in Bulldozer.

A core or computing cluster whose base granularity is that of a vector unit or units would be oddly familiar for the GPU folks.

It would make it very SPU-like too... well, the kind of SPU some people here want (multi-threaded, cache-based, etc.)... *cough* nAo *cough*...
 