Intel working on Larrabee 2.

If you want to discuss it, please open a new thread; in this context it is very clear what Intel meant by emulation, and that settles it.
 
So if we already expect a 64-core 2GHz Larrabee I, does anyone care to speculate what they have in mind for Larrabee II?
Hard to tell when we don't even know how many TMUs LRB has. Though I guess we can expect LRB2 to be mass-produced on a 32nm process (if, as it seems, LRB1 is on a 45nm process).
 

We might have to differentiate between the 32-core Larrabee and 64-core variant.
At 45nm, Atom weighs in at 47 million transistors and 25mm^2.
A Larrabee core was speculated to be around 30-33 million transistors.
Going naively by transistor count, a Larrabee core should be north of 16mm^2.

64 x 16mm^2 is 1024mm^2, and no reticle goes that high.

32 Larrabee cores sounds doable at 45nm, while 64 looks more like an intermediate product between the first Larrabee chips and the second generation.
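For anyone who wants to redo that back-of-the-envelope math, here is a rough sketch in Python. Only the Atom figures (47M transistors, 25mm^2 at 45nm) and the 30-33 million per-core guess come from the posts above; the assumption that Atom's density carries over to a Larrabee core is exactly the naive part.

```python
# Naive area estimate: assume a Larrabee core packs transistors about as
# densely as Atom does on the same 45nm process (a big assumption).
atom_transistors_m = 47.0        # million transistors
atom_area_mm2 = 25.0             # mm^2
lrb_core_transistors_m = 31.5    # midpoint of the 30-33 million guess

density = atom_transistors_m / atom_area_mm2        # ~1.9 Mtransistors/mm^2
core_area_mm2 = lrb_core_transistors_m / density    # ~16.8 mm^2 per core

for cores in (32, 64):
    print(f"{cores} cores -> ~{cores * core_area_mm2:.0f} mm^2 of core area alone")
# 32 cores -> ~536 mm^2, 64 cores -> ~1072 mm^2, and that's before TMUs, the
# ring bus and memory controllers; the 64-core figure is past any reticle limit.
```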
 
None of the board schematics showed two chips, though official info is understandably scarce.

Not much has been said about Larrabee's multi-socket interconnect links, and given the probable TDP of a single Larrabee, they might have to wait until 32nm to fit two of them on a board.

The other question is whether it would be cache-coherent between the chips. Many-core coherent designs like Niagara handle coherence well on-die, but they are extremely limited when it comes to socket scaling.
 

Hang on, it's said that Larrabee (and Dunnington) are 1.9 billion transistors @ 45nm... so:

64 * 30 million = 1.92 billion transistors,

and 1.92 billion transistors at 45nm = ~503mm^2 (or something like that).

So something isn't adding up? Is it that cache transistors are cheaper / smaller than logic?
 
Those numbers sound bogus, and we still haven't factored in the TMUs, ring bus, memory controllers, etc.
 
Sorry if this is off topic, but a few people seem confused. Feel free to ignore or have a mod move it.

For a hardware designer simulation means running a standard software simulator on a normal CPU, using something like Synopsys VCS. The major advantage of simulations is that you can do anything you want: get a waveform of every single net in the design, arbitrarily change values wherever you want, do all sorts of testbench hacks that don't translate into anything you can actually do in hardware. The problem is that simulations run at like 1/1,000,000 times real speed, so you can only do very basic functional testing at this level.

Emulation is very different: it actually synthesizes the design to run on generic hardware. The simplest way to do this is to compile the design onto an FPGA. The problem is that FPGAs are orders of magnitude too small for something like LRB. The solution is basically a gigantic box full of things resembling FPGAs, like Mentor's Veloce. The draw of emulation is that you can get a system running at maybe 1/1,000 or even 1/100 of real speed (depending on size, complexity, etc.). It's painfully slow, but fast enough to run real programs on and get real results from. The problem is that you're basically running in a huge black box, so when something goes wrong it's non-obvious how to debug the system. Depending on the emulation box there are also all sorts of strange restrictions and quirks (nothing ever Just Works).
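To put those slowdown factors in perspective, here is a quick illustrative calculation; the 1/1,000,000 and 1/1,000-1/100 numbers are just the rough figures quoted above, not measurements:

```python
# Wall-clock time needed to cover 1 second of target time at each slowdown.
slowdowns = {
    "software simulation (e.g. VCS)": 1_000_000,
    "emulation, slow case": 1_000,
    "emulation, fast case": 100,
}
for name, factor in slowdowns.items():
    hours = factor / 3600
    print(f"{name}: 1 s of target time takes ~{hours:.2f} h of wall clock")
# ~278 h for simulation versus ~0.28 h (17 min) or ~0.03 h (under 2 min) for
# emulation, which is why only emulation is fast enough to run real programs.
```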

Intel's looking for someone to help get LRB2 up and running on whatever emulation platform they use. It's a pretty common job, just strange that they would actually mention LRB2 in the posting.
 

Dunnington's transistor count is nearly two-thirds cache, yet the cache could fill less than half the die.

Dunnington's die size also includes a very bulky system interface that ties all the cores together. As has been mentioned, the non-core elements of Larrabee have not been factored into the area and transistor count estimations, while the hard numbers for Dunnington have figured them in by default.

Larrabee's cache transistor count per core is maybe half, and the cache might take up a quarter of a core block's area.
The question is how densely the logic can be packed, especially with the big stonkin' vector unit.
The ratio of densely packed cache to more spread-out logic is lower for Larrabee than for Dunnington.

Atom is the closest contemporaneous x86 to Larrabee. There are marked differences, so the comparison must be very, very loose, but it is much closer philosophically to Larrabee than, say, Core 2.
There are a bunch of factors, such as the FSB interface, a less dense L1, the modular layout, and different circuit design targets, that can make Atom less dense than it could be, but the idea that it could be more than three times too large compared to a Larrabee core on the exact same process node just doesn't sit well with me.

Maybe Intel has done a fantastic job of compressing the core; otherwise the physical constraints make such a chip too big.
At 32nm, 64 cores sounds far more doable.
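To make the cache-versus-logic point concrete, here is a small sketch. Only the Dunnington totals (about 1.9 billion transistors in roughly 503mm^2) and the "two thirds cache" ratio come from the thread; the per-type densities are invented purely for illustration.

```python
# How a cache-heavy chip ends up looking much denser than its logic really is.
total_transistors_m = 1900.0   # ~1.9 billion (Dunnington-class figure)
cache_fraction = 2.0 / 3.0     # "nearly two thirds cache"
cache_density = 8.0            # Mtransistors/mm^2 (assumed; SRAM packs tightly)
logic_density = 2.0            # Mtransistors/mm^2 (assumed; logic is far sparser)

cache_area = total_transistors_m * cache_fraction / cache_density          # ~158 mm^2
logic_area = total_transistors_m * (1.0 - cache_fraction) / logic_density  # ~317 mm^2
print(f"cache ~{cache_area:.0f} mm^2, logic ~{logic_area:.0f} mm^2, "
      f"total ~{cache_area + logic_area:.0f} mm^2")
# Two thirds of the transistors end up in barely a third of the die, so the
# chip-wide transistors-per-mm^2 figure flatters logic-heavy designs like Larrabee.
```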
 

So Nvidia's GT200 wouldn't be a good philosophical comparison then?

1.4 billion - 65nm - 570mm^2

Assuming linear scaling, we see that...

1.9 billion - 45nm - 380mm^2

From this perspective a 64-core 45nm part could be reached with ~500mm^2. Granted, it would be INSANE! I'm sorry, I'm being naive; I realise there are complex variables I don't understand that affect all this.
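For what it's worth, here is one way to arrive at a number in the neighbourhood of that 380mm^2 figure, spelling out the "linear scaling" assumption (which, as the next reply points out, doesn't really hold, especially for logic):

```python
# Naive scaling of GT200: assume an ideal 65nm -> 45nm area shrink and that
# area grows linearly with transistor count. Neither assumption really holds.
gt200_transistors_b = 1.4
gt200_area_mm2 = 570.0
target_transistors_b = 1.9

ideal_shrink = (45.0 / 65.0) ** 2   # ~0.48x area per transistor
scaled_area = gt200_area_mm2 * ideal_shrink * (target_transistors_b / gt200_transistors_b)
print(f"~{scaled_area:.0f} mm^2")   # ~371 mm^2, i.e. roughly the 380mm^2 quoted above
```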
 
So Nvidia's GT200 wouldn't be a good philosophical comparison then?
I don't think it's close enough; the designs and processes are very different and difficult to compare. One chip already exists, and the other, at least publicly, does not.
Atom, as a multithreaded in-order x86, has more shared heritage, with the exception of a few extra threads and the vector units.

1.4 billion - 65nm - 570mm^2
Plus whatever transistors and area NVIO takes up, unless Larrabee has a separate chip for that too.

A huge chunk of that die area is also taken up by things like texturing units, the rasterizer, memory controllers, and ROPs.
Larrabee may dispense with a lot of that, but we know it has texture units and memory controllers.

You might find the possible area of the cores themselves, to the exclusion of everything else the chip needs.

Assuming linear scaling, we see that...
That assumption is one that hasn't held true for some time, particularly for logic.

1.9 billion - 45nm - 380mm^2

From this perspective a 64-core 45nm part could be reached with ~500mm^2.
Your GPU baseline uses area figures for a chip with a lot of area not devoted to computation cores.
At best, we get a possibly optimistic estimate of a fraction of Larrabee's die size, though given that the chip is mostly x86 cores, that fraction would be a majority of the die.
 

Also, NVIDIA's GT200 has a very very low density (think non-fill-cell-area/total-area), so NV can do much better in terms of perf/mm^2, etc. I would think that something with a very optimal, regular layout can achieve much much more (like RV770, sorry for the low blow NV :p )
 

Well, considering that Larrabee is not a 'thoroughbred' GPU, but relies on software threads to implement certain functionality, I don't think the perf/mm^2 will be anywhere near optimal for Larrabee.
 
Well, in 'computer architecture', or at least as I have always understood it, an emulator is a program that reproduces the functionality of some piece of hardware, let's say a functional emulator (have you ever played with M.A.M.E. or any other of the old arcade or console emulators?), and a simulator is a program that simulates the timing and limitations of the hardware being reproduced. A simulator is used for performance evaluation. Usually a simulator includes, or works alongside, a functional emulator. The emulator would be used by the software teams to start early testing of your software (or drivers). The simulator is used by the architects to prove that the P4 at 5 GHz will fly... and then come all those damned RTL and silicon guys who screw everything up, and you get a small section of nuclear reactor core on your motherboard :LOL:

I have already had a few discussions with electrical engineers telling me that it is completely the other way around, or that emulator means hardware and simulator means software (that's FPGA vs. RTL tools like VCS).

As for that job offer, I'm not really sure what it means; it's either FPGA or just functional emulation for SDK or early software development. But mixing so many concepts is confusing, and could also mean that whoever is hiring doesn't actually know where exactly the new hire will finally land ;)
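If it helps to see the computer-architecture distinction in code form, here is a deliberately silly sketch for a made-up two-instruction machine; everything in it is hypothetical, it just shows that the emulator only cares about results while the simulator also models timing:

```python
# Toy "ISA": a trace of (opcode, operand) pairs, purely for illustration.
PROGRAM = [("ADD", 3), ("ADD", 4), ("LOAD", 0), ("ADD", 1)]

def emulate(program):
    """Functional emulation: reproduce behaviour, no notion of time."""
    acc = 0
    for op, val in program:
        if op == "ADD":
            acc += val
        # LOAD is modelled as returning 0 in this toy, so it changes nothing.
    return acc

def simulate(program, add_latency=1, load_latency=100):
    """Performance simulation: same behaviour, plus a crude cycle count."""
    acc, cycles = 0, 0
    for op, val in program:
        if op == "ADD":
            acc += val
            cycles += add_latency
        else:  # LOAD: the result is dull here, but its latency is the whole point
            cycles += load_latency
    return acc, cycles

print(emulate(PROGRAM))    # 8        -> enough for early software/driver bring-up
print(simulate(PROGRAM))   # (8, 103) -> what the architects argue about
```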
 
Also, NVIDIA's GT200 has a very very low density (think non-fill-cell-area/total-area), so NV can do much better in terms of perf/mm^2, etc. I would think that something with a very optimal, regular layout can achieve much much more (like RV770, sorry for the low blow NV :p )
Hey, only the truth hurts! :) (and in this case, that means it probably hurts a lot...) I think it's pretty mind-blowing that they decided to go down that road to maximize yields, while simultaneously having *no* coarse-grained redundancy for the GTX 280 and presumably little fine-grained redundancy. It doesn't make any sense to me unless, for some bizarre reason, parametric yield problems increase super-exponentially as you increase density. Of course, maybe their flow encouraged a non-regular layout, but I suspect that's probably not the main issue.

In my opinion, the main difficulty for next-generation architectures will be to find the right mix of fine-grained redundancy, coarse-grained redundancy, and asynchronicity. By the latter, I mean different clock speeds for otherwise identical parts of the chip in order to fight variability. At the same time, triangle setup rates should be expanded significantly above 1T/c, which imposes complex restrictions of its own throughout the architecture.

Each potentially asynchronous block would need to be big enough to minimize overhead and allow for sufficient fine-grained redundancy (so as to make 1 in X defects due to high density irrelevant), while at the same time being small enough to make coarse-grained redundancy viable and allow for good performance scalability. Fun stuff! (or hey we could just do it pseudo-randomly, wouldn't be the first time :D)
 
Well, considering that Larrabee is not a 'thoroughbred' GPU, but relies on software threads to implement certain functionality, I don't think the perf/mm^2 will be anywhere near optimal for Larrabee.

That's a very good point. Architecturally, Larrabee isn't very efficient due to the fact that it's very general purpose. However, I bet that Intel can meet or surpass ATI's RV770 transistor density easily.
 
Hey, only the truth hurts! :) (and in this case, that means it probably hurts a lot...) I think it's pretty mind-blowing that they decided to go down that road to maximize yields, while simultaneously having *no* coarse-grained redundancy for the GTX 280 and presumably little fine-grained redundancy. It doesn't make any sense to me unless, for some bizarre reason, parametric yield problems increase super-exponentially as you increase density. Of course, maybe their flow encouraged a non-regular layout, but I suspect that's probably not the main issue.


I think that they were just lazy because of no competition :p
 
That's a very good point. Architecturally, Larrabee isn't very efficient due to the fact that it's very general purpose. However, I bet that Intel can meet or surpass ATI's RV770 transistor density easily.
Nearly by definition, higher flexibility implies a larger ratio of control logic. While this should not be exaggerated, I think it would be hard to argue that it helps density... (control logic is neither naturally regular nor easy to tweak manually at a fine level). I don't disagree with you at all that Intel has a natural advantage here, but they don't have everything going for them either.

I think that they were just lazy because of no competition :p
Well, if you really wanted to be a nice guy, you could also argue that they over-prioritized DX11 because of the Larrabee threat, as well as 40/45nm because of the short G80->65nm->45nm time gap. However, I'm not convinced they did that, or that they got lazy; I prefer the simpler explanation that they just screwed up. At the same time, AMD came out with a wonderful incremental improvement.
 