Intel working on Larrabee 2.

Discussion in 'Architecture and Products' started by nAo, Oct 4, 2008.

  1. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
If you want to discuss it, please open a new thread; in this context it's very clear what Intel meant by emulation, and that settles it.
     
  2. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    So if we already expect a 64 core 2GHz Larrabee I, anyone care to speculate what they have in mind for Larrabee II?
     
  3. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
Hard to tell when we don't even know how many TMUs LRB has. Though I guess we can expect LRB2 to be mass-produced on a 32nm process (if, as it seems, LRB1 is on a 45nm process).
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,193
    Likes Received:
    3,133
    Location:
    Well within 3d
    We might have to differentiate between the 32-core Larrabee and 64-core variant.
At 45nm, Atom weighs in at 47 million transistors and 25 mm².
A Larrabee core was speculated to be around 30-33 million transistors.
Going naively by transistor count, a Larrabee core should be north of 16 mm².

64 × 16 mm² is 1024 mm², and no reticle goes that high.

    32 Larrabee cores sounds doable at 45nm, while 64 looks like the middle product between the first Larrabee chips and the second generation.
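The back-of-envelope scaling above can be written out explicitly. This is just a sketch of the naive estimate: all inputs are the speculative numbers from this thread (Atom's published figures, a guessed per-core transistor count), not official Intel data.

```python
# Naive die-area estimate for a many-core Larrabee at 45nm, scaling
# linearly from Atom's published transistor count and area.
# All Larrabee numbers are thread speculation, not official figures.

ATOM_TRANSISTORS_M = 47.0        # Atom at 45nm, millions of transistors
ATOM_AREA_MM2 = 25.0             # Atom die area, mm^2
LRB_CORE_TRANSISTORS_M = 32.0    # speculated Larrabee core, ~30-33M

def core_area_mm2(core_transistors_m):
    """Scale area linearly with transistor count at the same node."""
    return ATOM_AREA_MM2 * core_transistors_m / ATOM_TRANSISTORS_M

def die_area_mm2(cores, core_transistors_m=LRB_CORE_TRANSISTORS_M):
    """Cores only -- ignores TMUs, ring bus, memory controllers, etc."""
    return cores * core_area_mm2(core_transistors_m)

for n in (32, 64):
    print(f"{n} cores: ~{die_area_mm2(n):.0f} mm^2")
```

At these assumed inputs, 32 cores come out around 545 mm² (large but buildable) and 64 cores well past any reticle limit, which is the point being made above.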
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Two chips on the same board configuration? :)
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,193
    Likes Received:
    3,133
    Location:
    Well within 3d
    None of the board schematics showed two chips, though official info is understandably scarce.

Not much has been said about Larrabee's multisocket interconnect links, and given the probable TDP of a single Larrabee, they might have to wait until 32nm to fit two of them on a board.

    The other question is whether it would be cache-coherent between the chips. Many-core coherent designs like Niagara handle coherence well on-die, but they are extremely limited when it comes to socket scaling.
     
  7. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
Hang on, it's said that Larrabee (and Dunnington) are 1.9 billion transistors @ 45nm... so:

64 × 30 million = 1.92 billion transistors,

and 1.92 billion transistors at 45nm ≈ 503 mm² (or something like that).

So something isn't adding up? Is it that cache transistors are cheaper / smaller than logic?
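The arithmetic in the post checks out, which is exactly why the conclusion looks odd. A quick sketch, using only the rough public numbers quoted in this thread: if Larrabee matched Dunnington's overall transistor density, 64 cores really would fit in roughly 500 mm², but Dunnington is mostly cache, which packs far denser than logic.

```python
# Checking the post's numbers: total transistors for 64 cores, and the
# area that count would occupy at Dunnington's overall density.
# Figures are the rough public numbers quoted in the thread.

DUNNINGTON_TRANSISTORS = 1.9e9
DUNNINGTON_AREA_MM2 = 503.0

CORES = 64
TRANSISTORS_PER_CORE = 30e6

total = CORES * TRANSISTORS_PER_CORE                      # 1.92 billion
density = DUNNINGTON_TRANSISTORS / DUNNINGTON_AREA_MM2    # transistors/mm^2
area_if_dunnington_density = total / density

print(f"total transistors: {total / 1e9:.2f}B")
print(f"area at Dunnington density: {area_if_dunnington_density:.0f} mm^2")
```

The catch, as the next replies spell out, is that a logic-heavy chip cannot hit a cache-heavy chip's density, so the ~500 mm² figure is a floor, not an estimate.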
     
  8. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Yes.
     
  9. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
Those numbers sound bogus, and we still haven't factored in the TMUs, ring bus, memory controller, etc.
     
  10. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    60
    Sorry if this is off topic, but a few people seem confused. Feel free to ignore or have a mod move it.

    For a hardware designer simulation means running a standard software simulator on a normal CPU, using something like Synopsys VCS. The major advantage of simulations is that you can do anything you want: get a waveform of every single net in the design, arbitrarily change values wherever you want, do all sorts of testbench hacks that don't translate into anything you can actually do in hardware. The problem is that simulations run at like 1/1,000,000 times real speed, so you can only do very basic functional testing at this level.

    Emulation is very different, it's actually synthesizing the design to run on generic hardware. The simplest way to do this is to compile the design onto an FPGA. Problem is FPGAs are orders of magnitude too small for something like LRB. The solution is basically a gigantic box full of things resembling FPGAs, like Mentor's Veloce. The draw of emulation is that you can get a system running at maybe 1/1,000 or even 1/100 times real speed (depending on size, complexity, etc). It's painfully slow, but fast enough to run real programs and get real results on. The problem is that you're basically running in a huge black box, so when something goes wrong it's non-obvious how to debug the system. Depending on the emulation box there are also all sorts of strange restrictions and quirks (nothing ever Just Works).
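To put the slowdown factors above in perspective, here is a trivial sketch of the wall-clock cost of each approach. The one-second target workload and the exact factors are illustrative assumptions, not figures from any real project.

```python
# Rough wall-clock cost of the slowdown factors mentioned above:
# software simulation (~1,000,000x slower than real time) vs. hardware
# emulation (~100x-1,000x slower).

SECONDS_PER_DAY = 86400

def wall_clock(target_seconds, slowdown):
    """Real time (seconds) needed to cover `target_seconds` of the
    design's own execution at the given slowdown factor."""
    return target_seconds * slowdown

one_second = 1.0
print(f"simulation (1e6x): {wall_clock(one_second, 1_000_000) / SECONDS_PER_DAY:.1f} days")
print(f"emulation  (1e3x): {wall_clock(one_second, 1_000) / 60:.1f} minutes")
print(f"emulation  (1e2x): {wall_clock(one_second, 100):.0f} seconds")
```

One simulated second at a millionfold slowdown is on the order of eleven days of wall-clock time, which is why simulation is confined to basic functional testing while emulation can run real programs.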

    Intel's looking for someone to help get LRB2 up and running on whatever emulation platform they use. It's a pretty common job, just strange that they would actually mention LRB2 in the posting.
     
    #30 Rufus, Oct 7, 2008
    Last edited by a moderator: Oct 7, 2008
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,193
    Likes Received:
    3,133
    Location:
    Well within 3d
    Dunnington's transistor count is nearly two thirds cache, and the cache could fill less than half the die.

    Dunnington's die size also includes a very bulky system interface that ties all the cores together. As has been mentioned, the non-core elements of Larrabee have not been factored into the area and transistor count estimations, while the hard numbers for Dunnington have figured them in by default.

    Larrabee's cache transistor count per core is maybe half, and the cache might take up a quarter of a core block's area.
    The question is how densely the logic can be packed, especially with the big stonkin' vector unit.
    The ratio of highly compressed cache versus more expanded logic is lower for Larrabee.

Atom is the closest contemporaneous x86 to Larrabee. There are marked differences, so the comparison must be very, very loose, but philosophically it is much closer to Larrabee than, say, Core2.
There are a bunch of factors (the FSB interface, a less dense L1, modular layout, and different circuit design targets) that can make Atom less dense than it could be, but the idea that it could be more than 3 times too large compared to Larrabee on the exact same process node just doesn't sit well with me.

Maybe Intel has done a fantastic job of compressing the core; otherwise, the physical constraints make such a chip too big.
    At 32nm, 64 cores sounds far more doable.
     
  12. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
So Nvidia's GT200 wouldn't be a good philosophical comparison then?

1.4 billion - 65nm - 570mm²

Assuming linear scaling we see that...

1.9 billion - 45nm - 380mm²

From this perspective a 64-core 45nm part could be reached at around 500mm². Granted, it would be INSANE! I'm sorry, I am being naive; I realise there are complex variables I don't understand that affect all this.
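For what it's worth, the linear-scaling estimate in the post can be reproduced directly from GT200's figures. Note that the ideal (45/65)² area shrink assumed here is optimistic; real process shrinks do worse, which is part of the objection raised in the reply below.

```python
# Reproducing the post's estimate: scale GT200's density to 45nm
# assuming an ideal (45/65)^2 area shrink, then size a 1.9B-transistor
# chip at that density. Ideal scaling is an optimistic assumption.

GT200_TRANSISTORS = 1.4e9
GT200_AREA_MM2 = 570.0
OLD_NODE_NM, NEW_NODE_NM = 65.0, 45.0

shrink = (NEW_NODE_NM / OLD_NODE_NM) ** 2          # ideal area factor
density_45 = GT200_TRANSISTORS / (GT200_AREA_MM2 * shrink)

target_transistors = 1.9e9
est_area = target_transistors / density_45

print(f"ideal shrink factor: {shrink:.2f}")
print(f"estimated area for 1.9B transistors at 45nm: {est_area:.0f} mm^2")
```

This lands around 370 mm², close to the 380 mm² quoted above; the gap between that and a realistic shrink is where the estimate gets shaky.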
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,193
    Likes Received:
    3,133
    Location:
    Well within 3d
I don't think it's close enough; the designs and processes are very different and difficult to compare. One chip already exists, and the other, at least publicly, does not.
Atom, as a multithreaded in-order x86 (setting aside the number of threads and the vector units), has more shared heritage.

    Plus whatever transistors and area NVIO takes up, unless Larrabee has a separate chip for that too.

    A huge chunk of that die area is also taken up by things like texturing units, the rasterizer, memory controllers, and ROPs.
    Larrabee may dispense with a lot of that, but we know it has texture units and memory controllers.

    You might find the possible area of the cores themselves, to the exclusion of everything else the chip needs.

    That assumption is one that hasn't held true for some time, particularly for logic.

    Your basis with the GPU uses area figures for a chip with a lot of area not devoted to computation cores.
    At best, we get a possibly optimistic estimate of a fraction of Larrabee's die size, though given it is mostly x86 it would be a majority of the die.
     
  14. Shadowmage

    Newcomer

    Joined:
    Sep 30, 2005
    Messages:
    60
    Likes Received:
    3
Also, NVIDIA's GT200 has a very, very low density (think non-fill-cell area / total area), so NV could do much better in terms of perf/mm², etc. I would think that something with a very optimal, regular layout can achieve much, much more (like RV770; sorry for the low blow, NV :p )
     
  15. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
Well, considering that Larrabee is not a 'thoroughbred' GPU, but relies on software threads to implement certain functionality, I don't think the perf/mm² will be anywhere near optimal for Larrabee.
     
  16. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
Well, in 'computer architecture', or at least how I have always understood it, an emulator is a program that reproduces the functionality of some piece of hardware, let's say a functional emulator (have you ever played with M.A.M.E. or any of the other old arcade or console emulators?), and a simulator is a program that simulates the 'timing' and limitations of the hardware being reproduced. A simulator is used for performance evaluation. Usually a simulator includes, or works alongside, a functional emulator. The emulator would be used by the software teams to start early testing of your 'software' (or drivers). The simulator is used by the architects to prove that the P4 at 5 GHz will fly... and then all those damned RTL and silicon guys come along and screw everything up, and you get a small section of a nuclear reactor core on your motherboard :lol:

I have already had a few discussions with electrical engineers telling me that it's completely the other way around: that 'emulator' means hardware and 'simulator' means software (that's the FPGA vs. RTL-tools-like-VCS distinction).

As for that job offer, I'm not really sure what it means; it's either FPGA work or just functional emulation for an SDK or early software development. But mixing so many concepts is confusing, and could also mean that whoever is hiring doesn't actually know where exactly the new hire will finally land :wink:
     
  17. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    300
    Location:
    UK
Hey, only the truth hurts! :) (and in this case, that means it probably hurts a lot...) - I think it's pretty mind-blowing that they decided to go down that road to maximize yields, while simultaneously having *no* coarse-grained redundancy for the GTX 280 and presumably little fine-grained redundancy. It doesn't make any sense to me unless, for some bizarre reason, parametric yield problems increase super-exponentially as you increase density. Of course, maybe their flow encouraged a non-regular layout, but I suspect that's probably not the main issue.

In my opinion, the main difficulty for next-generation architectures will be to find the right mix of fine-grained redundancy, coarse-grained redundancy, and asynchronicity. By the latter, I mean different clock speeds for otherwise identical parts of the chip in order to fight variability. At the same time, triangle setup rates should be expanded significantly above 1 triangle/clock, which imposes complex restrictions of its own throughout the architecture.

    Each potentially asynchronous block would need to be big enough to minimize overhead and allow for sufficient fine-grained redundancy (so as to make 1 in X defects due to high density irrelevant), while at the same time being small enough to make coarse-grained redundancy viable and allow for good performance scalability. Fun stuff! (or hey we could just do it pseudo-randomly, wouldn't be the first time :D)
     
  18. Shadowmage

    Newcomer

    Joined:
    Sep 30, 2005
    Messages:
    60
    Likes Received:
    3
    That's a very good point. Architecturally, Larrabee isn't very efficient due to the fact that it's very general purpose. However, I bet that Intel can meet or surpass ATI's RV770 transistor density easily.
     
  19. Shadowmage

    Newcomer

    Joined:
    Sep 30, 2005
    Messages:
    60
    Likes Received:
    3

    I think that they were just lazy because of no competition :p
     
  20. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    300
    Location:
    UK
Nearly by definition, higher flexibility implies a larger ratio of control logic. While this should not be exaggerated, I think it would be hard to argue that it helps to improve density... (control logic is neither naturally regular nor easy to tweak manually at a fine level). I don't disagree with you at all that Intel has a natural advantage here, but they don't have everything going for them either.

Well, if you really wanted to be a nice guy, you could also argue they over-prioritized DX11 because of the Larrabee threat, as well as 40/45nm because of the short G80->65nm->45nm time gap. However, I'm not convinced they did that, or that they got lazy. I prefer the simpler explanation that they just screwed up, while at the same time AMD came out with a wonderful incremental improvement.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.