Larrabee at GDC 09

Discussion in 'Architecture and Products' started by bowman, Feb 16, 2009.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Nominally you would write a program that loops over a set of work items. The body of that loop is concerned with a single work-item. The indexing just lets multiple instances of your program (thread) point to different work items. Not sure what you're trying to say here. Of course it's vectorized for you since you're not doing it yourself - there's no explicit mapping of groups of work items to the hardware configuration.
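A minimal sketch of that model in Python (the names kernel and global_id are illustrative, not any particular API):

```python
# You write only the loop body (the "kernel"); an index identifies
# which work item each running instance of the program handles.
work_items = [1.0, 2.0, 3.0, 4.0]
out = [0.0] * len(work_items)

def kernel(global_id, items, out):
    # The body is concerned with a single work item.
    out[global_id] = items[global_id] * 2.0

# The runtime/hardware performs this fan-out across lanes/threads;
# you never write the loop or map work items to hardware yourself.
for global_id in range(len(work_items)):
    kernel(global_id, work_items, out)

print(out)  # [2.0, 4.0, 6.0, 8.0]
```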
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
I was playing on the built-in joke of "scalar" in your text, i.e., scalar things aren't parallel by definition.
     
  3. JasonCross

    Newcomer

    Joined:
    Jul 14, 2005
    Messages:
    39
    Likes Received:
    4
    I think 12 cores must be WAY off.

    Consider that the P54C (which the Larrabee core is based on) had 16K of L1 and no onboard L2 cache, and weighed in at 3.1 million transistors. It was 4.5M transistors for the Pentium with MMX, which had twice the L1 and 256K of L2, if memory serves.

    LRB's core is based on P54C, with 4X the L1 cache, and 256K of L2 cache, and of course the really big vector unit. How much bigger does the vector unit make each core, in terms of transistor count? Twice as big? Three times? Maybe it adds 10M transistors to each core?

Let's assume that, making each core some ~13M transistors. You could still pack 64 cores in 832M transistors. Add in the texture units, memory controller, and ring bus between the L2 caches, and it's not hard to predict that LRB would be smaller on a 45nm process than GT200 was on 65nm. Way smaller. Intel has also stated that they're keeping the pipelines really short.
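The arithmetic, spelled out (all the per-core figures are the guesses from the paragraph above, not Intel numbers):

```python
# Back-of-the-envelope budget: P54C-derived core plus a big vector unit.
p54c = 3.1e6        # original P54C, ~3.1M transistors
vpu_guess = 10e6    # guessed cost of the wide vector unit
core = 13e6         # rounded per-core total, bigger caches included

print(64 * core / 1e6, "M transistors for 64 cores")  # 832.0
```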

I also highly doubt the 512-bit external memory bus being floated around. One of the key differentiating features of LRB, going back to the first time Intel publicly talked about it, is what they claim is FAR less bandwidth needed to render a scene. Chalk that up to a tiled, deferred renderer and lots of cache per core (big enough that some of the necessary data may still be in there when the next frame comes around). Intel would talk about how Larrabee uses 1/4 to 1/2 the bandwidth per frame of conventional graphics products, averaging somewhere around 1/3.

    Assuming it still requires 1/2 the bandwidth per frame as most graphics cards, you really only need something like a 384-bit GDDR3 memory interface with reasonable memory speeds, or a 256-bit GDDR5 interface with modest speeds, to keep the thing fed.
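As a sanity check, rough peak numbers for those interfaces (the per-pin data rates are illustrative 2009-era values, not figures from the post):

```python
def gb_per_s(bus_bits, gbps_per_pin):
    # bytes per second = (bus width in bits / 8) * per-pin data rate
    return bus_bits / 8 * gbps_per_pin

print(gb_per_s(384, 2.0))  # 384-bit GDDR3 @ 2.0 Gbps/pin -> 96.0 GB/s
print(gb_per_s(256, 3.6))  # 256-bit GDDR5 @ 3.6 Gbps/pin -> 115.2 GB/s
print(gb_per_s(512, 2.2))  # GT200-style 512-bit GDDR3    -> ~141 GB/s
```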

    I was at Abrash's talk at GDC, and he said, "I can't talk about how many cores there will be, but we're talking teraflops." He didn't say "around a teraflop" or anything, he said "teraflops."

    Granted, that leaves a lot of wiggle room. Colloquially, someone could say "teraflops" if it's 1.5 teraflops, or if it's 8 teraflops. I don't think LRB is going to be 8 teraflops, but I could see 2. 64 cores at 2GHz would be 2 teraflops, right?
     
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    Ah, how could I forget :lol:
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Double precision, yeah!

    This spreads its FLOPs around, seemingly having a pair of MADs per processor to achieve 1TFLOP (4 FLOPs per core, 2 floating point units):

    http://techresearch.intel.com/articles/Tera-Scale/1449.htm

    Wider SIMDs and 45nm, at least, should make it smaller. It's 100M transistors on 65nm. It's also very toasty at 3+GHz :shock:
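For reference, the 80-tile Tera-Scale chip (Polaris) behind that link reaches its headline figure roughly like this (the tile count and ~3.16 GHz clock are from Intel's published Tera-Scale material, not this post):

```python
tiles = 80      # processing elements on Polaris
fpus = 2        # two floating-point MAC units per tile
fmac = 2        # a multiply-add counts as 2 FLOPs
clock = 3.16e9  # the "3+ GHz" operating point

print(tiles * fpus * fmac * clock / 1e12, "TFLOPS")  # ~1.01
```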

    Jawed
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,332
    Likes Received:
    119
    Location:
    San Francisco
That would be 4 teraflops -> 64 cores * 16 lanes * 2 ops (1 madd per clock) * 2 GHz.
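Spelled out:

```python
cores = 64
lanes = 16    # 512-bit SIMD / 32-bit floats
madd = 2      # one multiply-add per lane per clock = 2 FLOPs
clock = 2e9

print(cores * lanes * madd * clock / 1e12, "TFLOPS")  # 4.096
```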
     
  7. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
L2 was still on the motherboard in the Pentium days (including Pentium MMX). Only L1 was on-die. But yes, L1 was doubled from 16K to 32K on P55C.
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,332
    Likes Received:
    119
    Location:
    San Francisco
IIRC P54C L1 was 8 KB, not 16 KB.
     
  9. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    I think it was 8k code + 8k data, and then expanded to 16k code + 16k data.
    So respectively 16k and 32k in total.
     
  10. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
Disclaimer: gross approximation.
Say the "scalar part" of Larrabee is ~6 million transistors (to support the extra features); that would put the whole core at around 18 million transistors (2/3 of the core being the vector unit). To be safer we could say 20 million. How many transistors do 256K of L2 plus the associated logic amount to? ~2 million?
Could we state that 25 million transistors is a reasonable figure for a complete Larrabee core?
     
  11. crystall

    Newcomer

    Joined:
    Jul 15, 2004
    Messages:
    149
    Likes Received:
    1
    Location:
    Amsterdam
    Only the memory cells of 256 KiB of L2 using 6T SRAM would be over 12 million transistors:

    256 * 1024 (bytes) * 8 (bits) * 6 (transistors) ~= 12.6 million transistors

That's for a non-ECC-protected, non-redundant L2. In practice you cannot do without some kind of data protection and some redundancy, so the actual number will certainly be north of that.
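The same arithmetic as a quick script (raw 6T cells only; tags, sense amps and control logic all come on top):

```python
bytes_l2 = 256 * 1024   # 256 KiB of data
bits = bytes_l2 * 8
print(bits * 6 / 1e6, "M transistors")  # ~12.58 for the 6T cells alone
```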
     
  12. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
Oops, too bad. I made a rough calculation, 256*1000*7 plus healthy rounding (7 instead of 6 to take the extra logic into account), but I didn't convert bytes to bits... :oops:
     
    #232 liolio, Apr 16, 2009
    Last edited by a moderator: Apr 16, 2009
  13. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,505
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    Before ECC and redundant cells, there is the basic parity bit per byte, so...:
    256 * 1024 (bytes) * 9 (bits) * 6 (transistors) :wink:
     
  14. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,018
    Likes Received:
    114
You don't need any additional parity bits if you have ECC. ECC typically requires 8 bits per 64 bits, so the same amount as parity (usually 1 bit per byte). Anyway, if you're willing to skip any kind of error detection you wouldn't need either, so 8 bits per byte is correct.
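Both schemes carry the same 12.5% overhead, which a quick check confirms (a sketch of the bit counting, not a statement about Larrabee's actual array):

```python
data_bits = 256 * 1024 * 8        # 256 KiB of data

parity = data_bits // 8           # 1 parity bit per byte
ecc = (data_bits // 64) * 8       # 8 SECDED check bits per 64-bit word
assert parity == ecc              # identical overhead: 262144 bits

print((data_bits + ecc) * 6 / 1e6, "M transistors")  # ~14.16 with 6T cells
```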
     
  15. Heinrich4

    Regular

    Joined:
    Aug 11, 2005
    Messages:
    596
    Likes Received:
    9
    Location:
    Rio de Janeiro,Brazil
    #235 Heinrich4, Apr 16, 2009
    Last edited by a moderator: Apr 16, 2009
  16. JasonCross

    Newcomer

    Joined:
    Jul 14, 2005
    Messages:
    39
    Likes Received:
    4
    Ah, yes. The one op that is two ops. ;)

Okay, even if you cut it back to 32 cores because of that, 12 cores just seems way, way off. Somebody somewhere is basing their estimate on more modern out-of-order Intel CPU cores or something.

    Even with L2 cache at 12M transistors per core, and the rest of the core at like 20M transistors (we're starting to highball it here), you're still at "only" a billion transistors for 32 cores. You could double that to 64 cores and include the ring bus and texture units and still be, on 45nm, the same size or smaller than GT200 was on 65nm.
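That budget, run through for a few core counts (the per-core figures are the highball guesses above; GT200's ~1.4B transistors on 65nm is the yardstick):

```python
per_core = 12e6 + 20e6   # highballed L2 + rest of core

for cores in (12, 32, 64):
    print(cores, "cores:", cores * per_core / 1e9, "B transistors")
# 12: 0.38B, 32: 1.02B, 64: 2.05B -- and 45nm's roughly 2x density
# over 65nm keeps even the 64-core figure at or below GT200's die area.
```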

    If I had to guess, and this is only a guess, I would think Larrabee will have 64 cores or more in its high-end configuration. Now, clock speeds are a real interesting bit. Who knows? A big, hot, dense, power-hungry chip with deliberately short pipelines spells slower clock speeds. But it's Intel.

    As for graphics performance...I've spoken with Intel folks in the know and they won't tell me how it's looking. They will say that they know they have to be faster than the current cards, because they'll be up against the next-gen from Nvidia and ATI. I believe their catchphrase for their chances against those is, "It would be arrogant of us to say we'll be faster when we don't even know how well those cards will perform."

To me, the real issue Intel will need to step up to the plate on is drivers. The GMA products are famously problematic with lots of games, not just in performance but in compatibility and rendering glitches and such. Their control panels and other desktop software for EVERYTHING (motherboards, etc.) are just TERRIBLE. Ugly, poor interfaces, poor options, and so on. They need to catch up to the years and years of experience and relationships Nvidia and ATI have built with developers and publishers in testing games and fixing graphics glitches, even taking a game that does something "wrong" and making it look right. And they need to deliver the kind of control panel software an enthusiast would expect. It's a tall order from a company that really hasn't done end-user software well, like, ever.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,561
    Likes Received:
    601
    Location:
    New York
    I hope your friends are hedging because that doesn't sound very promising. It seems like the software guys have a long road ahead of them even after hardware is ready. Nobody knows what kind of numbers the thing is going to put up in the end.
     
  18. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Not sure if you've been paying attention recently... but I happen to have a laptop with Intel X3100 graphics... and I found that they have been releasing new drivers almost every month.
    Last year they added full DX10 support (I dedicated a thread to this milestone), and since then they've gradually been fixing bugs and improving performance. It's not perfect yet, but it's improving at an impressive rate.
    The DX10 support is good enough to run Crysis (although my X3100 and 1.5 GHz processor are way underpowered... I guess a nice Core2 Quad with X4500 or so would do much better).
    Aside from that, Intel also offers ClearVideo video acceleration, which looks very nice in PowerDVD.
They are still very weak in OpenGL, mainly because the wgl extensions simply aren't implemented, which means most software can't even initialize. Aside from that, I believe they do support OpenGL 2.1. But I can't run something like Doom3; it just won't start with the missing functions.

At any rate, for me it works fine. I get full Vista Aero, I can even develop my DX10 code on the laptop now without having to resort to refrast, and watching DVDs or other video stuff works great as well.
    I think Intel has been cleaning up its act in the past 2 years.
     
    #238 Scali, Apr 16, 2009
    Last edited by a moderator: Apr 16, 2009
  19. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
I wonder why the Windows people complain so much about Intel IGP drivers. Their Linux drivers, at least, are about as good as drivers get.
     