Larrabee at GDC 09

Discussion in 'Architecture and Products' started by bowman, Feb 16, 2009.

  1. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Bugs linger on in the Windows drivers.

    In a dual-screen set-up, for instance, the Intel driver keeps resetting the monitor placement, so you're almost forced to put the screen on the side the Intel driver wants it on.

    Overlay hardly ever works (or it resets the colour values to 0,0,0 on QuickTime playback), things like that. They're utter, utter crap.
     
  2. Heinrich4

    Regular

    Joined:
    Aug 11, 2005
    Messages:
    596
    Likes Received:
    9
    Location:
    Rio de Janeiro,Brazil
    You're right; perhaps I don't understand it correctly, because this link says this:

    "The first encounter with the photograph of the silicon wafer containing the Larrabee video chip confirmed that the die area of these chips would not exceed 300 mm². However, a clearer photograph later appeared on the net, which allowed some to estimate the die area of Larrabee.

    Colleagues note that the die area of these chips is within the limits of the expected values.

    [photo: Larrabee wafer, 45 nm]

    A wafer with a 300 mm diameter holds approximately 86 Larrabee chips, although some sources cite another number: 64.

    [photo: Larrabee wafer, 45 nm]

    One way or another, the die area of the larger Larrabee version could in this case reach 600 mm². Wolfdale processors, with a die area of 107 mm², have 410 million transistors. A chip almost six times larger could hold about 2 billion transistors. One would hope that all this potential will be used effectively. As Intel representatives have explained, production Larrabee parts will appear at the beginning of next year."
     
  3. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,505
    Likes Received:
    424
    Location:
    Varna, Bulgaria
    On a second look at the wafer shot, considering that there is approximately 2 mm of edge clearance intended for handling, alignment and safety marks, the estimated die size actually comes pretty close to 617 mm².
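
    To sanity-check that number, here's a minimal die-per-wafer sketch in Python, assuming a 300 mm wafer, the ~2 mm edge exclusion mentioned above and the usual gross-die approximation; none of these figures come from Intel.

    Code:
    import math

    def gross_dies(wafer_diameter_mm, die_area_mm2, edge_exclusion_mm=2.0):
        """Classic approximation: usable wafer area / die area, minus an
        edge-loss term proportional to the circumference."""
        d = wafer_diameter_mm - 2 * edge_exclusion_mm
        return (math.pi * (d / 2) ** 2 / die_area_mm2
                - math.pi * d / math.sqrt(2 * die_area_mm2))

    print(round(gross_dies(300, 617)))  # ~85 candidate dies, close to the ~86 counted on the wafer shot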
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    According to this:

    http://portal.acm.org/citation.cfm?id=1413409

    which I can no longer access (luckily I managed to get the PDF a while back), the FMAC in the 80-core Terascale chip has a latency of 9 cycles.

    Jawed
     
  5. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,664
    Likes Received:
    184
    On this 600 mm² to 617 mm² Larrabee chip, how many cores are we talking about: 12? 16? 24? 32? 48? 64?
     
  6. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
    Who knows. They haven't said much about core size, or ring-network die space, or which/how many memory controllers there will be... So write the numbers on a piece of paper and throw a dart; that's as good a way of guessing as any. :lol:
     
  7. V3

    V3
    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    3,304
    Likes Received:
    5
    Using Intel's estimate of around 10 mm² per core, 48 at best, or 16 if it is just a test chip. So it's probably 32 or 24, perhaps with some cores disabled.

    I doubt it's 64. At that die size it would be a tight fit even for a Cell with 64 SPUs, and the Larrabee cores are probably at least twice as large as those SPUs, or even more.
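
    A quick sketch of that arithmetic (Python; the ~10 mm² per core figure is the one quoted above, and the share of the die eaten by non-core logic is a pure guess):

    Code:
    def estimated_cores(die_area_mm2, core_area_mm2=10.0, uncore_fraction=0.3):
        # Assume some fraction of the die goes to the ring bus, texture units,
        # memory controllers and pad area rather than to cores.
        return int(die_area_mm2 * (1.0 - uncore_fraction) // core_area_mm2)

    for die in (300, 450, 617):
        print(die, "mm^2 ->", estimated_cores(die), "cores")
    # 300 -> 21, 450 -> 31, 617 -> 43; with a smaller uncore share the big die approaches 48.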
     
  8. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,724
    Likes Received:
    194
    Location:
    Stateless
    So a Larrabee core could be well above 40 million transistors if we split all the extra logic (ring bus, texture units, memory controllers, etc.) evenly among the cores.
    Intel could still manage ~20 cores within the same transistor budget as an RV740 and be in the same ballpark as ATI with regard to the theoretical peak figure (~1 TFLOPS). The problem for Intel is transistor density: it looks like they are at a disadvantage against TSMC's 40 nm process (someone brought up the figures earlier), so they would end up with a bigger chip. Is that right?

    I have some questions now that more material about Larrabee is public.
    It looks like the Larrabee cores have ended up pretty big, so what do you think of some of the design choices? (I'm calling on armchair experts :lol:).
    Randomly (off the top of my head; I don't know much, so don't be mean if some questions are stupid):
    What do you think about the VPU width? Would 8-wide or less have been better in some regard (cache-line size vs. vector width, better use of gather? // impact on clock speed? // narrower units might be made cleverer?)?
    Would it make sense to have less than 256 KB of L2 per core, or do you think that's pretty optimal?
    Intel stated that L2 latency will be around 10 cycles; is this critical for the kind of workloads Larrabee will run (especially as hyper-threading may hide some latencies)? Could Intel save on the power budget here with a slower L2 cache // clock the chip higher?
    Could the inclusion of an L0 cache help the design while taking pressure off the L1 cache (I mean Intel could design the L1 with higher latency and save on power consumption // clock the chip higher)?
    Overall, should Intel have gone with narrower but more power-efficient cores and a higher clock?

    I'm not implying that Intel made wrong choices, or that I have enough knowledge to criticize their choices, but I've seen very little criticism of the design, so I wonder what you guys would have changed.
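
    For reference on that ballpark, a minimal peak-throughput sketch (Python). The 16-wide VPU is the one discussed in this thread; counting a multiply-add as two flops per lane per clock is the usual convention, and the core counts and clock speeds below are just assumptions.

    Code:
    def peak_gflops(cores, clock_ghz, lanes=16, flops_per_lane=2):
        # flops_per_lane = 2 counts a fused multiply-add as two flops.
        return cores * lanes * flops_per_lane * clock_ghz

    print(peak_gflops(cores=20, clock_ghz=1.5))  # 960.0 GFLOPS, roughly RV740 territory
    print(peak_gflops(cores=32, clock_ghz=2.0))  # 2048.0 GFLOPS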
     
  9. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Could be, but it is hard to tell.
    The process size (40 nm or 45 nm) refers to the smallest possible feature.
    This may or may not reflect how the average feature size compares. For example, if you compare Intel and AMD processors on the same process size, the average transistor count per area is not the same (nor are other aspects, for that matter, such as leakage, power dissipation, clock speeds etc).
    This partly has to do with the architecture itself (e.g. cache can be very 'compact', whereas more complex logic may have a lower transistor count per area), but also partly with tweaking the design to get the best possible yields from the manufacturing process, or perhaps to have more favourable power efficiency, etc.

    Intel's process is obviously very mature (has been in mass-production for quite a while, with very good yields), and Intel also uses new materials like hafnium, which I don't think TSMC will use.
    I think it may be very close.

    I personally want to wait until I see some actual software running on the thing.
    Since the whole rendering algorithm is different from what nVidia and ATi use, it's really hard to compare performance by just looking at some vague specs.
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The 80-core Terascale/Polaris uses 100M transistors to implement a "cache-less", grid-communicating array of 160 FMACs that can run at multiple GHz. Each core is only 2 FMACs, which means there's practically no SIMD-wise saving.

    The 16 VPU lanes of each Larrabee core could cost no more than 0.5M transistors each. The design appears to be deliberately cheap, with the only concessions being L1 routing and 4 threads (and, consequently, a register file that is large by Intel's standards).

    I guess we're looking at cores costing about 25M transistors (excluding L2).

    Jawed
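
    Putting those guesses side by side (a sketch using only the numbers from this thread plus the standard 6-transistor SRAM cell; none of these are Intel figures):

    Code:
    vpu = 16 * 0.5e6              # ~0.5M transistors per VPU lane, as guessed above
    core_excl_l2 = 25e6           # per-core guess above, excluding L2
    l2 = 256 * 1024 * 8 * 6       # 256 KB slice of 6T SRAM, ignoring tags/ECC
    print(f"VPU ~{vpu / 1e6:.0f}M, rest of core ~{(core_excl_l2 - vpu) / 1e6:.0f}M, "
          f"L2 ~{l2 / 1e6:.1f}M, total ~{(core_excl_l2 + l2) / 1e6:.0f}M")
    # VPU ~8M, rest of core ~17M, L2 ~12.6M, total ~38M -- close to the
    # ~40M-per-core figure liolio arrived at above.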
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,346
    Likes Received:
    3,864
    Location:
    Well within 3d
    Larrabee's starting point is an x86 core, with standard caches, extra vector units, and some claim its RTL comes directly from an already existing core.

    Given the context of the x86-based design, the cost of narrowing VPU width is that the peak numbers can only be brought up by expanding some rather expensive things.

    Clock speed: probably already pretty high for an x86 core with a short pipeline.

    Extra SIMD units: wider instruction decoding and wider superscalar issue = additional transistor cost and further modification of the base design.
    More cache ports would be needed, along with more resources outside of the FP ALUs to handle the extra address calculations and emulation-loop overhead.
    If we go back to the premise that Larrabee's cores are leveraging an already existing design, these are not going to be added.

    Extra cores = extra cores and all the corresponding hardware resources, extra cache tiles, and heavier demands on the ring bus

    The width of 16 probably sits near an optimum point of average utilization and hardware cost.

    There are probably loads that might want more, and others that want less.
    The competition can't match Larrabee when it comes to capacity in the L2.
    L1 and register file capacity on Larrabee is inferior in comparison to the register files of GPUs, though.
    I suppose testing in the real world will tell, but the software infrastructure should be able to tune itself to the constraints of the L2's capacity.

    I'd wonder if the cache structures are the limit in either timing or power.
    The vector and texture units would probably dominate.

    Given that Larrabee leverages an existing design, injecting random new cache levels would not be an option.

    Given the constraints that were imposed on the design, Larrabee is probably in line with the heavier per-core cost and more flexible but higher cost memory subsystem.
     
  12. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,331
    Likes Received:
    118
    Location:
    San Francisco
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I find it puzzling that this algorithm wasn't created before chip design even started. It may not be the cornerstone of the architecture, and it doesn't seem like it was much of a sweat to draw up, either.

    How was the choice of 16-wide VPU made and was rasterisation, per se, an important factor? I suppose if you want "square" powers of two then between 4, 16 and 64 there really isn't much choice!

    And it leads smack into the divergence penalty. Looking at fig 29:

    http://www.ddj.com/hpc-high-perform...ionid=CTL4XDMAKYILUQSNDLOSKHSCJUNN2JVN?pgno=6

    that triangle covers 5 tiles, 15 pixels and 7 quads. It's shading 80 pixels to produce 15 results, 19% utilisation. If the quads were packed (i.e. a bit of conditional routing) then it'd be shading 32 pixels for 15 results, 47% utilisation.

    Alternatively, if the rasteriser works on strips/patches of triangles, then the tile masks will tend naturally towards being all set as multiple triangles per tile are shaded concurrently. Or, rather, if the setup engine takes tile masks for contiguous triangles then it can create shading tiles that are maximally populated.

    That's then a sorting problem to find all the triangles in the bin set that are contiguous (or at least non-overlapping). Or it could be just a naive "does this triangle share an edge with the prior triangle?" test.

    Jawed
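
    A tiny sketch of that utilisation arithmetic (Python), assuming 16-lane qquads and the figure-29 numbers quoted above:

    Code:
    import math

    covered_pixels = 15
    covered_quads = 7
    qquads_unpacked = 5                        # one 16-lane qquad per touched tile
    qquads_packed = math.ceil(covered_quads / 4)

    for label, qquads in (("unpacked", qquads_unpacked), ("packed", qquads_packed)):
        lanes = qquads * 16
        print(f"{label}: {lanes} lanes shaded for {covered_pixels} pixels, "
              f"{covered_pixels / lanes:.0%} utilisation")
    # unpacked: 80 lanes, 19% -- packed: 32 lanes, 47%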
     
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,008
    Likes Received:
    535
    I find it hard to believe the algorithm wasn't already well known to the Larrabee developers given Intel's history with tilers; it's a rather trivial variation of the way PowerVR works, for instance (and probably all tilers).
     
  15. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,433
    Likes Received:
    181
    Location:
    Chania
    Does TBR even work on their IGPs to date? Afaik it's even a hw implementation (that sucks).

    ***edit: where I forgot to note that the chipset team doesn't have much to do with the LRB team afaik.

    Obviously there will always be similarities as of course differences that aren't always visible on first sight.

    Dumb question: is there any indication that LRB's driver could sort triangles? (No, not PowerVR-related, LOL; Olick mentioned something about triangle sorting in his SIGGRAPH presentation and I'm just curious...)
     
  16. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,331
    Likes Received:
    118
    Location:
    San Francisco
    Actually this is quite simple to solve in software: you just have to pack 4 non-empty quads into a qquad.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,558
    Likes Received:
    600
    Location:
    New York
    Yeah but by the time you've figured out which quads are non-empty you've already paid the price.
     
  18. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,331
    Likes Received:
    118
    Location:
    San Francisco
    The rasterization algorithm already determines which quads are empty and which are not :) So it's just a matter of filling a qquad with non-empty quads (as much as possible) and gathering/scattering (or performing multiple loads/stores, one per quad, whichever is faster) the quads into the proper place. It could potentially be a big win if you have lots of small triangles.
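
    A minimal sketch of that packing step (Python), under the assumption that the rasteriser hands back a coverage mask per 2x2 quad; the data layout here is invented purely for illustration.

    Code:
    def pack_qquads(quads):
        # quads: list of (quad_position, 4-bit coverage mask) pairs.
        # Drop fully empty quads, then group the survivors four at a time
        # so each 16-lane qquad is as full as possible.
        live = [q for q in quads if q[1] != 0]
        return [live[i:i + 4] for i in range(0, len(live), 4)]

    # Figure-29 style example: 7 covered quads (15 pixels) scattered over
    # 5 tiles end up in 2 qquads instead of 5.
    example = [((0, 0), 0b0001), ((1, 0), 0b0011), ((2, 0), 0b0111),
               ((0, 1), 0b1000), ((1, 1), 0b1111), ((2, 1), 0b1110),
               ((3, 1), 0b0001)]
    print(len(pack_qquads(example)))  # 2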
     
  19. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,900
    Likes Received:
    2,225
    Location:
    Germany
    Maybe someone can shed light on whether this would also affect multisampling, since the render targets have to be sampled at a much finer granularity, in which case 32 bits might not prove enough. Or would the max tile size just have to be reduced by a factor according to the MSAA level?
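
    One way of looking at the second question (a rough sketch in Python; the 256 KB per-core L2 is the figure mentioned earlier in the thread, but the 8 bytes per sample and the idea of simply shrinking the tile to fit are just the obvious arithmetic, not anything Intel has confirmed):

    Code:
    def max_tile_side(budget_bytes, msaa_samples, bytes_per_sample=8):
        # Per pixel: msaa_samples x (4-byte colour + 4-byte depth/stencil).
        pixels = budget_bytes // (msaa_samples * bytes_per_sample)
        return int(pixels ** 0.5)

    for samples in (1, 2, 4, 8):
        print(f"{samples}x MSAA -> ~{max_tile_side(256 * 1024, samples)}^2 pixel tile")
    # 1x -> 181^2, 2x -> 128^2, 4x -> 90^2, 8x -> 64^2: the tile edge shrinks
    # roughly with the square root of the sample count.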
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,558
    Likes Received:
    600
    Location:
    New York
    I thought Jawed was still referring to the rasterization process when he said "shading", since the 80 pixels in figure 29 obviously won't all be sent on to the pixel-shading stage regardless of the rasterization algorithm employed. Only 7 quads, or 2 qquads, should be sent to the pixel-shading stage for this triangle.
     