Larrabee: Samples in Late 08, Products in 2H09/1H10

Discussion in 'Rendering Technology and APIs' started by B3D News, Jan 16, 2008.

  1. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    Well, there is so little information about the vector ISA that I cannot possibly defend it just yet, but depending on the broadcast/permute and unaligned load/store support it could very well be the compiler's dream architecture. I can also see a simple sse2larrabee_vpu.h remapping all your legacy SSE code.
    How many vector registers do they plan to have? I bet it's less than the Xbox360's 128 VMX registers. With branch prediction and mem-ops I honestly don't think we need more than 32 registers.
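    To make the remapping idea concrete, here's a plain-C sketch. Every type and function name below is invented for illustration; the actual Larrabee vector ISA, and any sse2larrabee_vpu.h-style shim, are unknown at this point:

```c
#include <assert.h>

/* Hypothetical 512-bit vector register modelled as 16 SP floats.
   All names here are made up; the real Larrabee types are not public. */
typedef struct { float lane[16]; } v512;

/* What a native 16-wide add instruction would compute. */
static v512 v512_add(v512 a, v512 b) {
    v512 r;
    for (int i = 0; i < 16; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* A legacy 4-wide SSE addps remapped onto the wide unit, the way a
   compatibility header might do it: operate on the low 4 lanes and
   pass the upper lanes through untouched. */
static v512 sse_addps_shim(v512 a, v512 b) {
    v512 r = a;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

    Whether such a shim is actually practical depends on exactly the broadcast/permute and unaligned load/store support being discussed here.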
     
  2. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Arguing about x86 sucking is like yelling at the wind. It doesn't really matter; it is and always will be. Many have tried to compete with it, and they all failed for one reason or another. x86 IS computers and computers ARE x86. At this point you can't separate them. Every nook and cranny of non-x86 in the electronics market is being subsumed. Non-x86 servers and high-end hardware designs are pretty much on their last legs and won't survive another decade.

    Increasingly x86 is taking over the low end and embedded part of the market place and eventually, it too shall be dominated by x86. ISA is dead, long live x86.

    Really, at this point the only two ISAs around are ARM and x86. Anything else is just a curiosity. And ARM certainly isn't going to be expanding outside of its niche.

    Aaron Spink
    speaking for myself inc.
     
  3. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    A niche can expand as well..
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    That doesn't change the fact that, on technical grounds, x86 is not a very compelling choice for graphics. It is market and marketing that drive Larrabee's use of the ISA, not that the ISA is particularly impressive.

    Mostly that no architecture has ever had the amount of design, fabrication, and engineering effort put into it.
    We can thank historical accident and the volume manufacturing of x86 manufacturers for making things as they are.

    I accept that x86's installed base and the process advantage Intel enjoys will be critical to Larrabee's eventual fate.
    That doesn't mean I need to believe slapping a complex and decades-old ISA designed for long-dead programs onto a device intended for future decades is clever, or that it is something to be particularly enthusiastic about.
     
  5. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Agreed. But for such an old architecture it isn't that bad. Old crud aside (memory protection model, segments, call gates), the only thing in the core of the x86 ISA that is detrimental to performance is the condition flags, which are set on just about every ALU op and thus create a whole host of false dependencies. From an assembly programmer's view it is, unlike the 68K, butt fscking ugly. But unlike the 68K it doesn't require heroic effort to build a microarchitecture that will execute it fast.

    With all arithmetic done in the new wide vector registers, you have 16 registers for pointers/counters, which is enough for a lot of codes.

    Agreed, some of the choices were pure luck, born of the necessity to keep it simple. One thing that x86 does right is not exposing the microarchitecture in the ISA; almost everybody else has screwed up here, except Alpha. It also has a sane memory model (in 64-bit and 32-bit 4GB mode), unlike e.g. PowerPC.
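    The false-dependency point can be illustrated with a toy dependency model (purely illustrative C, not real x86 semantics; it assumes a machine that does not rename the flags register):

```c
#include <assert.h>

enum { NREGS = 16, FLAGS = NREGS };  /* one extra slot for FLAGS */

/* depth[r] holds the dependency-chain length behind the latest write to
   register r.  Each op writes one GPR; if writes_flags is nonzero, every
   op also writes the shared FLAGS register, chaining otherwise
   independent ops together.  Returns the critical-path length. */
static int critical_path(const int dst[], int n, int writes_flags) {
    int depth[NREGS + 1] = {0};
    int longest = 0;
    for (int i = 0; i < n; i++) {
        int d = depth[dst[i]];            /* dependence on last writer of dst */
        if (writes_flags && depth[FLAGS] > d)
            d = depth[FLAGS];             /* plus the implicit FLAGS write */
        d += 1;
        depth[dst[i]] = d;
        if (writes_flags)
            depth[FLAGS] = d;
        if (d > longest)
            longest = d;
    }
    return longest;
}
```

    Eight ops to eight different registers have a critical path of 1 without implicit flag writes, but 8 with them. Out-of-order x86 implementations rename the flags to break exactly these chains; the point is that the ISA creates them in the first place.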

    Cheers
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    I think we can credit the x86 ISA for the threading model Larrabee's design has.
    The requirement that a Larrabee core's threads must act like an old x86 core's threads likely led to the design the GPU now has.

    The fact that its threads will always need to act like an old x86 core's threads is an impact I will enjoy evaluating in the future.
     
  7. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    Personally (IMHO) I think that's the whole point of the ISA extensions Larrabee introduces. GPUs have tonnes of parallelism to exploit; you can use whatever transistors you can fit on a die. Intel doesn't want their future CPU to be a small wart glued to the side of a GPU behemoth, so they need a device that consolidates all the computational needs of a future computer. If you make a consolidating device it has to be a superset of functionality.

    A couple of generations down the road you will have Flash, DRAM and one big IC in your computer, and Intel wants to sell you that big IC (and the Flash RAM).

    Cheers
     
  8. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    From what I've heard, you're right on target. My information suggests that Larrabee will have 32 vector registers per thread (each 512 bits wide).

    So, that brings up the question, which is better: 128 128-bit registers or 32 512-bit registers? (Cell and XeCPU are the former, Larrabee is the latter). At least for GPU computations, the smaller number of wider registers seems to make more sense to me.
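    One way to frame the comparison: both layouts hold exactly the same amount of architectural vector state per thread; they differ only in how that capacity is carved up. A quick check (figures from the rumour above, not confirmed):

```c
#include <assert.h>

/* Architectural vector-register bytes per thread. */
static int regfile_bytes(int num_regs, int bits_per_reg) {
    return num_regs * bits_per_reg / 8;
}
```

    It comes to 2 KB either way. The wide carve-up spends fewer instruction bits on register specifiers and does more work per instruction, at the cost of flexibility when data doesn't come in 16-wide chunks.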
     
  9. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I don't see how Larrabee does that, though. It's not because it could run MIMD x86 code that it'd be very good at it - and unless you can power off the Vec16 unit when it's not being used (but the rest of the core is), that'd be a disaster in terms of perf/watt too.

    I really don't think you can get away from the fact the future is heterogeneous; what Intel is doing here is making sure they'll have all the pieces they need in due time to remain very competitive no matter what happens. And that's obviously a very smart thing to be doing no matter whether Larrabee itself turns out to be a market and/or financial success.
     
  10. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    The future is certainly heterogeneous. The question is heterogeneous (big vs small) cores with basically the same ISA (say, x86) or a mixture of big/small cores with *different* ISAs.

    I would say that Intel is betting that the former option gives them some advantages. It might take a few more generations, but if Intel can eventually unify the Larrabee ISA extensions with its "mainstream" x86 cores, it really could push out a "same ISA" heterogeneous chip.
     
  11. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,661
    Likes Received:
    1,114
    All IMO. Larrabee is the first step in the direction of consolidation. Imagine what ordinary programmers could get out of a Larrabee-esque architecture if you could just program it the way you would any other processor, and not through some cumbersome graphics-oriented API (and not targeting a multitude of underlying hardware, each with its own set of bugs/pathological performance cases).


    Execution units have been clock-gated for ages (well >10 years) when unused.

    Single-threaded legacy x86 performance will be abysmal on Larrabee, I'm not disagreeing there. But, as AP mentions, make a heterogeneous chip with a few big fat Vec16-capable Core 2 cores and a bunch of small cores, and it turns into a problem for the OS scheduler.

    Cheers
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    We can look back to the pre-API days to see what it was like to go without those cumbersome APIs.

    The only way the more direct coding method works to prevent a return to those days is: a new x86-type API or abstraction layer, or a single implementation taking up 100% of the consumer graphics market.


    Most likely it will be a problem shared by the OS scheduler, hardware monitoring, the software, the programmer, the compiler, and any VM environment running the software.

    The OS scheduler alone, or any element alone, lacks the needed information to make a determination of the best core to use.
     
  13. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    So, based on all the speculation about the architecture: is it possible at this stage to summarise the leading theories of exactly what Larrabee is comprised of, and therefore what it will be capable of?
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    There are a few Intel presentations and slides that point to a general outline:

    x86
    4-thread in-order core
    expanded vector unit with 64 byte registers (16 SP values) and new instructions
    8-16 DP FP operations per cycle (not sure why there's a range)
    32KB L1
    256 KB L2 per core
    Ring bus capable of 256 bytes per cycle.
    Maximum of 128 GB/sec memory bandwidth.
    The core count ranges from 16-24 on the old slides, with clocks ranging from 1.8 to 2.5 GHz.
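    For a rough sense of scale, here is the peak DP throughput implied by those (unconfirmed) slide numbers:

```c
#include <assert.h>

/* Peak DP GFLOP/s implied by the slide figures:
   cores x clock (GHz) x DP ops per cycle per core. */
static double peak_dp_gflops(int cores, double ghz, int dp_per_cycle) {
    return cores * ghz * dp_per_cycle;
}
```

    That spans roughly 230 GFLOP/s at the bottom of every range (16 cores, 1.8 GHz, 8 DP/cycle) up to 960 GFLOP/s at the top (24 cores, 2.5 GHz, 16 DP/cycle).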

    Some have assumed a core count of 32 lately, perhaps the highest SKU.

    The slides indicate a 1-cycle latency for the L1, and 10 cycles for the L2. The latency is not stated for the ring bus.

    The issue width and any issue restrictions are not entirely certain. Dual-issue seems plausible.
    Dual issue on vector instructions allows the 16 DP throughput without using FMADD.
    The number of vector registers is not confirmed. It's most likely at least 16 and possibly up to 32.

    The threading implementation hasn't been stated publicly. There are four threads, but how work is apportioned between them is not stated.

    Cache associativity is not stated, and the number of cache ports is not known.
    The nature of the vector implementation and issue width influence the port count.
    The L2 seems to be 256 KB per core, but other stories hint that it may be the case that some kind of sharing or read-only access is possible from other cores.

    The caches will be coherent. Whether it's bog-standard coherency or something more evolved is not known. A B3D article with some Intel slides seemed to indicate the cache behavior is more complicated.

    It has been claimed that a cache directory is present to cut down on broadcasts on the ring bus. It has been claimed that Larrabee will support speculative lock elision. Neither claim has been substantiated publicly.

    There have been some presentations hinting at fixed function hardware somewhere on the chip, perhaps for texturing.

    The fates of scatter/gather, fixed-function, FMADD, SSE support, etc. have been debated, but confirmation is lacking at this point.

    Scatter/gather support seems to make sense. It might entail gather buffers to minimize cache pollution.

    FMADD might be in the new extensions, but Intel hasn't been enthusiastic about it in other situations. Dual-issue of vector instructions to two separate FADD and FMUL pipes would have the same DP throughput, but with greater flexibility and no disruption of the semantics of x86 thus far.
    On the other hand, one FMADD unit would be more compact than two separate dedicated units.
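    The throughput equivalence is easy to sanity-check. Per 16-lane vector issue:

```c
#include <assert.h>

/* Flops per cycle on a 16-lane vector unit under the two schemes:
   one fused multiply-add issued per cycle (2 flops per lane), versus
   dual-issuing one multiply and one add per cycle (1 flop per lane each). */
static int fpc_fmadd(int lanes)      { return 2 * lanes; }
static int fpc_dual_issue(int lanes) { return lanes + lanes; }
```

    Both come to 32 flops per cycle per core; the difference is area versus flexibility when the code isn't a pure a*b+c pattern.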

    The exact nature of the new extensions is not really known.


    Another set of slides from Intel hinted at a hypothetical in-order 4-thread vector x86 with an area of 10 mm2, possibly excluding an L2 cache.
    Whether this reflects Larrabee's actual incarnation is something that could be looked into.

    Intel has displayed another in-order x86 recently.
    Silverthorne at 45nm is about 25mm2, including an L2 cache.
    Without the L2, it would drop down to under 20 mm2, and a process transition by the time Larrabee comes out would put it around that hypothetical Intel core.

    Silverthorne may not be more than a rough guide to the philosophy Larrabee has.
    Recent stories indicate that while it is a compact in-order x86, it is only dual threaded and it has a restricted dual-issue.
    Larrabee's vector resources, apparently larger L1 D$, and more expanded threading don't make me think it would have fewer transistors than Silverthorne (minus L2).
    If speculation is right, however, it is possible that Larrabee on 32nm would shrink down to the neighborhood of the hypothetical core.
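    The shrink arithmetic, to first order (ideal optical scaling; real shrinks rarely achieve the full factor, so this is optimistic):

```c
#include <assert.h>

/* First-order optical-shrink estimate: die area scales with the square
   of the feature-size ratio.  Real shrinks rarely hit the ideal factor,
   so treat the result as a best case. */
static double shrunk_area_mm2(double area_mm2, double from_nm, double to_nm) {
    double s = to_nm / from_nm;
    return area_mm2 * s * s;
}
```

    About 20 mm2 at 45nm scales to roughly 10.1 mm2 at 32nm, right in the neighborhood of that hypothetical 10 mm2 core.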

    edit:

    Something occurred to me about Larrabee being x86.

    As an x86 implementation, all the work that goes into Larrabee becomes fair game for AMD to reimplement.
    The interesting facet of the early forms of Fusion is that the features of the GPU are still separate, and might not be transferable to Intel.
    Intel, by implementing Larrabee directly as x86, does not have this exclusivity.

    On the one hand, it means Larrabee in a sense leaves a path for AMD to maneuver.
    On the other, it does negate part of the justification of the ATI purchase if AMD caves and goes for a Larrabee-type solution.
    If later variants of Fusion pull the GPU onto the x86 core, all those GPU widgets fall under the umbrella of an x86 implementation, giving Intel a very broad expanse of new options.
     
    #594 3dilettante, Feb 15, 2008
    Last edited by a moderator: Feb 15, 2008
  15. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Isn't it 2 double precision ops per clock per core? So range is perhaps based on the number of cores.

    Seems like it had better be at least 32 vector registers per thread.

    My prediction is that even with 4-way threading, feeding a single in-order core is going to require lots of registers to keep enough work in flight to hide various latencies (even more so if dual-issue). At some point in the pipeline there is a split between the four input streams (front end) and the single stream of instructions being processed by the core (back end). Without any form of instruction re-ordering, a stall in the back-end pipeline could stall all the input threads. The core might then either flush the stalled thread's instructions (perhaps leaving bubbles in the merged instruction stream) in a serious case like a cache miss, or eat the stall (across all threads) in a minor case like an instruction dependency.
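    A toy barrel-threading model shows the effect (all parameters invented; this is a sketch of a round-robin front end feeding a single-issue in-order back end, not Larrabee's actual design):

```c
#include <assert.h>

/* Toy model: T threads (T <= 8) each retire N ops on a single-issue
   in-order core.  Every Kth op a thread retires is a "miss" that leaves
   that thread unready for LAT cycles.  Each cycle the core issues one op
   from the next ready thread in round-robin order, or a bubble if no
   thread is ready.  Returns total cycles to finish all T*N ops. */
static int cycles_to_finish(int T, int N, int K, int LAT) {
    int done[8] = {0}, ready_at[8] = {0};
    int remaining = T * N, cycle = 0, rr = 0;
    while (remaining > 0) {
        int issued = 0;
        for (int i = 0; i < T && !issued; i++) {
            int t = (rr + i) % T;
            if (done[t] < N && ready_at[t] <= cycle) {
                done[t]++;
                remaining--;
                if (done[t] % K == 0)
                    ready_at[t] = cycle + LAT;  /* thread stalls */
                rr = (t + 1) % T;
                issued = 1;
            }
        }
        cycle++;  /* bubble if nothing issued this cycle */
    }
    return cycle;
}
```

    For the same 100 ops of total work, a single thread eats every stall in full, while four threads overlap most of them and finish in roughly half the cycles.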
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    My description of the flop count is per cycle per core.
    The number of cores wouldn't affect the 8-16 range.

    Intel may not have settled on a flop count at the time, may have wanted to leave things ambiguous, or Larrabee may have some peculiarity in how its peak flops are counted.
     
  17. Mike11

    Regular

    Joined:
    Mar 12, 2008
    Messages:
    250
    Likes Received:
    0
    Why does everyone think Larrabee will be 45 nm? And I mean the Larrabee that everyone can buy, not the architecture sample in 2008. 32 nm is ahead of schedule and should start earlier in 2009 than 45 nm did in 2007. By the end of 1H/10 Intel will most likely produce more 32 nm chips than 45 nm (in Q2, compared to Q3 this generation with 45 nm). So I don't see the problem with a 32 nm Larrabee in, say, Q1/10. And since it will start as a high-end GPU they won't need a lot of them (compared to their CPU volumes it will only be a handful).
     
  18. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Okay, okay, I've got to admit that every single bit of information I've heard in the last few months has been implicitly hinting that the prototype is 65nm or 45nm and that the production chip will be 32nm. BTW, just so I can calculate the number of tflops... what was the consensus for the per-core flops? My understanding has been parallel single-cycle Vec16 ADD and MUL units, but I'm not sure that's accurate.
     
  19. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    It seems that before the delay, Larrabee would have been on 45nm. But with the delay, it seems that 32nm is possible. I guess this brings up the question: what caused the delay? Was the first version of Larrabee too slow? Did they feel they needed more time to optimize the software (or hardware) on a prototype before turning it into an actual product? Or was it simply that the hardware design fell behind schedule?

    I don't have any information, but I suspect it is some combination of the above causes. With such a new system, having a 45nm prototype for software developers to work with before launching a product at 32nm with fully tuned software (including physics acceleration) might make lots of sense.
     
  20. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    32 cores on 45 nm sounded unrealistic anyway..
     