Larrabee: Samples in Late 08, Products in 2H09/1H10

Discussion in 'Rendering Technology and APIs' started by B3D News, Jan 16, 2008.

  1. B3D News

    B3D News Beyond3D News
    Regular

    Joined:
    May 18, 2007
    Messages:
    440
    Likes Received:
    1
    When Doug Freedman asked Paul Otellini about Larrabee, we didn't think much would come out of it. But boy were we wrong: Otellini gave an incredibly to-the-point update on the project's timeframe. Rather than try to summarize, we'll just quote what Otellini had to say here.

    Read the full news item
     
  2. zsouthboy

    Regular

    Joined:
    Aug 1, 2003
    Messages:
    563
    Likes Received:
    9
    Location:
    Derry, NH
    That's sooner than I had guessed.

    Hmm.
     
  3. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    Interesting.

    So is Larrabee a pure x86 core array, or does each core have co-processing and fixed-function hardware with it?

    If it's pure x86, I don't see it doing much better than an 8-core Nehalem.

    (edit: by DX11 functionality, I mean floating-point co-processors and fixed-function units that allow DX functionality to map onto the architecture.)

    However, as a hybrid multi-core chip, whereby you get DX11 graphics functionality but with so much more bells-and-whistles functionality from the x86 cores, it could be awesome, no? Imagine how amazing it could be for a graphics subsystem not to have to go back to the main CPU and main memory for any of its calculations.

    The main CPU would just become a co-processor as and when needed, without in any way being slowed down by the PCIe bus and the rest of the system. Plus, I wouldn't be surprised if they did a 4GB workstation card in 2010, and then have scalability for massive farm rendering and scientific data calculation.

    I really think it could be awesome if that's the kind of direction it is taking.

    The main issue, of course, is how you render a real-time scene without having to duplicate the data sets in memory. And if there were a way to distribute the rendering, how would you avoid massive latency and dependencies between cores during rendering?

    Does it take a paradigm shift to find the answers?
     
    #3 kyetech, Jan 16, 2008
    Last edited by a moderator: Jan 16, 2008
  4. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    But 8 core Nehalem is only rumored to be capable of ~200 DP GFLOPs, whereas Larrabee is north of a TFLOP.
     
  5. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    Some Intel slides have Larrabee as an array of 16-24 in-order, 4-way (SMT, SoEMT, FMT?) cores that support x86 and an expanded vector instruction set. The expanded vector unit(s?) run on 512-bit registers.

    Fixed-function hardware is something of an enigma. I've seen slides and rumors going either way on this, and it may be that some variants of the design won't have any.
     
  6. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    @ShaidarHaran:
    Will the 1TF be as useful as the 200GF?
    Will it be enough on its own to warrant having the new architecture?
    Will DX11 have enough flexibility that a 2TF (and even more in SLI) graphics card of 2010 wouldn't suffice?
    Will 1TF be enough to enable real time rendering on it, without the use of fixed function or co-CPUs?

    @3dilettante: well, it's going to be very interesting to see what you actually get for your money then. Two (or more) variants make a lot of sense though, to compete well in different spaces, rather than be a jack-of-all-trades, good-at-none type of system.
     
  7. Killer-Kris

    Regular

    Joined:
    May 20, 2003
    Messages:
    540
    Likes Received:
    4
    It all depends on what you want to do.

    N-queens or knights tour, probably not.
    Rasterization or raytracing, probably so.
     
  8. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    From what I know it has both: an array of x86 cores and fixed-function HW. The FF should be for texture sampling, I assume; it is awfully expensive to do on a programmable CPU. Basically this is what Sony used to show that RSX can achieve around 1TF of computing power: they simply took the sampling HW and calculated how many instructions a regular CPU/GPU would take to calculate the same thing in software :)

    Btw, can anyone make a rough guess at how many instructions it takes to take a 16x anisotropic sample from a 3D texture?
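    I can't give an exact number, but as a rough illustration of why it's expensive (my own sketch in plain C, nothing from any Larrabee material), here is just a single bilinear tap from a 2D texture; a 16x anisotropic lookup from a mipmapped 3D texture needs dozens of such taps plus the LOD and footprint math on top:

    ```c
    /* Rough illustration only: one bilinear tap from a single-channel 2D
       float texture. A 16x aniso fetch from a 3D mipmapped texture needs
       many such taps plus LOD/footprint math, so the count balloons fast. */
    typedef struct { int w, h; const float *texels; } Texture;

    static float texel(const Texture *t, int x, int y)
    {
        /* clamp addressing; real samplers also handle wrap, borders, formats */
        if (x < 0) x = 0; else if (x >= t->w) x = t->w - 1;
        if (y < 0) y = 0; else if (y >= t->h) y = t->h - 1;
        return t->texels[y * t->w + x];
    }

    float sample_bilinear(const Texture *t, float u, float v)
    {
        float fx = u * t->w - 0.5f, fy = v * t->h - 0.5f;
        int   x0 = (int)fx,         y0 = (int)fy;
        float ax = fx - x0,         ay = fy - y0;

        float s00 = texel(t, x0,     y0);
        float s10 = texel(t, x0 + 1, y0);
        float s01 = texel(t, x0,     y0 + 1);
        float s11 = texel(t, x0 + 1, y0 + 1);

        /* two lerps in x, one in y: ~10 flops plus all the address math above */
        float top = s00 + ax * (s10 - s00);
        float bot = s01 + ax * (s11 - s01);
        return top + ay * (bot - top);
    }
    ```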
    Why so? It is not as if GPUs run a lot of OoO code with lots of random reads from RAM.

    I see no reason why it would be any different than it is today.
    Actually, I have no idea what Nehalem's speed will be, but I do know that Intel said an 8-core Gesher* at 4GHz can achieve that speed (page 31). Larrabee is stated to be at 1.7-2.5GHz with 16-24 cores, achieving 0.2-1TF/s. I remember some slides showing that Larrabee will be at 48 cores in 2010. Assuming a somewhat higher clock speed to go with that, I wouldn't be surprised if we had around 2-3TF to play with in 2010 :)


    *) Got the PDF before they castrated it, yay for browser caches ;)
    Assuming that texture sampling is still done (mostly) in dedicated HW, I'd say yes, you can do pretty good real-time rendering on Larrabee.


    Anyway, here should be pretty much all the information about Larrabee known so far. If there is something missing, just tell me :)


    [edit]

    I remember some link showing that Larrabee has two memory controllers on the chip, at opposite ends of the ring bus, and the fixed-function hardware was there as well. That sounds quite logical too, assuming that sampling is done there. When a core asks for a sample it likely has to go through the memory controller anyway.

    [edit2]
    Duh, it was in that same PDF, on page 16 :D
     
    #8 hoho, Jan 17, 2008
    Last edited by a moderator: Jan 17, 2008
  9. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,404
    Likes Received:
    168
    Location:
    Chania
    Theoretical maximum G- or TFLOPs are a chapter of their own, if one considers the gap between the theoretical maximum on paper and real-time throughput.

    Irrespective of the above, I'd be very surprised if both GPU vendors hadn't exceeded the 2TFLOP mark a lot earlier than some would speculate, for something that is to arrive in late 2009 at the earliest.
     
  10. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    True, but assuming that Larrabee has somewhat similar efficiency to Cell's SPUs, it might get quite close to peak performance, at least under a graphics workload that gets optimized to oblivion :)

    I wouldn't be so sure about that. More than a year ago we had less than 0.5TF with the 8800GTX. What is the peak today (IIRC not much over 0.5TF)? The next generation will be close to around 1TF, but it will likely not be available before the middle of this year. Add 1.5 years to that and, with a doubling of speed, we will have 2TF by the end of 2009, around the same time as Larrabee's supposed release.

    Of course this is simple linear scaling and the real world can work in a whole different way :)
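    Spelling the extrapolation out (same assumptions as above: ~0.5TF at the end of 2006 and a doubling roughly every 18 months):

    ```c
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* naive extrapolation: ~0.5 TFLOPS (8800GTX-class) at end of 2006,
           doubling roughly every 18 months */
        const double base_tflops = 0.5;
        for (int months = 0; months <= 36; months += 18)
            printf("end of 2006 + %2d months: ~%.1f TFLOPS\n",
                   months, base_tflops * pow(2.0, months / 18.0));
        return 0;   /* 0.5 -> 1.0 -> 2.0 TFLOPS by the end of 2009 */
    }
    ```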
     
  11. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,404
    Likes Received:
    168
    Location:
    Chania
    In a synthetic, arithmetic-only application? I don't think I'd particularly care about such a rate as a gamer. IMHLO, IHVs should claim X fillrate with Y GFLOPs under Z% shader load.

    What's a "next generation" to you exactly? I wouldn't be in the least surprised if vendors come close to that mark even within this year.

    And no, the time that has passed since the G80 release (Nov '06) isn't an indication against it but rather for it. Take a rating like the one I suggest above (or at least one that makes sense and gives results closer to real-time throughput) and show me how much performance rose between G71 and G80. Even if you simply take the marketing-wash GFLOP number, it was 150 GFLOPs for the 7900GTX and 518 for the 8800GTX. That's roughly a 3.5x increase in only three quarters, within the same year.

    All I'm saying is that you shouldn't underestimate GPU vendors "that" much.
     
  12. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    No, under a real-world game that is shader limited. I don't think there is much point in talking about non-shader-limited scenarios when we talk about GFLOPs.

    I'm not talking about traditional multi-GPU setups, whether on one board or two. R700 with its multi-chip design counts as a single GPU to me. If we start talking about multi-GPU solutions, I'm sure one can put several Larrabees in one box too.

    Well, we could also go back to the 6800 series and see how much performance has increased per year since then. The average is far from the jump G80 gave.
    I admit I didn't count the "missing mul" on G80 and thus said "under 0.5TF". From what I've understood it is quite difficult, if not impossible, to use it. Of course, if Larrabee has similar limitations its peak will be (much) higher than what is achievable in the real world.

    I'm just saying I'm not hoping too much for them to keep on tripling GPU speed at three-month intervals ;) I'd be happy if they can keep doubling speed once a year.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    One thing to note is that Larrabee's tentative TFLOP is in double precision.
    If Larrabee follows the x86 tradition, that means single precision would have double the throughput.

    RV670 seems to have a similar cut in DP throughput, with the exception that the complex unit doesn't do DP and DP doesn't mix with an FMADD.
     
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,404
    Likes Received:
    168
    Location:
    Chania
    Then you should really go back, look at the exact timelines, and start estimating when it's time for the next major increase, I guess.

    I didn't deduct any GFLOPs from the G71 in my former comparison, did I? But if one goes deep into each of the two pipelines and sees what has been de-coupled since then, the resulting factor in performance increase is huge nonetheless. If one is even meaner and tortures both with unoptimized AF, it gets into ridiculous territory. In a scenario I considered worst case, the G71 was losing roughly 50% going from default "quality" to "high quality" AF, whereas the G80 was only losing about 18%.

    What you fail to realize here is that every 2-2.5 years there's a major increase in performance. No one said three-month intervals, and NV's or AMD's spring lineup doesn't look like "twice the performance" compared to G80.
     
  15. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Interesting. Something doesn't add up though. Gesher is a full generation on from Nehalem, which is itself rumored to be 10-25% faster per core than Penryn. With Penryn we already have north of 100GFLOPs with 4 cores. Why would Gesher go backwards in terms of per-core performance compared to an existing architecture, let alone Nehalem?
     
  16. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    It would be a lot easier if we plotted a timeline graph of GPU speed increases over the years. I propose starting from NV40 to get a bit better idea. Unfortunately I don't know the exact GFLOP ratings of the earlier series, so someone else has to provide them. Information about the later series has already been provided.

    [edit]

    I highly doubt Nehalem has many improvements that can raise the theoretical peak DP SIMD throughput. Also, I think you have mixed up DP and SP. To achieve 100 DP GFLOPS, a quad-core Penryn would need to clock at around 100GF / 4 cores / 4 flops per cycle = 6.25GHz. Gesher likely achieves its 200 DP GF with 4 DP MADD instructions per cycle or with double-width SIMD. Though the numbers Intel gave in that PDF don't match that for some reason (7 instructions per cycle with SIMD?)
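    The same back-of-the-envelope math in code form (my assumptions: SSE does a 2-wide DP add and a 2-wide DP mul per cycle, i.e. 4 DP flops/cycle/core and twice that in SP, on a ~3.2GHz quad-core Penryn):

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* assumed Penryn-class core: 2-wide DP SSE add + 2-wide DP SSE mul
           per cycle = 4 DP flops/cycle/core, 8 SP flops/cycle/core */
        double penryn_dp = 4 /*cores*/ * 4 /*flops/cyc*/ * 3.2 /*GHz*/;
        double penryn_sp = 4 * 8 * 3.2;
        /* Gesher figure from the slides: ~200 DP GFLOPS from 8 cores at 4GHz */
        double gesher_flops_per_cycle = 200.0 / (8 * 4.0);

        printf("Penryn quad, DP: ~%.0f GFLOPS\n", penryn_dp);    /* ~51  */
        printf("Penryn quad, SP: ~%.0f GFLOPS\n", penryn_sp);    /* ~102 */
        printf("Gesher implied DP flops/cycle/core: %.2f\n",
               gesher_flops_per_cycle);                           /* 6.25 */
        return 0;
    }
    ```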
     
    #16 hoho, Jan 17, 2008
    Last edited by a moderator: Jan 17, 2008
  17. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    Larrabee has almost no special-purpose graphics hardware

    Larrabee is a bunch of x86 cores (from what I hear, 32 cores with four threads each). However, these cores are augmented with 64-byte vectors, which is significantly wider than 16-byte SSE-style vectors. A 64-byte vector is 16 single-precision floating-point values.

    If the vector unit is fully pipelined to allow a throughput of one vector operation per cycle, that is 16 flops/cycle * 32 cores = 512 flops/cycle. At 2GHz, that is 1 Tflop/second of raw throughput. It would be even higher if you count fused multiply-add instructions.
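    Spelled out (same figures as above; the FMA doubling at the end is just the straightforward extension):

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* figures from the post: 32 cores, 16-wide SP vectors, ~2GHz */
        int    cores    = 32;
        int    sp_lanes = 16;                 /* 64-byte vector = 16 floats */
        double ghz      = 2.0;

        double flops_per_cycle = (double)cores * sp_lanes;        /* 512   */
        double tflops          = flops_per_cycle * ghz / 1000.0;  /* ~1.02 */

        printf("%.0f flops/cycle -> ~%.2f SP TFLOPS at %.1f GHz\n",
               flops_per_cycle, tflops, ghz);
        printf("counting fused multiply-add as 2 flops: ~%.2f SP TFLOPS\n",
               tflops * 2.0);
        return 0;
    }
    ```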

    From what I understand, the only special-purpose graphics hardware in Larrabee is a set of vector instructions specifically tailored for the inner loops of graphics calculations. That is, there are no special GPU hardware blocks, just special graphics instructions. This is a real departure from how other GPUs work, but, if it works, it could really change the way we think about GPUs (that is, GPUs just become multi-core CPUs!)

    That is why Larrabee is such a threat to NVIDIA. Larrabee converts graphics processing from the special-purpose hardware domain to *the* killer application for many-core processors.

    This should be an interesting fight to watch.
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,946
    Likes Received:
    2,370
    Location:
    Well within 3d
    It would probably have to be fully pipelined for most, if not all, such ops.
    Otherwise, if Larrabee uses SMT, one thread is going to block the other three threads.
    If it uses round-robin FMT or something similar, there's an implied need for a single-cycle switch-over.

    If there are specialized instructions, it would hint that there is some kind of specialized hardware to go with them.
    I guess Intel could just run them all through microcode and synthesize them with dozens of standard ops, but that seems wasteful.

    Specialized hardware can accomplish a lot of things that either take more time or, more critically for a massive multi-core design, more power on general hardware.
    If we multiply that over 24 cores, things get iffy.

    Intel could be sprinkling some more specialized hardware somewhere in Larrabee, if only for that reason.
     
  19. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    I totally agree.

    Let me clarify my previous post. What I intended to say is that there are new instructions *and* the special-purpose hardware ALUs to support those instructions. However, unlike current GPUs, there is no *other* special hardware. No fragment pipeline or special z-buffering frame-buffer hardware (or whatever else GPUs have today). Just many x86 cores with extra vector ALUs for executing the new instructions tailored for graphics processing.

    The key difference is the programming model. For Larrabee, a program can just use inline assembly (or library calls) to insert these vector operations into a regular program. There is no special setup or other low-level, implementation-specific poking of the hardware to get the special-purpose hardware going. Just as SSE isn't conceptually difficult to add to a program (assuming it has the right sort of data parallelism), these vectors will be similarly easy to use.
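    To make that concrete with something that exists today: here is a trivial SAXPY loop written with ordinary 4-wide SSE intrinsics in C. The claim is that Larrabee's 16-wide vectors would drop into regular code in the same way, just with wider registers and some graphics-flavored instructions (the 16-wide part is speculative, since Intel hasn't published those intrinsics):

    ```c
    #include <xmmintrin.h>   /* SSE intrinsics: 4 floats per 128-bit register */

    /* y[i] += a * x[i], vectorized 4 floats at a time.
       Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
    void saxpy_sse(float a, const float *x, float *y, int n)
    {
        __m128 va = _mm_set1_ps(a);                  /* broadcast the scalar */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(&x[i]);          /* load 4 floats        */
            __m128 vy = _mm_load_ps(&y[i]);
            vy = _mm_add_ps(vy, _mm_mul_ps(va, vx)); /* 4 muls + 4 adds      */
            _mm_store_ps(&y[i], vy);
        }
    }
    /* A Larrabee-style version would look the same, only stepping by 16 and
       using whatever 512-bit types and intrinsics Intel ends up exposing. */
    ```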

    Another key point is that Larrabee has coherent caching (just like Intel's other multi-core systems). Unlike a GPU, which requires explicit commands to move data around the system and/or flush caches at the right time, all of that is done seamlessly in Larrabee. Instead of burdening the programmer with worrying about all these issues, Larrabee really is just a shared-memory multiprocessor on a chip.

    Beyond providing a familiar programming model for development of OpenGL/DX drivers and other low-level software, this also allows for more dynamic algorithms that can share data and synchronize more finely using locks and such. As advanced graphics algorithms are becoming less and less regular, such support could really provide a big boost in what kinds of tasks and algorithms Larrabee can tackle.
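    A generic shared-memory sketch (nothing Larrabee-specific, just the idiom being described): on a cache-coherent chip, worker threads can append into one shared structure behind a lock, with no explicit copies or cache flushes:

    ```c
    #include <pthread.h>

    /* Many worker threads push results into one shared bin. On a coherent
       chip this "just works"; on a traditional GPU you would instead batch
       the data and move it around with explicit commands. */
    typedef struct {
        pthread_mutex_t lock;
        int             count;
        int             items[1024];
    } SharedBin;

    void bin_push(SharedBin *bin, int item)
    {
        pthread_mutex_lock(&bin->lock);       /* fine-grained synchronization */
        if (bin->count < 1024)
            bin->items[bin->count++] = item;  /* visible to every core via    */
        pthread_mutex_unlock(&bin->lock);     /* the cache-coherence protocol */
    }
    ```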
     
  20. ArchitectureProfessor

    Newcomer

    Joined:
    Jan 17, 2008
    Messages:
    211
    Likes Received:
    0
    I don't understand why under SMT one thread would block the other threads. The whole point of threading is to allow the other threads to continue.

    Let's assume for a second that vector operations are fully pipelined and have a four-cycle latency (that is, a new vector operation can start each cycle, but it takes four cycles to finish). In that case, it could start a vector operation from one thread in one cycle, from a second thread the next cycle, etc. By the time all four threads have started a vector operation, the first thread's operation will be done and it will be ready to execute its next vector instruction.
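    A toy model of that schedule (using the hypothetical numbers above: four threads, a fully pipelined vector unit, four-cycle latency):

    ```c
    #include <stdio.h>

    int main(void)
    {
        enum { THREADS = 4, LATENCY = 4, CYCLES = 8 };

        for (int cycle = 0; cycle < CYCLES; ++cycle) {
            int issuing = cycle % THREADS;               /* round-robin issue */
            printf("cycle %d: issue T%d", cycle + 1, issuing);
            if (cycle >= LATENCY)                        /* result comes back */
                printf(", complete T%d", (cycle - LATENCY) % THREADS);
            printf("\n");
        }
        /* By the time T0 comes around again (cycle 5) its first op has just
           completed, so no thread ever stalls on the vector pipeline. */
        return 0;
    }
    ```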

    Of course, if a single thread has consecutive *independent* vector operations, they would likely just execute in a pipelined fashion without switching threads at all.

    The other key advantage of multithreading is hiding memory latency. If one thread blocks on a cache miss, the other threads keep going.

    Although most systems have a hard time reaching peak performance, having 4 threads per processor to suck up ALU bandwidth will help Larrabee get much closer to peak performance than systems without threads (such as Intel's current multi-core chips).

    Of course, the big downside is that now the programs need to generate 128 threads (32 cores x 4 threads), which isn't a trivial task.
     
