Larrabee: Samples in Late 08, Products in 2H09/1H10

B3D News

When Doug Freedman asked Paul Otellini about Larrabee, we didn't think much would come out of it. But boy were we wrong: Otellini gave an incredibly to-the-point update on the project's timeframe. Rather than try to summarize, we'll just quote what Otellini had to say here.

Read the full news item
 
Interesting.

So is Larrabee a pure x86 core array, or does each core have co-processing and fixed-function hardware with it?

If it's pure x86, I don't see it doing much better than an 8-core Nehalem.

(Edit: by DX11 functionality, I mean floating-point co-processors and fixed-function units that allow DX functionality to map onto the architecture.)

However, as a hybrid multi-core chip, whereby you get DX11 graphics functionality plus all the bells-and-whistles flexibility of the x86 cores, it could be awesome, no? Imagine how amazing it could be for a graphics subsystem not to have to go back to the main CPU and main memory for any of its calculations.

The main CPU would just become a co-processor as and when needed, without in any way being slowed down by the PCIe bus and the rest of the system. Plus, I wouldn't be surprised if they did a 4GB workstation card in 2010, with scalability beyond that for massive farm rendering and scientific computing.

I really think it could be awesome if that's the kind of direction it's taking.

The main issue, of course, is how you render a real-time scene without having to duplicate the data sets in memory. And if there were a way to distribute the rendering, how would you avoid massive latency and dependency between cores during rendering?

Does it take a paradigm shift to find the answers?
 
kyetech said:
Interesting.

So is Larrabee a pure x86 core array, or does each core have co-processing and fixed-function hardware with it?

If it's pure x86, I don't see it doing much better than an 8-core Nehalem.

But an 8-core Nehalem is only rumored to be capable of ~200 DP GFLOPS, whereas Larrabee is north of a TFLOP.
 
kyetech said:
Interesting.

So is Larrabee a pure x86 core array, or does each core have co-processing and fixed-function hardware with it?

If it's pure x86, I don't see it doing much better than an 8-core Nehalem.
Some Intel slides have Larrabee as an array of 16-24 in-order, 4-way-threaded (SMT? SoEMT? FMT?) cores that support x86 plus an expanded vector instruction set. The expanded vector unit(s?) run on 512-bit registers.

Fixed-function hardware is something of an enigma. I've seen slides and rumors going either way on this, and it may be that some variants of the design won't have any.
 
@ShaidarHaran:
Will the 1TF be as useful as the 200GF?
Will it be enough on its own to warrant having the new architecture?
Will DX11 have enough flexibility that a 2TF (and even more in SLI) graphics card of 2010 wouldn't suffice?
Will 1TF be enough to enable real-time rendering on it, without the use of fixed function or co-CPUs?

@3dilettante: well, it's going to be very interesting to see what you actually get for your money then. Two (or more) variants make a lot of sense though, to compete well in different spaces rather than being a jack-of-all-trades, good-at-none type of system.
 
kyetech said:
So is Larrabee a pure x86 core array, or does each core have co-processing and fixed-function hardware with it?
From what I know it has both: an array of x86 cores and fixed-function HW. The FF should be for texture sampling, I assume; it is awfully expensive to do on a programmable CPU. Basically, this is what Sony used to show that RSX can achieve around 1TF of computing power: they simply took the sampling HW and calculated how many instructions a regular CPU/GPU would take to compute the same thing in software :)

Btw, can anyone make a rough guess at how many instructions it takes to take a 16x anisotropic sample from a 3D texture?
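For a rough sense of scale, here is what a single software bilinear tap from a 2D float texture looks like (a minimal sketch; coordinates are assumed to land in the texture interior, no mipmapping, no format conversion, one channel):

/* Illustrative software bilinear sample from a w x h single-channel
 * float texture. Real sampling HW also handles wrap/clamp modes, LOD
 * selection, format decompression, etc. Assumes u,v map inside the
 * texture so the -0.5 texel offset stays in range. */
static float bilinear(const float *tex, int w, int h, float u, float v)
{
    float x = u * (float)w - 0.5f;      /* texel-space coordinates */
    float y = v * (float)h - 0.5f;
    int   x0 = (int)x, y0 = (int)y;
    float fx = x - (float)x0;           /* fractional filter weights */
    float fy = y - (float)y0;
    int   x1 = x0 + 1, y1 = y0 + 1;

    float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
    float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];

    float top = t00 + fx * (t10 - t00); /* lerp along x, row y0 */
    float bot = t01 + fx * (t11 - t01); /* lerp along x, row y1 */
    return top + fy * (bot - top);      /* lerp along y */
}

That is already on the order of 20 ALU ops plus four loads for one channel of one bilinear tap. A trilinear 3D sample needs eight loads and seven lerps per fetch, and 16x aniso takes up to 16 such probes plus weighting, so a few hundred instructions per sampled texel seems a fair ballpark.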
If it's pure x86, I don't see it doing much better than an 8-core Nehalem.
Why so? It's not as if GPUs run a lot of OoO code with lots of random reads from RAM.

The main issue, of course, is how you render a real-time scene without having to duplicate the data sets in memory.
I see no reason why it would be any different than it is today.
But an 8-core Nehalem is only rumored to be capable of ~200 DP GFLOPS, whereas Larrabee is north of a TFLOP.
Actually, I have no idea what Nehalem's speed will be, but I do know that Intel said an 8-core Gesher* at 4GHz can achieve that speed (page 31). Larrabee is stated to be at 1.7-2.5GHz with 16-24 cores, achieving 0.2-1TF/s. I remember some slides showing Larrabee at 48 cores in 2010. Assuming a somewhat higher clock speed to go with that, I wouldn't be surprised if we had around 2-3TF to play with in 2010 :)


*) Got the PDF before they castrated it, yay for browser caches ;)
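As a sanity check on those figures, peak throughput is just cores x clock x flops-per-cycle. A quick sketch (purely illustrative; the 16 flops/cycle assumes one 16-wide single-precision vector op per cycle, which is an assumption, not a confirmed spec):

#include <stdio.h>

/* Back-of-the-envelope peak throughput from the rumored figures above.
 * flops_per_cycle = 16 assumes one 16-wide SP vector op per cycle;
 * a fused multiply-add would double it. Both are assumptions. */
static double peak_tflops(int cores, double ghz, int flops_per_cycle)
{
    return cores * ghz * flops_per_cycle / 1000.0;
}

int main(void)
{
    printf("16 cores @ 1.7GHz: %.2f TF\n", peak_tflops(16, 1.7, 16)); /* 0.44 */
    printf("24 cores @ 2.5GHz: %.2f TF\n", peak_tflops(24, 2.5, 16)); /* 0.96 */
    printf("48 cores @ 2.5GHz: %.2f TF\n", peak_tflops(48, 2.5, 16)); /* 1.92 */
    return 0;
}

The 24-core case lands right at the quoted 1TF/s and a 48-core part doubles it; add FMA or a higher clock and the 2-3TF guess for 2010 falls out naturally.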
Will 1TF be enough to enable real-time rendering on it, without the use of fixed function or co-CPUs?
Assuming that texture sampling is still done (mostly) in dedicated HW, I'd say yes, you can do pretty good real-time rendering on Larrabee.


Anyway, here should be pretty much all the information about Larrabee known so far. If there is something missing, just tell me :)


[edit]

I remember some link showing that Larrabee has two memory controllers on the chip at opposite ends of the ring bus, with the fixed-function hardware there also. That sounds quite logical too, assuming the sampling is there: when a core asks for a sample, it likely has to go through the memory controller anyway.

[edit2]
Duh, it was in that same PDF, on page 16 :D
 
Theoretical maximum G- or TFLOPS are a chapter of their own, if one considers the gap between the theoretical maximum on paper and real-time throughput.

Irrespective of the above, I'd be very surprised if both GPU vendors haven't exceeded the 2TFLOP mark a lot earlier than some would speculate for something that is to arrive in late 2009 at the earliest.
 
Theoretical maximum G- or TFLOPS are a chapter of their own, if one considers the gap between the theoretical maximum on paper and real-time throughput.
True, but assuming Larrabee has somewhat similar efficiency to the Cell SPUs, it might get quite close to peak performance, at least under a graphics workload that gets optimized to oblivion :)

Irrespective of the above, I'd be very surprised if both GPU vendors haven't exceeded the 2TFLOP mark a lot earlier than some would speculate for something that is to arrive in late 2009 at the earliest.

I wouldn't be so sure about that. More than a year ago we had a little under 0.5TF with the 8800GTX. What is the peak today (IIRC not much over 0.5TF)? The next generation will be close to around 1TF, but it will likely not be available before the middle of this year. Add 1.5 years to that and, with a doubling of speed, we will have 2TF by the end of 2009, around the time of Larrabee's supposed release.

Of course, this is simple linear scaling and the real world can work a whole different way :)
 
True, but assuming Larrabee has somewhat similar efficiency to the Cell SPUs, it might get quite close to peak performance, at least under a graphics workload that gets optimized to oblivion :)

In a synthetic, arithmetic-only application? I don't think I'd particularly care about such a rate as a gamer. IMHLO, IHVs should claim X fillrate with Y GFLOPS under Z% shader load.

I wouldn't be so sure about that. More than a year ago we had a little under 0.5TF with the 8800GTX. What is the peak today (IIRC not much over 0.5TF)? The next generation will be close to around 1TF, but it will likely not be available before the middle of this year. Add 1.5 years to that and, with a doubling of speed, we will have 2TF by the end of 2009, around the time of Larrabee's supposed release.

Of course, this is simple linear scaling and the real world can work a whole different way :)

What's a "next generation" to you exactly? I wouldn't be in the least surprised if vendors come close to that mark even within this year.

And no, the time since the G80 release (Nov '06) isn't an indication against it, but rather for it. Take a rating like the one I suggest above (or at least one that makes sense and gives results closer to real-time throughput) and look at how much performance rose between G71 and G80. Even if you simply take the marketing-wash GFLOPS number, it was 150 GFLOPS for the 7900GTX and 518 for the 8800GTX. That's roughly a 3.5x increase in only three quarters, within the same year.

All I'm saying is that you shouldn't underestimate GPU vendors "that" much.
 
In a synthetic, arithmetic-only application?
No, under a real-world game that is shader-limited. I don't think there is much point in talking about non-shader-limited scenarios when we talk about GFLOPS.

What's a "next generation" to you exactly?
I'm not talking about traditional multi-GPU setups, whether on one board or two. R700 with its multi-chip design counts as a single GPU to me. If we start talking about multi-GPU solutions, I'm sure one can put several Larrabees in one box too.

Take a rating like the one I suggest above (or at least one that makes sense and gives results closer to real-time throughput) and look at how much performance rose between G71 and G80.
Well, we could also go back to the 6800 series and see how much performance has increased per year since then. The average is far from the jump G80 gave.
Even if you simply take the marketing-wash GFLOPS number, it was 150 GFLOPS for the 7900GTX and 518 for the 8800GTX. That's roughly a 3.5x increase in only three quarters, within the same year.
I admit I didn't count the "missing mul" on G80 and thus said "under 0.5TF". From what I've understood, it is quite difficult, if not impossible, to use. Of course, if Larrabee has similar limitations, its peak will be (much) higher than what is achievable in the real world.

All I'm saying is that you shouldn't underestimate GPU vendors "that" much.
I'm just saying I'm not hoping too much for them to keep on tripling GPU speed in 3-month intervals ;) I'd be happy if they can keep doubling speed once a year.
 
One thing to note is that Larrabee's tentative TFLOP is in double precision.
If Larrabee follows the x86 tradition, that means single precision would have double the throughput.

RV670 seems to have a similar cut in DP throughput, with the exception that the complex unit doesn't do DP and DP doesn't mix with an FMADD.
 
Well, we could also go back to the 6800 series and see how much performance has increased per year since then. The average is far from the jump G80 gave.

Then you should really go back and look at the exact timelines and start estimating when it's time for the next major increase, I guess.

I admit I didn't count the "missing mul" on G80 and thus said "under 0.5TF". From what I've understood, it is quite difficult, if not impossible, to use. Of course, if Larrabee has similar limitations, its peak will be (much) higher than what is achievable in the real world.

I didn't deduct any GFLOPS from the G71 in my former comparison, did I? But if one digs into each of the two pipelines and sees what has been de-coupled since, the resulting factor when it comes to performance increases is huge nonetheless. If one is even meaner and tortures both with unoptimized AF, it gets into ridiculous territory: in a scenario I considered worst-case, the G71 was losing roughly 50% going from default "quality" to "high quality" AF, whereas the G80 showed only an 18% difference.

I'm just saying I'm not hoping too much for them to keep on tripling GPU speed in 3-month intervals ;) I'd be happy if they can keep doubling speed once a year.

What you fail to realize here is that every 2-2.5 years there's a major increase in performance. No one said 3-month intervals, and NV's and AMD's spring lineups don't look like "twice the performance" compared to G80.
 
I see no reason why it would be any different than it is today.
Actually, I have no idea what Nehalem's speed will be, but I do know that Intel said an 8-core Gesher* at 4GHz can achieve that speed (page 31). Larrabee is stated to be at 1.7-2.5GHz with 16-24 cores, achieving 0.2-1TF/s. I remember some slides showing Larrabee at 48 cores in 2010. Assuming a somewhat higher clock speed to go with that, I wouldn't be surprised if we had around 2-3TF to play with in 2010 :)

*) Got the PDF before they castrated it, yay for browser caches ;)

Interesting. Something doesn't add up, though. Gesher is a full generation on from Nehalem, which is itself rumored to be 10-25% faster per core than Penryn. With Penryn we already have north of 100 GFLOPS with 4 cores. Why would Gesher go backwards in per-core performance compared to an existing architecture, let alone Nehalem?
 
It would be a lot easier if we plotted a timeline of GPU speed increases over the years. I propose starting from NV40 to get a better idea. Unfortunately, I don't know the exact GFLOPS ratings of the earlier series, so someone else will have to provide them. Information about the later series has already been provided.

[edit]

I highly doubt Nehalem has many improvements that can raise theoretical peak DP SIMD throughput. Also, I think you have mixed up DP and SP. To achieve 100 DP GFLOPS, a quad-core Penryn would need to clock at around 100GF / 4 cores / 4 flops per cycle = 6.25GHz. Gesher likely achieves its 200 DP GF with 4 DP MADD instructions per cycle or with double-width SIMD, though the numbers Intel gave in that PDF don't match that for some reason (7 instructions per cycle with SIMD?)
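The same arithmetic written out as a trivial check (the flops-per-cycle figures are the assumptions stated above, not confirmed specs):

#include <stdio.h>

int main(void)
{
    /* Clock a quad-core Penryn would need for 100 DP GFLOPS, assuming
     * 4 DP flops/cycle per core (one 2-wide SSE add + one 2-wide mul): */
    double ghz = 100.0 / (4 /* cores */ * 4 /* DP flops per cycle */);
    printf("Penryn: %.2f GHz needed\n", ghz);          /* 6.25 GHz */

    /* What Gesher's rumored 200 DP GFLOPS at 4GHz implies per core: */
    double fpc = 200.0 / (8 /* cores */ * 4.0 /* GHz */);
    printf("Gesher: %.2f DP flops/cycle/core\n", fpc); /* 6.25 */
    return 0;
}

6.25 DP flops/cycle per core sits between the clean powers of two, which may be why the PDF's numbers look odd; 8 flops/cycle (MADDs over double-width SIMD) would give 256 GFLOPS at 4GHz.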
 
Larrabee has almost no special-purpose graphics hardware

Larrabee is a bunch of x86 cores (from what I hear, 32 cores with four threads each). However, these cores are augmented with 64-byte vectors, which is significantly wider than 16-byte SSE-style vectors. A 64-byte vector holds 16 single-precision floating-point values.

If the vector unit is fully pipelined to allow a throughput of one vector operation per cycle, that is 16 flops/cycle * 32 cores = 512 flops/cycle. At 2GHz, that is 1 Tflop/s of raw throughput. It would be even higher if you count fused multiply-add instructions.

From what I understand, the only special-purpose graphics hardware in Larrabee is special vector instructions specifically tailored for the inner loops of graphics calculations. That is, there are no special GPU hardware blocks, just special graphics instructions. This is a real departure from how other GPUs work but, if it works, could really change the way we think about GPUs (that is, GPUs just become multi-core CPUs!)

That is why Larrabee is such a threat to NVIDIA. Larrabee converts graphics processing from the special-purpose hardware domain to *the* killer application for many-core processors.

This should be an interesting fight to watch.
 
If the vector unit is fully pipelined to allow a throughput of one vector operation per cycle, that is 16 flops/cycle * 32 cores = 512 flops/cycle. At 2GHz, that is 1 Tflop/s of raw throughput. It would be even higher if you count fused multiply-add instructions.
It would probably have to be fully pipelined for most, if not all, such ops.
If Larrabee uses SMT, one thread is going to block the other three threads.
If it uses round-robin FMT or something similar, there's an implied need for a single-cycle switch-over.

From what I understand, the only special-purpose graphics hardware in Larrabee is special vector instructions specifically tailored for the inner loops of graphics calculations. That is, there are no special GPU hardware blocks, just special graphics instructions. This is a real departure from how other GPUs work but, if it works, could really change the way we think about GPUs (that is, GPUs just become multi-core CPUs!)

That is why Larrabee is such a threat to NVIDIA. Larrabee converts graphics processing from the special-purpose hardware domain to *the* killer application for many-core processors.

This should be an interesting fight to watch.

If there are specialized instructions, it would hint that there is some kind of specialized hardware to go with them.
I guess Intel could just run them all through microcode and synthesize them with dozens of standard ops, but that seems wasteful.

Specialized hardware can accomplish a lot that takes either more time or, more critically for a massive multicore, more power on general hardware.
Multiply that over 24 cores and things get iffy.

Intel could be sprinkling some more specialized hardware somewhere in Larrabee, if only for that reason.
 
If there are specialized instructions, it would hint that there is some kind of specialized hardware to go with them.

I totally agree.

Let me clarify my previous post. What I intended to say is that there are new instructions *and* special-purpose hardware ALUs to support those instructions. However, unlike current GPUs, there is no *other* special hardware: no fragment pipeline or special z-buffering frame-buffer hardware (or whatever else GPUs have today). Just many x86 cores with extra vector ALUs for executing the new instructions tailored for graphics processing.

The key difference is the programming model. For Larrabee, a program can just use inline assembly (or library calls) to insert these vector operations into a regular program. There is no special setup or other low-level, implementation-specific poking of the hardware to get the special-purpose units going. Just as SSE isn't conceptually difficult to add to a program (assuming it has the right sort of data parallelism), these vectors will be similarly easy to use.
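For comparison, this is what that model already looks like with today's 4-wide SSE; the claim is that Larrabee's 16-wide vectors would drop into ordinary code the same way (plain SSE shown here, since the actual Larrabee instruction set hasn't been published):

#include <xmmintrin.h>  /* SSE intrinsics */

/* y[i] += a * x[i], four floats at a time. On Larrabee the same loop
 * would reportedly use 16-wide vectors (plus graphics-oriented ops)
 * instead of 4-wide SSE. Assumes n is a multiple of 4 and 16-byte
 * aligned pointers. */
void scale_add(float *y, const float *x, float a, int n)
{
    __m128 va = _mm_set1_ps(a);
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);
        __m128 vy = _mm_load_ps(y + i);
        _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
}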

Another key point is that Larrabee has coherent caching (just like Intel's other multi-core systems). Unlike a GPU, which requires explicit commands to move data around the system and/or flush caches at the right time, all of that is done seamlessly on Larrabee. Instead of burdening the programmer with all these issues, Larrabee really is just a shared-memory multiprocessor on a chip.
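To make the contrast concrete, on a cache-coherent chip handing a result from one core to another is just ordinary stores and loads plus normal synchronization; no DMA command, no explicit cache flush. A minimal pthreads sketch (generic shared-memory code, not anything Larrabee-specific):

#include <pthread.h>
#include <stdio.h>

static float result[4];   /* shared memory: every core sees one copy */
static int ready = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 4; i++)
        result[i] = 0.5f * i;      /* plain stores; coherence does the rest */
    pthread_mutex_lock(&mtx);
    ready = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_mutex_lock(&mtx);
    while (!ready)
        pthread_cond_wait(&cv, &mtx);
    pthread_mutex_unlock(&mtx);
    printf("consumer sees %.1f\n", result[3]);  /* plain load, no copy-back */
    pthread_join(t, NULL);
    return 0;
}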

Beyond providing a familiar programming model for development of OpenGL/DX drivers and other low-level software, this also allows for more dynamic algorithms that can share data and synchronize more finely using locks and such. As advanced graphics algorithms are becoming less and less regular, such support could really provide a big boost in what kinds of tasks and algorithms Larrabee can tackle.
 
If Larrabee uses SMT, one thread is going to block the other three threads.
If it uses round-robin FMT or something similar, there's an implied need for a single-cycle switch-over.

I don't understand why under SMT one thread would block the other threads. The whole point of threading is to allow the other threads to continue.

Let's assume for a second that vector operations are fully pipelined and have a four-cycle latency (that is, a new vector operation can start each cycle, but each takes four cycles to finish). In that case, the core could start a vector operation from one thread in one cycle, from a second thread the next cycle, and so on. By the time all four threads have started a vector operation, the first thread's operation will be done and that thread will be ready to execute its next vector instruction.

Of course, if a single thread has consecutive *independent* vector operations, they would likely just execute in a pipelined fashion without switching threads at all.
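A toy trace makes the arithmetic obvious, using the hypothetical four threads and four-cycle latency from above:

#include <stdio.h>

/* Round-robin issue over 4 threads with a fully pipelined vector unit
 * and a 4-cycle latency (hypothetical numbers from the post above).
 * Each thread's result is ready exactly when its next issue slot
 * comes around, so even fully dependent ops never stall the unit. */
int main(void)
{
    enum { THREADS = 4, LATENCY = 4 };
    for (int cycle = 0; cycle < 8; cycle++) {
        int t = cycle % THREADS;  /* which thread issues this cycle */
        printf("cycle %d: thread %d issues, result ready at cycle %d\n",
               cycle, t, cycle + LATENCY);
    }
    return 0;
}

Thread 0 issues at cycle 0, its result lands at cycle 4, and cycle 4 is precisely thread 0's next turn.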

The other key advantage of multithreading is hiding memory latency. If one thread blocks on a cache miss, the other threads keep going.

Although most systems have a hard time reaching peak performance, having four threads per processor to soak up ALU bandwidth will help Larrabee get much closer to peak performance than systems without threads (such as Intel's current multi-core chips).

Of course, the big downside is that programs now need to generate 128 threads, which isn't a trivial task.
 