Larrabee at Siggraph

A 48-core one would be 768 MADDs/clock, so at 2 GHz that's ~3 TFLOPS?
15 MB cache.
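
A quick sanity check on that arithmetic (a minimal sketch; the 48-core count, 16-wide vector unit and 2 GHz clock are assumptions from this thread, not confirmed figures):

```python
# Back-of-the-envelope peak throughput for a hypothetical 48-core part.
cores = 48
lanes = 16            # assumed 16-wide vector unit per core
flops_per_madd = 2    # a multiply-add counts as two FLOPs
clock_hz = 2e9        # assumed 2 GHz

madds_per_clock = cores * lanes                                  # 768
peak_tflops = madds_per_clock * flops_per_madd * clock_hz / 1e12

print(f"{madds_per_clock} MADDs/clock -> {peak_tflops:.2f} TFLOPS")
# 768 MADDs/clock -> 3.07 TFLOPS
```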

Since Larrabee has very little fixed-function, graphics-only hardware, it needs to outrun its contemporaries by a fair margin on pure speed just to remain competitive. The 4870 X2 can already do 2.4 TFLOPS. It has been speculated that RV8xx brings only a 20% improvement in horsepower, but I suspect the improvement will be more generous, if only defensively. Which is why I feel either the core count or the clock (or both) will be higher than 48 / 2 GHz respectively.

I haven't seen the L2 cache numbers reported anywhere. Is that 15 MB figure for the L2 cache?
 
OK, so all this stuff about 'coding a software renderer as if it's x86' is pure horseshit.
Scheduling & the API are all that's 'in software', & the API is already in software for ATI/Nvidia anyway.

Well yes and no. The difference between ATi/nVidia and Intel isn't as large as Intel wants you to believe.
You could code a software renderer with e.g. CUDA as well, if you so desire. In fact, nVidia has an offline renderer by the name of Gelato, which has been around since the GeForce 6.
It's not x86, but other than that it's a 'software' implementation of a renderer, implementing various features and rendering techniques that are not natively supported by the hardware, but can be implemented through software routines.

To write your own software renderer to make use of Larrabee, you'd need to write your own everything, including a scheduler that can load-balance & hide latency across tens of cores & many, many threads!
Presumably Intel would provide libraries for those functions, but then you're still using a 3rd-party API.
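
For a sense of what even the simplest version of that scheduling layer looks like, here is a minimal sketch of a generic worker-pool pattern (not Intel's scheduler; the tile-job idea is hypothetical):

```python
# One shared queue of tile-sized jobs, pulled by one worker thread per core,
# so slow tiles don't leave other cores idle.
import queue
import threading

tasks = queue.Queue()

def worker():
    while True:
        job = tasks.get()
        if job is None:          # sentinel: no more work for this worker
            break
        job()                    # e.g. shade one screen tile
        tasks.task_done()

def run(num_cores, tile_jobs):
    threads = [threading.Thread(target=worker) for _ in range(num_cores)]
    for t in threads:
        t.start()
    for job in tile_jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(None)          # one sentinel per worker
    for t in threads:
        t.join()
```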

Yes, but the idea behind all that is that you can implement an alternative renderer, and not be bound by the limits of the Direct3D or OpenGL programming model.
So there will be virtually no limits on what kind of drawing primitives you use, what your shaders can and cannot do, what kind of textures you use etc.
Intel has been hinting at raytracing pretty obviously.

So back to the hardware: clock for clock, ignoring special functions for the time being (emulated at reduced speed on the x86 integer cores?) & assuming that the Larrabee Vec16 only does MADDs:

RV770 is 10*(16*(1+1+1+1+1)) = 800 SP, right?
A 48 core Larrabee would be 48*(16*1) = 768 SP

So one ATI 16*(1+1+1+1+1) VLIW SIMD = 5 Larrabee cores
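
The same comparison as a quick script, counting only MADD-capable FP32 lanes (the 48-core figure is still an assumption):

```python
# Ignore clocks and special functions; count vector lanes only.
rv770_simds = 10
rv770_sp = rv770_simds * 16 * 5       # 16 units of 5-wide VLIW per SIMD = 800

larrabee_cores = 48                   # assumed core count
larrabee_sp = larrabee_cores * 16     # one 16-wide vector unit per core = 768

lanes_per_rv770_simd = rv770_sp // rv770_simds   # 80
print(lanes_per_rv770_simd // 16)                # 5 Larrabee cores per RV770 SIMD
```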

It doesn't quite work that way. If you apply this logic to ATi vs nVidia, then ATi would have the fastest GPU by far. Yet it is nVidia that comes out on top.
The reason is that nVidia uses its processing units in a completely different way, making it far more efficient than ATi's.
So aside from the number of processing units, one big unknown in this story is how efficient Intel's units will be in practice.
We now know that they will use tile-based rendering, which is quite a different approach from nVidia/ATi. This makes it even harder to make direct comparisons. Basically, we know neither the hardware nor the software that will drive Direct3D/OpenGL applications on Larrabee.
Also, don't forget that each core on Larrabee will get 4-way SMT (HyperThreading). My suspicion is that they will use this 4-way SMT to 'multiplex' shading and 'fixed-function' operations in their rasterizer. One will mostly use the x86 integer units, and the other will use the SIMD unit, so you will get nice parallelism.
 
I've read a couple of articles and didn't see explicit mention of this - do we know what the ratio of cores:'texture units' is?
 
I've read a couple of articles and didn't see explicit mention of this - do we know what the ratio of cores:'texture units' is?

I haven't seen it either, but I can tell you their slides specifically mention that if they hadn't put texture blocks in, "Code would take 12x longer for filtering or 40x longer if texture decompression is required". So it clearly was performance-driven...
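
To give a feel for where numbers like that come from, here is a minimal scalar bilinear fetch, an illustrative sketch only (not Intel's code; real software filtering would also handle mipmaps, formats and decompression):

```python
import math

def bilinear_fetch(texels, width, height, u, v):
    """Sample a flat list of grayscale texels at normalized (u, v)."""
    # Address math: map to texel space and find the 2x2 footprint.
    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0
    x0, y0 = x0 % width, y0 % height              # wrap addressing
    x1, y1 = (x0 + 1) % width, (y0 + 1) % height
    # Four gathered loads plus the weighting math, for every single sample;
    # this is the work a fixed-function texture unit does for free.
    t00 = texels[y0 * width + x0]
    t10 = texels[y0 * width + x1]
    t01 = texels[y1 * width + x0]
    t11 = texels[y1 * width + x1]
    top = t00 * (1 - fx) + t10 * fx
    bottom = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy
```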
 
Looks like people get less sceptical of Larrabee every time Intel releases more info.

Well, if people don't get less skeptical the more detailed info you give them, you might want to consider if you're doing something seriously wrong. :LOL:

I can't wait to see the first demonstrations of their technology, which should be sometime later this year.
The theory sounds quite interesting, but how well will it work in practice?


As David Kirk observed recently, "slideware is always perfect". Let's see what happens when devs start reporting back from having used the dev samples that are expected later this year. Will DX compatibility be bullet-proof and reasonably performant? They don't have to kick everybody else's butt right out of the gate, but "you only get one chance to make a first impression". So making a solid impression of "hey, this thing is pretty decent and likely to get better the longer they keep working at it" would be a very big plus.
 
Oh yes, the origin of a rumor that's been floating around, methinks:

Intel Fact Sheet said:
The Larrabee architecture has a pipeline derived from the dual-issue Intel Pentium® processor, which uses a short execution pipeline with a fully coherent cache structure.
 
The tidbits shown thus far are interesting, though it's a bit short of what I'd call substantial at this early date.

The core area wasn't disclosed from what I've seen. Anandtech seemed willing to speculate from a speculative musing Intel chose to disclose, but the cores are so small that even a few millimeters would be enough to throw things off by a significant amount.
The chip area was also not indicated.

Power numbers would be hard to come by, with perhaps only very preliminary silicon available.

Actual performance numbers would be hard to come by as well, but don't we all just love scalability graphs with no baseline? ;)

If all cores can participate in setup, perhaps we won't be complaining that Larrabee is setup-limited in the future.
 
I haven't seen it either, but I can tell you their slides specifically mention that if they hadn't put texture blocks in, "Code would take 12x longer for filtering or 40x longer if texture decompression is required". So it clearly was performance-driven...


Yeah, looking at their one slide that shows the texture logic, it's just in one block off to the side. I'm sure that's not representative of the ratio though :p

Not all articles mention this, but it sounds like there's also other fixed-function logic for parts of render setup.

Too bad they couldn't provide simulated performance numbers for the games, aside from speedup and bandwidth usage. Although it's understandable, since they may differ from the final products if there are still some architectural decisions to be made (like, for example, the tex:core ratio).

I'm excited by it. I love the idea of a CPU/GPU hybrid that can be reasonably competitive with peer GPUs. Obviously we've no idea if Larrabee will be so, but I hope it is. I would love for a console manufacturer to put two 200-250 mm^2 CPU/GPU hybrids into a console, instead of a discrete GPU and CPU. A nice homogeneous chunk of processing power spread across two chips that can scale up to 2x the capability of one peer GPU of the time, if required... :) ~dreams
 
Why can't Intel slap 16-wide vector ALUs onto current CPUs (Nehalem etc...) rather than the current SSE units?

Wouldn't that provide a massive performance jump in the type of code that benefits from this for a relatively small size increase?
 
If all cores can participate in setup, perhaps we won't be complaining that Larrabee is setup-limited in the future.

Yeah, it'll be interesting to see what other chokepoint "wins" they might have. But might some of those be forced by DX? By that I mean, don't you potentially lose some of that perfect flexibility as soon as you are in the context of what an API is enforcing?
 
Why can't Intel slap 16-wide vector ALUs onto current CPUs (Nehalem etc...) rather than the current SSE units?

Wouldn't that provide a massive performance jump in the type of code that benefits from this for a relatively small size increase?

The units would have to be fed, and how Larrabee does it exactly wasn't spelled out.

Core 2 upped the maximum size of its cache reads to match the wider SIMD units. A balanced general-purpose design with 16-wide vectors would have to quadruple the read size all over again.

Given the heavier infrastructure for memory traffic that OOE and speculation (at 3 GHz) bring in, I would imagine it's more complex to shoehorn in such a large unit.
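
A rough version of that feed-rate arithmetic, assuming FP32 lanes and one vector load per operand (the actual Larrabee load-path width wasn't disclosed):

```python
# Bytes a single vector load has to deliver, per SIMD width.
def bytes_per_vector_load(lanes, bytes_per_lane=4):   # FP32 lanes assumed
    return lanes * bytes_per_lane

sse_load = bytes_per_vector_load(4)     # 16 B: what Core 2's widened reads feed
wide_load = bytes_per_vector_load(16)   # 64 B: a full cache line per load

print(sse_load, wide_load, wide_load // sse_load)     # 16 64 4
# i.e. the read path would have to grow ~4x again just to keep the unit fed
```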
 
Yeah, it'll be interesting to see what other chokepoint "wins" they might have. But might some of those be forced by DX? By that I mean, don't you potentially lose some of that perfect flexibility as soon as you are in the context of what an API is enforcing?

You'll lose a lot of flexibility, I suppose. DX9/10 are still quite limited in how you can input/output data, even compared to something like CUDA, and Larrabee will probably be even closer to a regular CPU in terms of flexibility than CUDA already is.
The question is how many of those limits will be relieved by DX11... and how successful Intel will be at convincing developers to go 'outside' the API and develop their own software renderers instead.

Intel seems to think of the DX9/DX10 (and DX11?) compatibility as just a 'transitional phase'. They have bigger plans for the future, supporting D3D/OGL is just a necessary evil at this point.
 
Yeah, it'll be interesting to see what other chokepoint "wins" they might have. But might some of those be forced by DX? By that I mean, don't you potentially lose some of that perfect flexibility as soon as you are in the context of what an API is enforcing?

I'm not certain I'm going out on a limb by saying I think the answer is yes.

On the subject of chokepoints:
The entire "binning" scheme description didn't quite get into how those bins are populated.

It's not clear to me how Intel's scheme can be perfectly distributed across its many cores, when the basic step of determining a triangle's screen coordinates for the sake of binning is one that GPUs have so far not distributed in their hardware.

We know that Intel claims to have beaten various GPU bottlenecks, but I'm going to withhold my cheers until I know what Achilles' heels they don't want to disclose just yet.
 
Might it be possible to split the cores, so some work toward a DirectX frame while others do custom work for selected objects in another frame, and then composite the frames?

I.e. mixed DX or OGL and custom rendering?
 
Supporting a standard 3D API is not a necessary evil imho; it's not like everyone will start writing their own software renderers tomorrow.
A fast OpenGL/D3D implementation is going to be the most important thing for Larrabee for years to come; all sorts of other improvements are probably going to be exposed as extensions to these APIs.
Only a few brave developers will probably write their own thing from scratch.
 
The question is how many of those limits will be relieved by DX11... and how successful Intel will be at convincing developers to go 'outside' the API and develop their own software renderers instead.

Intel seems to think of the DX9/DX10 (and DX11?) compatibility as just a 'transitional phase'. They have bigger plans for the future, supporting D3D/OGL is just a necessary evil at this point.

Well, Intel tends to think it is their right to Rule Them All and in the Darkness Bind Them. :LOL:

I think rich middleware would be a big help here.

If they really intend to take over the entire graphics industry, then let's see their IGPs (60% of the market?) go this route as well. Otherwise, just on sheer volume (or more accurately, the tiny volume in the first years compared to the total installed base), the vast majority of ISVs who aren't paid boatloads of money directly by Intel for doing it will just laugh themselves silly at the idea that they should write their own renderers to maximize Larrabee performance/results.
 
It's not clear to me how Intel's scheme can be perfectly distributed across its many cores, when the basic step of determining a triangle's screen coordinates for the sake of binning is one that GPUs have so far not distributed in their hardware.
What about distributing draw commands to multiple cores? Though maintaining binning data is probably going to require atomic operations.
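
As a rough illustration of that idea, here is a minimal sketch of binning distributed across worker threads, with a lock per tile standing in for the atomic bin updates (the tile size, grid and helper names are all hypothetical, not Intel's scheme):

```python
import threading

TILE = 64                                       # assumed tile size in pixels
SCREEN_W, SCREEN_H = 1920, 1080
GRID_W = (SCREEN_W + TILE - 1) // TILE
GRID_H = (SCREEN_H + TILE - 1) // TILE

bins = [[[] for _ in range(GRID_W)] for _ in range(GRID_H)]
locks = [[threading.Lock() for _ in range(GRID_W)] for _ in range(GRID_H)]

def bin_triangle(tri_id, bbox):
    """bbox = (xmin, ymin, xmax, ymax) in pixels, already clipped to screen."""
    x0, y0, x1, y1 = (int(v) // TILE for v in bbox)
    for ty in range(y0, y1 + 1):
        for tx in range(x0, x1 + 1):
            with locks[ty][tx]:                 # stand-in for an atomic append
                bins[ty][tx].append(tri_id)

def worker(triangles):                          # one worker per core,
    for tri_id, bbox in triangles:              # fed a slice of the draw commands
        bin_triangle(tri_id, bbox)
```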
 