Larrabee at Siggraph

Something easier to understand, perhaps: more direct performance figures, worked backwards from that chart.

(for a 24-core 1GHz Larrabee)

F.E.A.R. 1600x1200 4xAA, frames per second for each frame in the sample set (rounded figures)

152
125
103
103
87
76
90
111
103
120
96
93
96
99
206
90
115
137
180
160
180
206
55
96
103

Average FPS: 119.28
Minimum FPS: 55
Maximum FPS: 206
Median FPS: 103
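For anyone who wants to double-check the arithmetic, here's a trivial Python sketch over the rounded per-frame figures above (the summarize() helper is just my own, nothing official):

[code]
# Recompute the summary stats from the rounded per-frame FPS figures above.
fear_fps = [152, 125, 103, 103, 87, 76, 90, 111, 103, 120, 96, 93, 96,
            99, 206, 90, 115, 137, 180, 160, 180, 206, 55, 96, 103]

def summarize(samples):
    s = sorted(samples)
    return {
        "average": sum(s) / len(s),
        "minimum": s[0],
        "maximum": s[-1],
        "median": s[len(s) // 2],   # 25 samples, so the middle (13th) value
    }

print(summarize(fear_fps))
# -> average 119.28, minimum 55, maximum 206, median 103
[/code]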

So based on this tiny sample set, such a setup might be comparable to 3dilettante's GT200 figure, if not a bit (20+%) better.
 
So 32 cores at 2GHz would average 316fps at 1600x1200 with 4xAA?


(and yeah it is AA, they say "multisampling" -- I missed that the first time).
 

Based on those numbers, and assuming linear scaling with clock-speed, yes.
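In other words, just scaling the measured average linearly by core count and clock -- a rough back-of-envelope sketch, nothing more:

[code]
# Hypothetical linear scaling of the 24-core / 1GHz average.
avg_24core_1ghz = 119.28
scaled = avg_24core_1ghz * (32 / 24) * (2.0 / 1.0)   # more cores, double the clock
print(round(scaled))   # ~318 fps, i.e. roughly the ~316 figure quoted above
[/code]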
 
Seriously, the bit about "We wrote assembly code for the highest-cost sections"... isn't that NV/AMD 3DMark-style optimizing taken to the extreme? Are they avoiding a whole bunch of DX overhead that way?
 
So 32 cores at 2GHz would average 316fps at 1600x1200 with 4xAA?

There are a number of ifs that must hold for that to be true, but if everything else scaled perfectly to the theoretical numbers, maybe.

At about 3x the framerate of GT200's average (early FEAR benchmark numbers btw), we'd also have to consider the fact that while Larrabee is purported to be more bandwidth-efficient, the graphs didn't show it as being 3x more efficient.
 
Seriously, the bit about "We wrote assembly code for the highest-cost sections"... isn't that NV/AMD 3DMark-style optimizing taken to the extreme? Are they avoiding a whole bunch of DX overhead that way?


Presumably Intel engineers won't assembly-optimise the highest cost parts of each frame in every game we play from here on out.. :LOL:

Maybe their claimed pessimism balances that out a little..
I want to believe! :p
 

Is it not possible that the claimed assembly optimisation is just in terms of how it handles DX in general, or does it definitely mean per-game optimisation? That's where the driver team can make a huge difference, right?
 
There's an interesting blurb in there about rasterization:
I guess increasing setup speed in GPUs beyond 1 per clock is indeed a tough task, as dealing with fragments from multiple rasterizers would be tricky. I have a feeling this is one reason that Larrabee is doing well, as it can have each core work on a separate set of primitives (they divide it into chunks of 1,000).
Funny how we were talking about just this just before the paper appeared :smile:

These hints (can't read the paper, have read some articles) are pointing to something that's going to be very impressive. The extremely high utilisation you can get when you don't have an extensive pipeline of fixed-function units appears to be turning out better than I was expecting :LOL:

Jawed
 
Is it not possible that the claimed assembly optimisation is just in terms of how it handles DX in general, or does it definitely mean per-game optimisation? That's where the driver team can make a huge difference, right?


It's hard to know for sure, but you could read it as them having assembly-optimised the highest-cost parts of the 'functional model' they ran the frames against, rather than the highest-cost parts of each frame (i.e. not a different functional model or driver or simulator or whatever, optimised specifically for each frame).

If it was 'just' assembly optimisation of their driver/model/whatever, then presumably that's acceptable..
 
Approximate, rounded FPS numbers for Gears of War 1600x1200 no AA, again for the sample set of 25 frames obviously:

63
111
120
63
111
120
120
85
85
65
65
96
120
103
96
60
72
63
103
96
85
85
90
85
96

Average FPS: 90.32
Minimum FPS: 60
Maximum FPS: 120
Median FPS: 90
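Same quick sanity check as for the F.E.A.R. set, this time fully self-contained via Python's statistics module:

[code]
import statistics

gears_fps = [63, 111, 120, 63, 111, 120, 120, 85, 85, 65, 65, 96, 120,
             103, 96, 60, 72, 63, 103, 96, 85, 85, 90, 85, 96]

print(statistics.mean(gears_fps),    # 90.32
      min(gears_fps),                # 60
      max(gears_fps),                # 120
      statistics.median(gears_fps))  # 90
[/code]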

A wodge of FEAR results:

http://www.guru3d.com/category/vga_1/

From the built-in benchmark. Watch the column headings for pixel count.

Jawed

These are averages I guess?

Assuming our hypothetical 2GHz 32-core Larrabee (side-note: we really need a shorthand for it...), it would rank somewhere just above the 3-way SLI GTX 280 (again, taking into account the caveats previously mentioned, assuming linear scaling with clocks/cores, that the sample frame set is representative, etc.).
 
I guess increasing setup speed in GPUs beyond 1 per clock is indeed a tough task, as dealing with fragments from multiple rasterizers would be tricky. I have a feeling this is one reason that Larrabee is doing well, as it can have each core work on a separate set of primitives (they divide it into chunks of 1,000). They even have a different subsection in each bin to store the results from each core.

I hope ATI/NVidia tackle this issue. I think it's doable.

Looking at how setup limits are popping up in current workloads, I think it's almost a given that next-gen (next-next-gen?) GPUs will have to do something to decrease the amount of time spent on setup.

This might be an area where full parity isn't completely necessary to keep the race pretty much even.

Would DX11's multithreaded rendering help out here? If multiple threads are capable of sending triangles for setup, could a multiplicity of rasterization units be used, as the software would have already declared them independent?
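Roughly how I picture the scheme described above, as a sketch -- the chunks of ~1,000 primitives and the per-core bin subsections come from the descriptions, while the tile size, bounding-box binning, data layout and serial loop are just my own stand-ins:

[code]
# Sketch: primitives are split into chunks, each core bins its own chunk, and every
# screen tile's bin keeps a separate subsection per core, so cores never contend
# for the same list. (Serial loop here; the real thing runs the cores in parallel.)

CHUNK_SIZE = 1000    # "chunks of 1,000" primitives per core
TILE_SIZE = 64       # hypothetical tile size in pixels
NUM_CORES = 24

def tiles_touched(prim):
    """Screen tiles overlapped by a primitive's bounding box."""
    x0, y0, x1, y1 = prim["bbox"]
    for ty in range(y0 // TILE_SIZE, y1 // TILE_SIZE + 1):
        for tx in range(x0 // TILE_SIZE, x1 // TILE_SIZE + 1):
            yield (tx, ty)

def bin_primitives(prims):
    # bins[(tx, ty)][core] is that core's private subsection of the tile's bin
    bins = {}
    for core in range(NUM_CORES):
        chunk = prims[core * CHUNK_SIZE:(core + 1) * CHUNK_SIZE]
        for prim in chunk:                      # each core walks only its own chunk
            for tile in tiles_touched(prim):
                bins.setdefault(tile, {}).setdefault(core, []).append(prim)
    return bins

prims = [{"bbox": (10, 10, 200, 90)}, {"bbox": (500, 300, 700, 380)}]
print(sorted(bin_primitives(prims).keys()))
[/code]

Presumably the back end can then process each tile independently on whichever core picks it up, reading through that tile's per-core subsections.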

A wodge of FEAR results:

http://www.guru3d.com/category/vga_1/

From the built-in benchmark. Watch the column headings for pixel count.

Jawed
Those numbers would make things look rosier for the GT200, since the frame rates are higher than the average I googled up and 16x AF was added in to boot.
 
I remember Fudo reporting a Larrabee launch in summer of 2009. That's around when GT300/RV8xx are expected to launch. :D

Lots of stuff told and lots of stuff still hidden. Their scaling slides showed 48-core data, so I think a 48-core beast can be expected. Having said that, they might want to hide its bigger pal until right up to launch.:devilish:

Eagerly waiting for their SIGGRAPH presentation. Hopefully, they'll tell us much more on that day.:smile:

As a GPGPU person, I am really pleased with the texturing hardware. Lots of (non-graphics) problems have locality of reference, and if it has some cache devoted specifically to texturing then that's really good. Their other features such as address clamping, in-hardware interpolation and address wraparound often come in useful in non-graphics stuff too. I just hope these features are implemented in hardware too.

I haven't read all of today's comments yet so forgive me if I'm not the first to posit this idea, but who wants to bet that "extra" 128KB L2 per core is actually texture cache? I mean, Anand's "extra cache for other stuff" statement is far too vague to be useful.
 
If they really intend to take over the entire graphics industry, then let's see their IGPs (60% of the market?) go this route as well.

As far as I know, Intel already officially announced that the X4000-series will be the last of that family, and the next series of IGPs will be based on Larrabee.
 
When rendering to a shadow map, does it matter if you have multiple setup units running in parallel producing triangles that aren't necessarily in the correct order? I think not, since it's only depth being written to the render target.

So the next question is, is it worth adding multiple setup units to a fixed-function GPU for scenarios where colour isn't being written?

Since irregular (non-linear) rasterisation keeps coming up, will we be seeing this in D3D12? If the fixed-function guys tackle this head-on will it also be an opportunity to dump fixed-function setup? Software-based setup and rasterisation running as a "compute shader"?
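The way I think about the shadow-map case: a depth-only pass reduces to a per-pixel min over all the fragments, and min doesn't care about arrival order, so fragments arriving interleaved from several setup/rasterisation units give the same result. A toy illustration (plain Python, nothing Larrabee-specific):

[code]
# Depth-only rendering is a per-pixel min, so fragment order is irrelevant.
import random

def depth_only_pass(fragments, width, height):
    zbuffer = [[float("inf")] * width for _ in range(height)]
    for x, y, z in fragments:
        if z < zbuffer[y][x]:
            zbuffer[y][x] = z        # LESS depth test, no colour writes
    return zbuffer

frags = [(x, y, random.random()) for x in range(4) for y in range(4) for _ in range(3)]
shuffled = list(frags)
random.shuffle(shuffled)             # e.g. output from multiple setup units, interleaved
assert depth_only_pass(frags, 4, 4) == depth_only_pass(shuffled, 4, 4)
[/code]

Once colour (or anything order-dependent like blending) is being written, that argument falls apart, which is the distinction being drawn above.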

Jawed
 
A TBDR-like software rasterizer makes a lot of sense on a multicore CPU, as it makes it easier to distribute the workload over multiple cores while keeping the working data set in L2.
Just think how inefficient software ROPs would be on an immediate mode renderer..

Working in tiles makes sense, but if there's any truth to the deferred rendering part then I'm far from convinced that it's a good idea. Most games do pre-z passes these days, and it's not like the arrival of deferred renderers on the market would change that, because you need the depth for all kinds of effects these days anyway; it's pretty much an integral part of modern game engine design. So immediate mode renderers should already run at close to TBDR efficiency on modern game engines.

Looks like people get less sceptical of Larrabee every time Intel releases more info.
I can't wait to see the first demonstrations of their technology, which should be sometime later this year.
The theory sounds quite interesting, but how well will it work in practice?

I'm plenty sceptical still. ;) I'm not convinced this kind of architecture can beat a specialized GPU design on their home turf. There are probably areas within GPGPU where it'll rock, but I don't think it can beat AMD or Nvidia on graphics on either raw performance, performance / watt or performance / mm^2.

One also has to keep in mind that Intel wasn't first at attempting this. Sony's original plan for PS3 was for it to not have any GPU, but the SPUs would do the graphics. It ended up shipping with a traditional GPU.

Only a few brave developers will probably write their own thing from scratch.

I'm sure Tim Sweeney would be all over it the moment he gets the chance, but he's a bit odd. If Intel wants anyone to take better advantage of this chip, it had better be simple and easy to plug right into an existing codebase. If they are going for the equivalent of "you can just use SOA" then this thing won't be well received by the developer community.

I'm most interested now in what their effective dynamic branching coherency is going to be. It would be really nice if it turns out to be a quad :p Though I wouldn't be surprised if it's actually 4 quads.

I don't see any reason why it couldn't be a quad. Perhaps even single pixel. Although if I'm interpreting things right, I think the most efficient way to use the cores would be to let the 16 wide array work as vec4 on a quad.
 
To write your own software renderer to make use of Larrabee, you'd need to write your own everything, including a scheduler that can load-balance and hide latency across tens of cores and many, many threads!
If they don't provide automated tools to fiberize your code it will be a bloody fucking nightmare. Writing as close to the hardware as you would on a GPU while trying to get good efficiency is by far harder on Larrabee, because the latency hiding is done by software pipelining (essentially software multithreading) rather than handled by hardware multithreading.
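By software pipelining I mean something shaped roughly like this -- the names (issue_texture_fetches, shade_with) are made-up stand-ins, not any real Larrabee API, and the "fetch" is fake, purely to show the overlap:

[code]
# Software-pipelined latency hiding: start batch i+1's texture fetches before doing
# the ALU work for batch i, so the fetch latency is covered by useful work.

def issue_texture_fetches(batch):
    return [t * 2 for t in batch]                    # stand-in for kicking off async fetches

def shade_with(batch, fetched):
    return [a + b for a, b in zip(batch, fetched)]   # stand-in for the ALU work

def render_batches(batches):
    pending = issue_texture_fetches(batches[0])      # prime the pipeline
    results = []
    for i, batch in enumerate(batches):
        nxt = issue_texture_fetches(batches[i + 1]) if i + 1 < len(batches) else None
        results.append(shade_with(batch, pending))   # overlaps the next batch's "fetch"
        pending = nxt
    return results

print(render_batches([[1, 2], [3, 4], [5, 6]]))
[/code]

Restructuring your code into that shape around every real stall is presumably the "fiberize your code" part that would be painful to do by hand.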
 
This sounds perfect for the next Xbox, especially the lower bandwidth requirements. By 2012 they should be on 32nm with 64+ cores. Hopefully MS won't be so militant about owning the IP that they pass up a good thing.
 
Another caveat wrt the FPS numbers.. there's no indication of what the graphics quality settings are, or whether they're using maximum settings. Maybe someone can clarify that at Siggraph..

The only benchmark I could find for Gears with a quick google was one for the 8800GTX 768MB - it was averaging 58 fps at the same resolution, at maximum settings, with 16xAF.
 