Larrabee at Siggraph

Intel's strand == Nvidia's thread, I'm sure Aaron is happy now :)

Meh, at least Intel gave a full breakdown of the terminology and based it on the actual hardware, instead of picking some random terminology with no basis in their hardware.

But I thought it was Intel Fiber = Nvidia thread.
 
Looking at the chart, if it's accurate, I think performance should go above 60fps with 25 1GHz cores. Some of their sample frames need only 5 cores, for example, to be done in 1/60th of a second... for most of those sample frames the numbers required are far less than 25.

So maybe it's more accurate to say that 25 is needed to prevent it dipping below 60fps.

If I'm understanding it right..
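
To spell out the arithmetic I'm assuming (the frame_fps helper below is just my own illustration, and it assumes perfectly linear core scaling):

# If a frame needs N 1GHz cores to finish in 1/60th of a second, then with C cores
# available it should finish in roughly (N / C) * (1/60) s.
def frame_fps(cores_needed_for_60fps, cores_available):
    return 60.0 * cores_available / cores_needed_for_60fps

print(frame_fps(5, 25))    # a light sample frame: ~300fps on 25 cores
print(frame_fps(25, 25))   # the heaviest frames pin you at 60fps -- hence "needs 25 cores"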

Hmm, I thought that's what I said. :LOL:

But yeah, if you're willing to let performance have momentary dips below 60fps, then your average fps would be much higher.

But then maybe with all their flexibility they'd go off and do something else with those cores just then. :???:
 
Well, unless there's a prob with it, here's my rough ruler-based estimate of the number of 1GHz Larrabee cores needed for each of the 25 sample frames for Gears of War, to render them at 60fps:

23
13
12
23
12-13
12
12
17
17
22
22
15
12
14
15
24
20
23
14
15
17
17
16
17
15

Sorry it's not precise, I didn't want to start guesstimating decimal points etc. Based on my numbers, for this set of 25 sample frames, some stats:

Average: 16.78
Min: 12
Max: 24
Median: 16
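
In case anyone wants to sanity-check, here are the same numbers as a quick Python snippet (treating the 12-13 frame as 12.5):

import statistics

# my eyeballed per-frame core estimates for the 25 Gears of War sample frames
gears = [23, 13, 12, 23, 12.5, 12, 12, 17, 17, 22, 22, 15, 12,
         14, 15, 24, 20, 23, 14, 15, 17, 17, 16, 17, 15]

print(statistics.mean(gears))    # 16.78
print(min(gears), max(gears))    # 12 24
print(statistics.median(gears))  # 16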

Hmm, I thought that's what I said. :LOL:

Soz, yeah, for some reason I read maximum for minimum in your post! :|
 
Interesting Titanio :)
But I guess you forgot to mention the resolution, no?

EDIT
Thanks Geo. I'm stupid... I guess Titanio based his figures on the F.E.A.R. results (@1600x1200)... :oops: I should have thought twice before asking stupid things.
 
Of course, it is worth noting these are 1GHz cores... and most people seem to be expecting north of 2GHz by the time it is shipping in retail.
 
F.E.A.R. 1600x1200 4x AA (I presume that's what 4 samples means).. again, estimates based on the chart:

9.5
11.5
14
14
16.5
19
16
13
14
12
15
15.5
15
14.5
7
16
12.5
10.5
8
9
8
7
26
15
14

(I used 0.5 increments here..)

Average: 13.3
Min: 7
Max: 26
Median: 14
 
Well, unless there's a prob with it, here's my rough ruler-based estimate of the number of 1GHz Larrabee cores needed for each of the 25 sample frames for Gears of War, to render them at 60fps:

23
13
12
23
12-13
12
12
17
17
22
22
15
12
14
15
24
20
23
14
15
17
17
16
17
15

Sorry it's not precise, I didn't want to start guesstimating decimal points etc. Based on my numbers, for this set of 25 sample frames, some stats:

Average: 16.78
Min: 12
Max: 24
Median: 16



Soz, yeah, for some reason I read maximum for minimum in your post! :|

How did you calculate that? :?:
 
Figure 10 in the paper has a data point for each game's sample frames, showing how many 1GHz cores would be required to render that frame in 1/60th of a second. The y-axis is only labelled in increments of 5 cores, though, so I'm having to eyeball some of the numbers.

I don't think I'll do HL2.. its distribution is a bit less all over the place, ranging between maybe 6 and 10..
 
Of course, it is worth noting these are 1GHz cores... and most people seem to be expecting north of 2GHz by the time it is shipping in retail.

Given that Intel is confidently going to churn out a 700+ million transistor 45nm die (Nehalem) this year, that GPUs are seemingly always bigger than CPUs, and that any defective cores can be fused off and the die sold off as a different SKU.. I think there will be a lot more than 32 cores on this thing. It reminds me of the tactic AMD used to make everyone think RV770 would be mildly more powerful than the RV670. :cool:
 
Based on this limited data set of simulated results, anyone care to hazard a guess how it might compare to some GPUs we know of today? :LOL:

Say, a 32-core 2GHz variant? Assuming linear scaling with clock speed and core count? (Intel claims near-linear scaling with cores in these games, at least..)
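
To put some very rough numbers on my own question (assuming that linear scaling holds, which is a big if, and using the eyeballed per-frame estimates above):

# fps for a frame that needs n 1GHz cores to hit 60fps, on a hypothetical 32-core 2GHz part
def projected_fps(n, cores=32, clock_ghz=2.0):
    return 60.0 * cores * clock_ghz / n

# Gears of War sample frames: worst ~24 cores, average ~16.78
print(projected_fps(24), projected_fps(16.78))   # ~160fps worst, ~229fps average
# F.E.A.R. 1600x1200 4xAA sample frames: worst 26, average ~13.3
print(projected_fps(26), projected_fps(13.3))    # ~148fps worst, ~289fps average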
 
Since I don't have a subscription, I have to ask: do they indicate how they pick their sample frames? Is this running on some kind of Larrabee simulator?

I guess it's too early for a histogram or somesuch, but the hairs on the back of my neck stand up whenever I see the prospect of "selected" frames.
 
It's running the same simulator they use to simulate CPU archs under development. I don't know how kosher this is, but here goes.

We captured the frames by intercepting the DirectX 9 command stream being sent to a conventional graphics card while the game was played at a normal speed, along with the contents of textures and surfaces at the start of the frame. We tested them through a functional model to ensure the algorithms were correct and that the right images were produced. Next, we estimated the cost of each section of code in the functional model, being aggressively pessimistic, and built a rough profile of each frame. We wrote assembly code for the highest-cost sections, ran it through cycle-accurate simulators, fed the clock cycle results back into the functional model, and re-ran the traces. This iterative cycle of refinement was repeated until 90% of the clock cycles executed during a frame had been run through the simulators, giving the overall profiles a high degree of confidence. Texture unit throughput, cache performance and memory bandwidth limitations were all included in the various simulations.
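
A loose, runnable sketch of that refinement loop as I read it -- not their tooling, and the section names and cycle counts below are made up:

def refine_profile(estimates, measure, target=0.90):
    # estimates: {code section: pessimistic cycle estimate}
    # measure(section): cycle-accurate simulator result for that section
    # keep simulating the highest-cost remaining section until >=90% of the
    # frame's cycles come from the cycle-accurate simulator
    measured = {}
    while estimates:
        total = sum(measured.values()) + sum(estimates.values())
        if sum(measured.values()) / total >= target:
            break
        section = max(estimates, key=estimates.get)   # highest-cost unsimulated section
        measured[section] = measure(section)          # feed simulator cycles back in
        del estimates[section]
    return {**estimates, **measured}                  # the refined per-frame profile

# made-up example: pessimistic estimates shrink once they're "simulated"
guesses = {"bin/setup": 5_000_000, "pixel shading": 12_000_000, "blend/resolve": 2_000_000}
print(refine_profile(dict(guesses), measure=lambda s: int(guesses[s] * 0.7)))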
 
Since I don't have a subscription, I have to ask: do they indicate how they pick their sample frames? Is this running on some kind of Larrabee simulator?

I guess it's too early for a histogram or somesuch, but the hairs on the back of my neck stand up whenever I see the prospect of "selected" frames.

For HL2 they seem to say they took 1 in every 30. For FEAR, 1 in every 100 and for Gears, 1 in every 250. They say "frames are widely separated to catch different scene characteristics as the games progress". They also say they captured frames "while the game was played at normal speed".
 
I'm only flicking through the paper now, and I spy some data on real-time ray tracing with Larrabee. The implementation is C++ with some assembly in key places.

It shows a screenshot of a 1024x1024 frame with 234k triangles, 1 light source, 1 reflection level, and "typically" 4M rays per frame.

They have a performance comparison of a 1GHz Larrabee against an 8-core 2.6GHz Xeon, for varying numbers of Larrabee cores. It's not clear if this is for the same scene shown in the screenshot or not.

But anyway, the Xeon gets something between 10 and 15fps. An 8-core Larrabee gets 21.92fps, 16 cores gets 41.16fps, and 32 gets 71.63fps.

They observe that a Core 2 Duo requires 4.67x more clock cycles than a single Larrabee core for this workload.

edit - sorry, the numbers in the chart are for the scene shown in the screenshot.
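
Playing with those numbers a bit (my arithmetic only, and I'm reading the 4.67x figure as a per-core comparison):

# fps per Larrabee core at 1GHz -- scaling is slightly sub-linear
results = {8: 21.92, 16: 41.16, 32: 71.63}
for cores, fps in results.items():
    print(cores, round(fps / cores, 2))      # 2.74, 2.57, 2.24

# If one Core 2 class core needs ~4.67x the cycles of one Larrabee core, the 8-core
# 2.6GHz Xeon is "worth" roughly 8 * 2.6 / 4.67 = ~4.45 Larrabee GHz-cores, so an
# 8-core 1GHz Larrabee should be ~1.8x faster than it...
print(21.92 / (8 / (8 * 2.6 / 4.67)))        # ...implying the Xeon at ~12.2fps, which
                                             # fits the quoted "between 10 and 15fps"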
 
The way Larrabee's needed core numbers are derived makes it difficult (impossible?) to compare to any current GPU numbers, which are not tested in like manner.

GT200 for FEAR at 1600x1200 with 4x AA had what, a 90-100 FPS average?
It doesn't do a good job of capturing a minimum (or what fraction of resources yields a given minimum), but I don't think the Larrabee measurements do, either.
 
No, they don't.. it's quite plausible the tiny set of frames used here doesn't include the most demanding frame you might come across in a comparable benchmark for the game.. or that it's even necessarily representative of what's typical (although Intel would probably say it is)...

..but..

for this tiny set of frames, assuming linear clock and core scaling, if their simulation is accurate.. a 48-core 2GHz Larrabee would render that set of frames at a minimum of ~240fps.
 
Here's what they say about their benching:

We captured the frames by intercepting the DirectX 9 command stream being sent to a conventional graphics card while the game was played at a normal speed, along with the contents of textures and surfaces at the start of the frame. We tested them through a functional model to ensure the algorithms were correct and that the right images were produced. Next, we estimated the cost of each section of code in the functional model, being aggressively pessimistic, and built a rough profile of each frame. We wrote assembly code for the highest-cost sections, ran it through cycle-accurate simulators, fed the clock cycle results back into the functional model, and re-ran the traces. This iterative cycle of refinement was repeated until 90% of the clock cycles executed during a frame had been run through the simulators, giving the overall profiles a high degree of confidence. Texture unit throughput, cache performance and memory bandwidth limitations were all included in the various simulations.

I like the "aggressively pessimistic" part, but I'm not sure about how reliable the rest is. :smile:
 
The paper is quite interesting, and it's impressive what they can do with software rendering.

It's a shame that they don't have any information about storage requirements for binning. I guess we can infer that they must be less than half the BW per frame. Their method of rasterizing during binning is something I never thought of before because a 64x64 tile w/ 4xAA would potentially need a coverage mask of 16 kbits (!) per triangle, but I guess the subdivision-based rasterization would produce a more efficient coverage mask. Still seems like quite a bit of space per triangle, though, and you have a bunch of post-VS attributes to store per vertex, too.
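
(Spelling out that 16 kbit figure: one coverage bit per sample in the tile.)

tile = 64               # 64x64 tile
samples = 4             # 4xAA
bits = tile * tile * samples
print(bits, bits // 8)  # 16384 bits = 2048 bytes, i.e. 2 KB per triangle if stored raw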

There's an interesting blurb in there about rasterization:
Rasterization is unquestionably more efficient in dedicated logic than in software when running at peak rates, but using dedicated logic has drawbacks for Larrabee. In a modern GPU, the rasterizer is a fine-grain serialization point: all primitives are put back in order before rasterization. Scaling the renderer over large numbers of cores requires eliminating all but the most coarse-grained serialization points. The rasterizer could be designed to allow multiple cores to send it primitives out of order, but this would impose a significant communication expense and would require software to manage contention for the rasterizer resource. A software rasterizer avoids these costs. It also allows rasterization to be parallelized over many cores or moved to multiple different places in the rendering pipeline.
I guess increasing setup speed in GPUs beyond 1 triangle per clock is indeed a tough task, as dealing with fragments from multiple rasterizers would be hard. I have a feeling this is one reason Larrabee is doing well, as it can have each core work on a separate set of primitives (they divide them into chunks of 1,000). They even have a different subsection in each bin to store the results from each core (rough sketch of how I read that below).

I hope ATI/NVidia tackle this issue. I think it's doable.
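
Here's a toy sketch of that binning scheme as I read it: primitives go through in chunks of ~1,000 in submission order, each chunk is binned by one core into the tiles it touches, and every chunk gets its own subsection of each bin, so cores never contend and the back-end can walk the subsections in chunk order to keep approximate submission order. All the names below are mine, not Intel's, and the tile test is just a bounding box.

from collections import defaultdict

TILE = 64     # tile size in pixels
CHUNK = 1000  # primitives per chunk

def tiles_touched(prim, tile=TILE):
    # tiles overlapped by a primitive's bounding box; prim = (x0, y0, x1, y1)
    x0, y0, x1, y1 = prim
    for ty in range(int(y0) // tile, int(y1) // tile + 1):
        for tx in range(int(x0) // tile, int(x1) // tile + 1):
            yield (tx, ty)

def bin_primitives(prims):
    # bins[tile][chunk_idx] is the private subsection written by whichever core
    # processed that chunk -- no locking needed within a subsection
    bins = defaultdict(lambda: defaultdict(list))
    for start in range(0, len(prims), CHUNK):
        for prim in prims[start:start + CHUNK]:
            for tile in tiles_touched(prim):
                bins[tile][start // CHUNK].append(prim)
    return bins

def tile_work(bins, tile):
    # back-end: walk a tile's subsections in chunk order, so primitives come out
    # in roughly submission order without a global serialization point
    return [p for c in sorted(bins[tile]) for p in bins[tile][c]]

prims = [(10, 10, 50, 50), (60, 60, 200, 100), (500, 300, 700, 400)]
bins = bin_primitives(prims)
print(tile_work(bins, (0, 0)))   # primitives touching the top-left 64x64 tile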
 