Larrabee at GDC 09

Well, up until the point where it ceases to function in a stable manner or physically degrades, of course.

That sounds way more reasonable ;) Keep in mind that IHVs also have to respect the thresholds that specific existing and future protocols will set; an external power supply, for instance, would be a ridiculous idea.

They hired quite a few talented people, so I hope that they will come up with something ;)

It sounds hard to find something that would be as lossless as today's algorithms, be very efficient in terms of memory and bandwidth consumption, and do all of that mostly in software.

By the way, you forgot one further major headache for IMRs with MRTs + MSAA: translucency.
 
That sounds way more reasonable ;) Keep in mind that IHVs also have to respect the thresholds that specific existing and future protocols will set; an external power supply, for instance, would be a ridiculous idea.

Well, I was thinking more of a DIY overclocking take on things. Current graphics cards don't have much headroom, not without a bit of soldering anyway.
 
Well, I was thinking more of a DIY overclocking take on things. Current graphics cards don't have much headroom, not without a bit of soldering anyway.

Oh, sort of like the free rein we have with over-volting/overclocking CPUs? Yeah, that would be nice to have in GPU-land too.
 
"Will still not match dedicated hardware peak rates per square millimeter on average"

That doesn't bode well from a performance POV, because it means that even when all of Larrabee's ALU resources are dedicated to rasterization alone, it still cannot match the RBEs of current GPUs.
I don't see how you come to that conclusion. Just because a square millimeter of a programmable chip is slower at rasterization than a square millimeter of dedicated hardware doesn't mean the entire chip is slower than the few square millimeters that a classic GPU spends on rasterization logic.
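To put some entirely made-up numbers on that argument (the areas and the per-mm² ratio below are illustrative assumptions, nothing published by Intel or the IHVs):

```cpp
// Illustrative numbers only: assume dedicated raster hardware is 4x faster per
// square millimeter than Larrabee code, but occupies only ~5 mm^2 of a GPU die.
constexpr double kRasterizerArea_mm2   = 5.0;    // made-up GPU rasterizer area
constexpr double kDedicatedSpeedup     = 4.0;    // made-up per-mm^2 advantage
constexpr double kLarrabeeCoreArea_mm2 = 200.0;  // made-up total area of LRB cores

// Area of Larrabee cores needed to match the dedicated unit's raster rate:
constexpr double kAreaToMatch_mm2 = kRasterizerArea_mm2 * kDedicatedSpeedup;  // = 20 mm^2

// Even losing 4x per mm^2, only a tenth of the (made-up) 200 mm^2 of cores
// would be busy matching the dedicated rasterizer; the rest is free for shading.
static_assert(kAreaToMatch_mm2 < kLarrabeeCoreArea_mm2,
              "a small fraction of the cores matches the dedicated raster rate");
```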
 
It sounds hard to find something that would be as lossless as today's algorithms, be very efficient in terms of memory and bandwidth consumption, and do all of that mostly in software.

By the way, you forgot one further major headache for IMRs with MRTs + MSAA: translucency.
Good point... After thinking a bit more about the problem, I'm also not really sure how Intel intends to deal with the tiles in L2. Their paper showed 3 back-end threads per core, but I doubt that those 3 threads share the same color/depth/MRT tiles; it would be impossible without some kind of synchronization.
 
You can tile the tile.

PS: probably being a bit too pithy there ... what I mean is you can do sort-middle parallelization inside the tile; each thread would get its own subset of quads inside the tile, so synchronization between the threads would not be an issue.
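A minimal sketch of that kind of static split, assuming a 128x128 tile, 4x4-pixel qquads and three worker threads (the numbers and the round-robin assignment are just illustrative, not anything from Intel):

```cpp
// "Tiling the tile": each qquad position inside the tile is statically owned by
// exactly one worker thread, so two threads never write the same pixels and no
// locking is needed on the tile's color/depth data.
constexpr int kTileSize      = 128;                     // pixels per tile edge (assumed)
constexpr int kQquadSize     = 4;                       // qquad = 4x4 = 16 pixels
constexpr int kQquadsPerRow  = kTileSize / kQquadSize;  // 32
constexpr int kWorkerThreads = 3;                       // back-end worker threads (assumed)

// Which worker thread shades the qquad at (qx, qy) within the tile.
constexpr int ownerThread(int qx, int qy) {
    return (qy * kQquadsPerRow + qx) % kWorkerThreads;
}
```

A fixed split like this trades synchronization for load balance, which is basically the objection in the replies below.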
 
You can tile the tile.

PS: probably being a bit too pithy there ... what I mean is you can do sort-middle parallelization inside the tile; each thread would get its own subset of quads inside the tile, so synchronization between the threads would not be an issue.
Not a good idea imho; you could easily end up with only one or two threads doing any real work.
 
Easily maybe, but as long as it's not often it doesn't matter. You can get utilization almost arbitrarily high by increasing queue size.
 
Tile size is chosen so that the target surfaces in the RTset for that tile will all fit in a core’s L2 cache. Thus an RTset with many color channels, or with large high-precision data formats, will use a smaller tile size than one with fewer or low-precision channels. To simplify the code, tiles are usually square and a power-of-two in size, typically ranging in size from 32x32 to 128x128. An application with 32-bit depth and 32-bit color can use a 128x128 tile and fill only half of the core’s 256KB L2 cache subset.

[...]

Figure 8 shows a back-end implementation that makes effective use of multiple threads that execute on a single core. A setup thread reads primitives for the tile. Next, the setup thread interpolates per-vertex parameters to find their values at each sample. Finally, the setup thread issues pixels to the work threads in groups of 16 that we call a qquad. The setup thread uses scoreboarding to ensure that qquads are not passed to the work threads until any overlapping pixels have completed processing.

The three work threads perform all remaining pixel processing, including pre-shader early Z tests, the pixel shader, regular late Z tests, and post-shader blending. Modern GPUs use dedicated logic for post-shader blending, but Larrabee uses the VPU.

[...]

One remaining issue is texture co-processor accesses, which can have hundreds of clocks of latency. This is hidden by computing multiple qquads on each hardware thread. Each qquad’s shader is called a fiber. The different fibers on a thread co-operatively switch between themselves without any OS intervention. A fiber switch is performed after each texture read command, and processing passes to the other fibers running on the thread. Fibers execute in a circular queue. The number of fibers is chosen so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing.
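To illustrate the fiber scheme described above, here's a very rough sketch of the round-robin structure; the types and function names are made up, and the real implementation re-enters compiled shader code at a saved resume point rather than calling a placeholder:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; not Intel's actual data structures.
struct Qquad { /* state for a 4x4 block of pixels */ };

struct Fiber {
    Qquad qquad;         // the pixels this fiber is shading
    bool  done = false;  // set once the shader for this qquad has finished
};

// Placeholder: run the shader for 'f' until it issues its next texture read,
// then return (i.e. cooperatively yield).
void runUntilNextTextureRead(Fiber& f) {
    f.done = true;  // placeholder body: pretend the shader ran to completion
}

// One hardware thread's loop: fibers take turns in a circular queue. With N
// fibers in flight, each texture request gets roughly N-1 fibers' worth of
// work executed before its owner runs again, which is what hides the hundreds
// of clocks of texture latency.
void shadeQquads(std::vector<Fiber>& fibers) {
    std::size_t live = fibers.size();
    for (std::size_t i = 0; live > 0; i = (i + 1) % fibers.size()) {
        Fiber& f = fibers[i];
        if (f.done)
            continue;
        runUntilNextTextureRead(f);  // yield point: switch after each texture read
        if (f.done)
            --live;
    }
}
```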
Jawed
 
Well, I was thinking more of a DIY overclocking take on things. Current graphics cards don't have much headroom, not without a bit of soldering anyway.

If I don't interpret Abrash's comments the wrong way, it sounds to me like LRB's number of cores as well as their frequencies are being kept within specific boundaries so as not to end up with insane power consumption values. Overclocking such a variant doesn't sound like a problem to me. As for the percentage of overclockability, I guess we shouldn't forget that LRB is still a GPU and not a CPU, meaning it'll most likely come down to how tolerant of much higher frequencies the texture co-processor (as a fixed-function hardware example) might be in the end. Of course someone could say that they might allow the driver to increase only the ALU clock domain, but on a typical GPU you hardly get nearly linear performance scaling unless you increase the frequency of most if not all parts of it.

What I had in mind all this time would have been a way higher number of cores at way higher than currently projected frequencies, in order to reach/exceed competing future GPUs' gaming performance.
 
What sort of sp and dp floating point performance can we expect from Larrabee? Will it be IEEE 754 compliant?

In theory it can do 16 single-precision IEEE float ops per cycle per core + the FPU. As the float op can be an FMAC, that's 32 per core per cycle (we will ignore the FPU as noise).

For a 32-core Larrabee, that's 1024 per cycle, AKA 1 teraflop per GHz.

DP is a bit slower
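For the arithmetic behind that figure (counting an FMAC as two flops and ignoring the scalar FPU as above; the 32-core, 1 GHz configuration is just the example being used, not a confirmed part):

```cpp
// 16-wide VPU, one FMAC per lane per clock, an FMAC counted as 2 flops.
constexpr int    kVpuLanes     = 16;
constexpr int    kFlopsPerFmac = 2;
constexpr int    kCores        = 32;   // example core count, not a confirmed SKU
constexpr double kClockGHz     = 1.0;  // example clock

// 16 * 2 * 32 = 1024 flops per clock, i.e. roughly 1 teraflop per GHz of core clock.
constexpr double kSinglePrecisionGflops =
    kVpuLanes * kFlopsPerFmac * kCores * kClockGHz;  // = 1024 GFLOPS at 1 GHz
```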
 
The numbers last shown would indicate DP is roughly 1/2 as fast. A bit more than "a bit slower", but still a sight better than the 1/5 to 1/12 rate that GPU DP is stuck at.
 
It should but is Intel really gonna get that kinda density out of the gate? Given Abrash's (perhaps meaningless) comment of above 1 teraflop I'm guessing it's going to be closer to 16 cores than it is to 48.
 
http://www.pcper.com/article.php?aid=683

Forsyth also mentioned a couple of near-term features that Larrabee will offer the rasterization pipeline. They include render-target reads (shaders that can read and write the current target to enable more custom blends), demand-page texturing (ability to read from a texture when not all of it exists in memory) and order-independent translucency (allows translucent objects to be rendered just like any other surface type while the GPU does ordering for lighting effects, etc). These are features today that COULD be done with standard GPUs today but have inherent performance penalties for doing so.

This was supposed to be the big new feature of DX10 (even the most important one, according to Tim Sweeney, for example). More than two years have passed, and paged texture memory is still a promise for the future?
 
Maybe it's political? The GPU manufacturers would rather sell cards with ever more memory ...

WDDM v2 seems to have quietly disappeared.
 
It should but is Intel really gonna get that kinda density out of the gate? Given Abrash's (perhaps meaningless) comment of above 1 teraflop I'm guessing it's going to be closer to 16 cores than it is to 48.
Either that, or it's closer to 1 GHz than 2 GHz... or a little mix of less-than-expected in both. I would have figured that even for Intel, figures like 32 cores at 2 GHz seem a bit out of reach. They're not miracle-workers, and they still need to try and produce something competitive at a competitive price point. Commenting that you'll apparently see 1 TFLOP only says that they don't want to jump the gun by saying they'll get 32 cores @ 2 GHz for sure.

Besides which, I'd have to raise an eyebrow at how they plan to keep something like that fed all the time. Maybe it's just my indefatigable pessimism at work here. Both Abrash's and Forsyth's talks tended to focus on eking out performance at the level of individual code blocks and/or individual cores without going very deeply into things like thread scheduling. That wasn't really the topic anyway, so it wasn't really a problem for those talks. I just wish that at some point or other we had some more hard "big picture" info. Though I think Intel doesn't really have it either.
 
http://www.pcper.com/article.php?aid=683

...and order-independent translucency (allows translucent objects to be rendered just like any other surface type while the GPU does ordering for lighting effects, etc). These are features today that COULD be done with standard GPUs today but have inherent performance penalties for doing so.
I don't think there was any performance penalty, but order-independent translucency was removed from PC PowerVR devices because virtually no developer was using it. <shrug>
 