Larrabee at GDC 09

Well, up until the point where it ceases to function in a stable manner or physically degrades, of course.

That sounds way more reasonable ;) Keep in mind that IHVs also have to respect the thresholds that specific existing and future protocols will set; an external power supply, for instance, would be a ridiculous idea.

They hired quite a few talented people, so I hope that they will come up with something ;)

It sounds hard to find something that would be as lossless as today's algorithms, be very efficient in terms of memory and bandwidth consumption, and do all of that mostly in software.

By the way, you forgot one further major headache for IMRs with MRTs + MSAA: translucency.
 
That sounds way more reasonable ;) Keep in mind that IHVs also have to respect the thresholds that specific existing and future protocols will set; an external power supply, for instance, would be a ridiculous idea.

Well, I was thinking more of a DIY overclocking take on things. Current graphics cards don't have much headroom, not without a bit of soldering anyway.
 
Well, I was thinking more of a DIY overclocking take on things. Current graphics cards don't have much headroom, not without a bit of soldering anyway.

Oh, sort of like the free rein we have with over-volting/overclocking CPUs? Yeah, that would be nice to have in GPU-land too.
 
"Will still not match dedicated hardware peak rates per square millimeter on average"

That doesn't bode well from a performance POV, because it means that even when all of Larrabee's ALU resources are dedicated to rasterization alone, it still cannot match the RBEs of current GPUs.
I don't see how you come to that conclusion. Just because a square millimeter of a programmable chip is slower at rasterization than a square millimeter of dedicated hardware doesn't mean the entire chip is slower than the few square millimeters that a classic GPU spends on rasterization logic.
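To put some entirely made-up numbers on that argument (the areas and the per-mm² ratio below are illustrative assumptions, nothing published by Intel or the IHVs):

```cpp
// Illustrative numbers only: assume dedicated raster hardware is 4x faster per
// square millimeter than Larrabee code, but occupies only ~5 mm^2 of a GPU die.
constexpr double kRasterizerArea_mm2   = 5.0;    // made-up GPU rasterizer area
constexpr double kDedicatedSpeedup     = 4.0;    // made-up per-mm^2 advantage
constexpr double kLarrabeeCoreArea_mm2 = 200.0;  // made-up total area of LRB cores

// Area of Larrabee cores needed to match the dedicated unit's raster rate:
constexpr double kAreaToMatch_mm2 = kRasterizerArea_mm2 * kDedicatedSpeedup;  // = 20 mm^2

// Even losing 4x per mm^2, only a tenth of the (made-up) 200 mm^2 of cores
// would be busy matching the dedicated rasterizer; the rest is free for shading.
static_assert(kAreaToMatch_mm2 < kLarrabeeCoreArea_mm2,
              "a small fraction of the cores matches the dedicated raster rate");
```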
 
It sounds hard to find something that would be as lossless as today's algorithms, be very efficient in terms of memory and bandwidth consumption, and do all of that mostly in software.

By the way, you forgot one further major headache for IMRs with MRTs + MSAA: translucency.
Good point... After thinking a bit more about the problem, I'm also not really sure how Intel intends to deal with the tiles in L2. Their paper showed 3 back-end threads per core, but I doubt that those 3 threads share the same color/depth/MRT tiles; it would be impossible without some kind of synchronization.
 
You can tile the tile.

PS: probably being a bit too pithy there ... what I mean is you can do sort-middle parallelization inside the tile; each thread would get its own subset of quads inside the tile, so synchronization between the threads would not be an issue.
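A minimal sketch of that kind of static split, assuming a 128x128 tile, 4x4-pixel qquads and three worker threads (the numbers and the round-robin assignment are just illustrative, not anything from Intel):

```cpp
// "Tiling the tile": each qquad position inside the tile is statically owned by
// exactly one worker thread, so two threads never write the same pixels and no
// locking is needed on the tile's color/depth data.
constexpr int kTileSize      = 128;                     // pixels per tile edge (assumed)
constexpr int kQquadSize     = 4;                       // qquad = 4x4 = 16 pixels
constexpr int kQquadsPerRow  = kTileSize / kQquadSize;  // 32
constexpr int kWorkerThreads = 3;                       // back-end worker threads (assumed)

// Which worker thread shades the qquad at (qx, qy) within the tile.
constexpr int ownerThread(int qx, int qy) {
    return (qy * kQquadsPerRow + qx) % kWorkerThreads;
}
```

A fixed split like this trades synchronization for load balance, which is basically the objection in the replies below.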
 
You can tile the tile.

PS: probably being a bit too pithy there ... what I mean is you can do sort-middle parallelization inside the tile; each thread would get its own subset of quads inside the tile, so synchronization between the threads would not be an issue.
Not a good idea imho; you could easily end up with only one or two threads doing any real work.
 
Easily maybe, but as long as it's not often it doesn't matter. You can get utilization almost arbitrarily high by increasing queue size.
 
Tile size is chosen so that the target surfaces in the RTset for that tile will all fit in a core’s L2 cache. Thus an RTset with many color channels, or with large high-precision data formats, will use a smaller tile size than one with fewer or low-precision channels. To simplify the code, tiles are usually square and a power-of-two in size, typically ranging in size from 32x32 to 128x128. An application with 32-bit depth and 32-bit color can use a 128x128 tile and fill only half of the core’s 256KB L2 cache subset.

[...]

Figure 8 shows a back-end implementation that makes effective use of multiple threads that execute on a single core. A setup thread reads primitives for the tile. Next, the setup thread interpolates per-vertex parameters to find their values at each sample. Finally, the setup thread issues pixels to the work threads in groups of 16 that we call a qquad. The setup thread uses scoreboarding to ensure that qquads are not passed to the work threads until any overlapping pixels have completed processing.

The three work threads perform all remaining pixel processing, including pre-shader early Z tests, the pixel shader, regular late Z tests, and post-shader blending. Modern GPUs use dedicated logic for post-shader blending, but Larrabee uses the VPU.

[...]

One remaining issue is texture co-processor accesses, which can have hundreds of clocks of latency. This is hidden by computing multiple qquads on each hardware thread. Each qquad’s shader is called a fiber. The different fibers on a thread co-operatively switch between themselves without any OS intervention. A fiber switch is performed after each texture read command, and processing passes to the other fibers running on the thread. Fibers execute in a circular queue. The number of fibers is chosen so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing.
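To illustrate the fiber scheme described above, here's a very rough sketch of the round-robin structure; the types and function names are made up, and the real implementation re-enters compiled shader code at a saved resume point rather than calling a placeholder:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; not Intel's actual data structures.
struct Qquad { /* state for a 4x4 block of pixels */ };

struct Fiber {
    Qquad qquad;         // the pixels this fiber is shading
    bool  done = false;  // set once the shader for this qquad has finished
};

// Placeholder: run the shader for 'f' until it issues its next texture read,
// then return (i.e. cooperatively yield).
void runUntilNextTextureRead(Fiber& f) {
    f.done = true;  // placeholder body: pretend the shader ran to completion
}

// One hardware thread's loop: fibers take turns in a circular queue. With N
// fibers in flight, each texture request gets roughly N-1 fibers' worth of
// work executed before its owner runs again, which is what hides the hundreds
// of clocks of texture latency.
void shadeQquads(std::vector<Fiber>& fibers) {
    std::size_t live = fibers.size();
    for (std::size_t i = 0; live > 0; i = (i + 1) % fibers.size()) {
        Fiber& f = fibers[i];
        if (f.done)
            continue;
        runUntilNextTextureRead(f);  // yield point: switch after each texture read
        if (f.done)
            --live;
    }
}
```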
Jawed
 
Well, I was thinking more of a DIY overclocking take on things. Current graphics cards don't have much headroom, not without a bit of soldering anyway.

If I don't interpret Abrash's comments the wrong way, it sounds to me like LRB's number of cores as well as their frequencies are being kept within specific boundaries so as not to end up with insane power consumption values. Overclocking such a variant doesn't sound like a problem to me. As for the percentage of overclockability, I guess we shouldn't forget that LRB is still a GPU and not a CPU, meaning it'll most likely come down to how tolerant of much higher frequencies the texture co-processor (as a fixed-function hardware example) might be in the end. Of course someone could say that they might allow the driver to increase only the ALU clock domain, but on a typical GPU you hardly get nearly linear performance scaling unless you increase the frequency of most if not all parts of it.

What I had in mind all this time would have been a way higher number of cores at way higher than currently projected frequencies, in order to reach/exceed competing future GPUs' gaming performance.
 
What sort of sp and dp floating point performance can we expect from Larrabee? Will it be IEEE 754 compliant?

In theory it can do 16 single-precision IEEE float ops per cycle per core + the FPU. As the float op can be an FMAC, that's 32 per core per cycle (we will ignore the FPU as noise).

For a 32-core Larrabee, that's 1024 per cycle, AKA 1 teraflop per GHz.

DP is a bit slower
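For the arithmetic behind that figure (counting an FMAC as two flops and ignoring the scalar FPU as above; the 32-core, 1 GHz configuration is just the example being used, not a confirmed part):

```cpp
// 16-wide VPU, one FMAC per lane per clock, an FMAC counted as 2 flops.
constexpr int    kVpuLanes     = 16;
constexpr int    kFlopsPerFmac = 2;
constexpr int    kCores        = 32;   // example core count, not a confirmed SKU
constexpr double kClockGHz     = 1.0;  // example clock

// 16 * 2 * 32 = 1024 flops per clock, i.e. roughly 1 teraflop per GHz of core clock.
constexpr double kSinglePrecisionGflops =
    kVpuLanes * kFlopsPerFmac * kCores * kClockGHz;  // = 1024 GFLOPS at 1 GHz
```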
 
The numbers last shown would indicate DP is roughly 1/2 as fast. A bit more than "a bit slower", but still a sight better than the 1/5 to 1/12 rate that GPU DP is stuck at.
 
It should but is Intel really gonna get that kinda density out of the gate? Given Abrash's (perhaps meaningless) comment of above 1 teraflop I'm guessing it's going to be closer to 16 cores than it is to 48.
 
http://www.pcper.com/article.php?aid=683

Forsyth also mentioned a couple of near-term features that Larrabee will offer the rasterization pipeline. They include render-target reads (shaders that can read and write the current target to enable more custom blends), demand-page texturing (ability to read from a texture when not all of it exists in memory) and order-independent translucency (allows translucent objects to be rendered just like any other surface type while the GPU does ordering for lighting effects, etc). These are features today that COULD be done with standard GPUs today but have inherent performance penalties for doing so.

This was supposed to be the big new feature of DX10 (even the most important one, according to Tim Sweeney, for example). More than two years have passed, and paged texture memory is still a promise for the future?
 
Maybe it's political? The GPU manufacturers would rather sell cards with ever more memory ...

WDDM v2 seems to have quietly disappeared.
 
It should but is Intel really gonna get that kinda density out of the gate? Given Abrash's (perhaps meaningless) comment of above 1 teraflop I'm guessing it's going to be closer to 16 cores than it is to 48.
Either that, or it's closer to 1 GHz than 2 GHz... or a little mix of less-than-expected in both. I would have figured that even for Intel, figures like 32 cores at 2 GHz seem a bit out of reach. They're not miracle-workers, and they still need to try and produce something competitive at a competitive price point. Commenting that you'll apparently see 1 TFLOP only says that they don't want to jump the gun by saying they'll get 32 cores @ 2 GHz for sure.

Besides which, I'd have to raise an eyebrow at how they plan to keep something like that fed all the time. Maybe it's just my indefatigable pessimism at work here. Both Abrash's and Forsyth's talks tended to focus on eking out performance at the level of individual code blocks and/or individual cores without going very deeply into things like thread scheduling. That wasn't really the topic anyway, so it wasn't really a problem for those talks. I just wish that at some point or other we had some more hard "big picture" info. Though I think Intel doesn't really have it either.
 
http://www.pcper.com/article.php?aid=683

...and order-independent translucency (allows translucent objects to be rendered just like any other surface type while the GPU does ordering for lighting effects, etc). These are features today that COULD be done with standard GPUs today but have inherent performance penalties for doing so.
I don't think there was any performance penalty, but order-independent translucency was removed from PC PowerVR devices because virtually no developer was using it. <shrug>
 