Larrabee at GDC 09

I'm really curious to see how/if LRB scales downward. With all the talk of doing rasterization and triangle setup in software that's gotta slow things down tremendously when the architecture is stripped down to an entry level configuration. I'm assuming that those fixed function bits don't scale much in either direction in current GPU families.

When you say downwards do you mean something like SoC level for notebook markets or even small form factor markets like mobile/PDA/handheld?

If yes, then I'd imagine that the power profile of the current architecture would make it extremely difficult to consider such opportunities.
 
I'm really curious to see how/if LRB scales downward. With all the talk of doing rasterization and triangle setup in software that's gotta slow things down tremendously when the architecture is stripped down to an entry level configuration. I'm assuming that those fixed function bits don't scale much in either direction in current GPU families.
Slower GPUs, at lower resolution, will have fewer pixels to rasterise :LOL:

Also their rasterisation technique - which hierarchically searches for tiles that are fully enclosed by the triangle (progressively smaller and smaller tiles in each iteration) and then handles the straggly edge pixels as a "special case" once down to 4x4 areas of pixels (or MSAA samples) - is highly parallelisable in the first phase. The straggly pixels/samples phase costs relatively little, so I really don't see this as a problem.
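
For anyone who hasn't read Abrash's article, here's a rough toy sketch of that descend-and-classify idea (Python, nothing to do with Larrabee's actual code - the 64/16/4 tile hierarchy comes from the thread, everything else is made up for illustration): classify a tile against the triangle's edge functions, trivially accept/reject where possible, and only do per-pixel tests for the partially covered 4x4 leaves.

Code:
def edge(ax, ay, bx, by, px, py):
    # Signed edge function: >= 0 means (px, py) is on the inside of edge a->b
    # for a counter-clockwise triangle.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def point_inside(tri, px, py):
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (edge(x0, y0, x1, y1, px, py) >= 0 and
            edge(x1, y1, x2, y2, px, py) >= 0 and
            edge(x2, y2, x0, y0, px, py) >= 0)

def classify_tile(tri, x, y, size):
    """Return 'accept', 'reject' or 'partial' for a size x size tile at (x, y)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    corners = [(x, y), (x + size, y), (x, y + size), (x + size, y + size)]
    edges = [((x0, y0), (x1, y1)), ((x1, y1), (x2, y2)), ((x2, y2), (x0, y0))]
    fully_inside = True
    for (ax, ay), (bx, by) in edges:
        vals = [edge(ax, ay, bx, by, cx, cy) for cx, cy in corners]
        if all(v < 0 for v in vals):
            return 'reject'        # whole tile is outside this edge
        if any(v < 0 for v in vals):
            fully_inside = False   # this edge cuts through the tile
    return 'accept' if fully_inside else 'partial'

def rasterise(tri, x, y, size, covered):
    status = classify_tile(tri, x, y, size)
    if status == 'reject':
        return                     # trivially rejected: nothing to do
    if status == 'accept':
        covered.update((px, py) for py in range(y, y + size)
                                 for px in range(x, x + size))
        return                     # trivially accepted: all pixels covered
    if size == 4:
        # The "straggly" edge pixels: per-pixel (or per-MSAA-sample) tests.
        for py in range(y, y + size):
            for px in range(x, x + size):
                if point_inside(tri, px + 0.5, py + 0.5):
                    covered.add((px, py))
        return
    # Partially covered tile: split into 16 sub-tiles (64 -> 16 -> 4) and recurse.
    step = size // 4
    for sy in range(y, y + size, step):
        for sx in range(x, x + size, step):
            rasterise(tri, sx, sy, step, covered)

covered = set()
rasterise(((5, 5), (60, 10), (30, 55)), 0, 0, 64, covered)
print(len(covered), "pixels of the 64x64 tile covered")

Each level splits into 16 sub-tiles, which maps naturally onto the 16-wide vectors - presumably exactly why the hierarchy divides by 16 at every step.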

I suspect Larrabee could be some medium-sized die (200mm2 ish) that's solely speed-binned - perhaps with some memory-interface binning in the same way that NVidia bins for memory interface.

There may not be much speed-binning though as Intel repeatedly claims that they're aiming for the process-clocks sweetspot, which appears to be somewhere in the 1 to 2GHz region. I wonder if Atom clock speeds are a clue here?

Jawed
 
I'm wondering if the low clocks are a conscious choice or simply a hard limit they encountered on the first chips.

If it's just for performance per watt considerations I wonder how much headroom there is for those of us who couldn't care less about power and heat concerns. :devilish:
 
Why? It simply rebalances itself. There are no bottlenecks, just hotspots. As long as there is very high utilization of the cores, that's more optimal than a GPU (where it's either a bottleneck or wasted silicon - often both).
That's true, but from Abrash's slides I believe we'll be a bit underwhelmed by this ability - see slide 114 for example (he's talking about rasterization):

"Will still not match dedicated hardware peak rates per square millimeter on average"

That doesn't bode well from a performance POV because it means that when all of Larrabee's ALU resources are dedicated to rasterization alone it still cannot match the RBEs of current GPUs.

Another thing which bugs me as far as rasterization goes is MSAA. I really believe that Larrabee should employ different AA techniques compared to the GPUs, because GPU-style MSAA is not going to work very well on it. Take for example the Starcraft II paper, as that's probably the kind of code that Larrabee will need to run.

The engine draws into 4 MRTs with four FP16 channels each - that's 32 bytes worth of data per pixel, or 128 KiB for a 64x64 tile, which fits in Larrabee's L2 (256 KiB) at its default tile size. But say we want 4xMSAA: now we're talking about 512 KiB worth of data, so either you have to reduce the tile size or draw to one MRT at a time in a multi-pass fashion (ouch!).

Abrash's description of the Larrabee rasterizer shows it working on 64x64 macro tiles made of 16 16x16 sub-tiles, each made of 16 4x4 sub-sub-tiles, and so on, because of the 16-element vector width. So to fit the 4xMSAA MRTs in the L2 you either shrink the tile size to 32x32 - which cannot be divided into 16 16x16 sub-tiles, reducing the efficiency of the rasterizer (it would be working on 8x8 tiles with less chance of trivially rejecting/accepting a tile) - or go to much smaller 16x16 tiles, which would increase the time spent binning the polygons quite a bit. Maybe, considering how the rasterizer works, a 64x16 tile might be a solution to this problem if they can support non-square tiles... Anyway it seems to me that the MSAA cost on Larrabee is going to be higher than the 'simple' fill-rate increase one would expect.
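
To make the arithmetic concrete, here's a trivial back-of-the-envelope check (Python, purely illustrative; it ignores depth/stencil and any other per-tile state, which would only make things worse):

Code:
# 4 MRTs x 4 FP16 channels = 4 * 4 * 2 = 32 bytes per pixel (per sample with MSAA).
L2_BYTES = 256 * 1024                 # per-core L2 assumed in the discussion above
BYTES_PER_SAMPLE = 4 * 4 * 2          # MRTs * channels * sizeof(FP16)

def tile_footprint(width, height, msaa=1):
    return width * height * msaa * BYTES_PER_SAMPLE

for w, h, msaa in [(64, 64, 1), (64, 64, 4), (32, 32, 4), (16, 16, 4), (64, 16, 4)]:
    kib = tile_footprint(w, h, msaa) // 1024
    verdict = "fits" if kib * 1024 <= L2_BYTES else "does NOT fit"
    print(f"{w}x{h} tile @ {msaa}xMSAA: {kib} KiB -> {verdict} in a 256 KiB L2")

Which is just the same conclusion in numbers: with 4xMSAA either the tile shrinks (32x32, 16x16, or the non-square 64x16) or the MRTs have to be split across passes.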
 
I suspect Larrabee could be some medium-sized die (200mm2 ish) that's solely speed-binned - perhaps with some memory-interface binning in the same way that NVidia bins for memory interface.
Larrabee is going to contain an awful lot of logic so they are going to either do salvage-binning or put some kind of redundancy as yields are certainly going to be lower than their large-cache CPU dies.
 
Another thing which bugs me as far as rasterization goes is MSAA. I really believe that Larrabee should employ different AA techniques compared to the GPUs, because GPU-style MSAA is not going to work very well on it.

That raises the question: what kind of different AA techniques, exactly?
 
Larrabee is going to contain an awful lot of logic so they are going to either do salvage-binning or put some kind of redundancy as yields are certainly going to be lower than their large-cache CPU dies.
Yeah, I stupidly forgot that de-activating cores is also an easy way to bin.

Jawed
 
When you say downwards do you mean something like SoC level for notebook markets or even small form factor markets like mobile/PDA/handheld?

I was actually thinking entry level discrete ~$100. Even current DirectX 10 architectures might not be the best fit for embedded platforms, as the scheduling and load-balancing overhead for a relatively small number of all-purpose ALUs may be less power efficient than a more DirectX 9 approach with dedicated bits.

Abrash touched on performance/mm^2 but I didn't really get a feel for how many cores it would take to match the absolute performance of a dedicated rasterizer. The premise for going with software is that you can make up for any lost performance on a given task (rasterization) by having more of the chip available to run other tasks (pixel shading etc). In general, I'm wondering whether this becomes unfeasible with smaller configurations with fewer processors as absolute rasterization performance takes a dive. And this of course doesn't just apply to LRB because it seems this is the direction everybody will take eventually.

Will still not match dedicated hardware peak rates per square millimeter on average:
- Efficient enough, and avoids dedicating area and design effort for a narrow purpose
- Generality improves overall perf for a wide range of tasks
- For example, can bring more mm^2 to bear – the whole chip!

Jawed, you're right... there will be fewer pixels to rasterize in most cases. But unlike today, where you can probably run older shader-light games at relatively high res on low-end hardware, this will become less of an option when rasterization also runs on the shaders. And there are things like tessellation to worry about too (although hopefully tessellating games allow us to scale geometry LOD as an option).
 
"Will still not match dedicated hardware peak rates per square millimeter on average"
But when is peak rate relevant?

Things like generating a shadow map (with a very simple pixel shader) which use "peak rasterisation rate" will run faster on Larrabee because what else are the ALUs doing? :LOL: They're generating triangles, binning and depth-testing basically. The rasterisation rate should be very nice.

That doesn't bode well from a performance POV because it means that when all of Larrabee's ALU resources are dedicated to rasterization alone it still cannot match the RBEs of current GPUs.
Even if it's "all", it's only instantaneously "all" - queues and load-balancing will see to that.

Another thing which bugs me as far as rasterization goes is MSAA. I really believe that Larrabee should employ different AA techniques compared to the GPUs, because GPU-style MSAA is not going to work very well on it. Take for example the Starcraft II paper, as that's probably the kind of code that Larrabee will need to run.
D3D mandates MSAA, so as a minimum it has to be in there. Luckily for Intel they don't have to build any dedicated hardware - so less hardware engineering time and more software engineering time, meaning the chip design will have greater longevity. And things like A-buffers become trivial, without wasting any hardware of the kind the GPUs will find sat around doing nothing.

Though there's a reasonable chance that other GPUs will head this way in the not-too-distant future.

with much smaller 16x16 tiles which would increase the time spent binning the polygons quite a bit.
Seiler talks about bin-spread and says it amounts to 5%, so it doesn't sound like a big deal.

Jawed
 
I was actually thinking entry level discrete ~$100. [...] And this of course doesn't just apply to LRB because it seems this is the direction everybody will take eventually.
There's about a 10:1 performance spread between the best and worst D3D10 discrete GPUs you can buy.

Jawed, you're right... there will be fewer pixels to rasterize in most cases. But unlike today, where you can probably run older shader-light games at relatively high res on low-end hardware, this will become less of an option when rasterization also runs on the shaders. And there are things like tessellation to worry about too (although hopefully tessellating games allow us to scale geometry LOD as an option).
Older games were written for an era when hardware early-Z was in its infancy and overdraw was used dumbly, with very little available texturing throughput, low rasterisation rates, no in-engine MSAA support and no HDR. Between the texturing units and rasterisation, what else will Larrabee be doing? :LOL:

Additionally, you've seen the "performance estimates" for games like FEAR in Seiler's paper, which give the impression of easily matching something like G80.

I'm more interested in what happens with games that have very complex shaders. If a game like Grid is using a substantial part of current GPUs' ALUs (to the extent that HD4850 is about the same performance as GTX260 with ~half the bandwidth), then Larrabee, with substantially fewer GFLOPs for shading than competing GPUs, is going to look bad on the newest games.

The old games seem like an easy win, particularly if you can tweak the entire 3D pipeline layout to match the mechanics of the old tech in these games.

Tessellation, eventually, should help everything by keeping triangle:pixel ratios sane.

Newer rendering techniques, using compute-shader based approaches with on-die buffers and non-linear rasterisation will prolly heavily favour Larrabee because the on-die memory and the vector threading model are both extremely flexible.

Jawed
 
There's about a 10:1 performance spread between the best and worst D3D10 discrete GPUs you can buy.

Not sure how that's relevant to the potential for LRB scaling?

Between the texturing units and rasterisation, what else will Larrabee be doing? :LOL:
Well that's precisely my point. Since all it's doing is rasterization, it may have a perf/watt and perf/mm^2 disadvantage compared to older architectures with dedicated hardware.

Additionally, you've seen the "performance estimates" for games like FEAR in Seiler's paper, which give the impression of easily matching something like G80.
Heh, I'm a firm believer in empirical results. Right now at work my 3-year-old production system is going up against promises and PowerPoint slides. It's not a fair fight :)

I'm more interested in what happens with games that have very complex shaders.
Yeah that's another scenario to consider which is on the other end of the spectrum.
Newer rendering techniques, using compute-shader based approaches with on-die buffers and non-linear rasterisation will prolly heavily favour Larrabee because the on-die memory and the vector threading model are both extremely flexible.
Perhaps, but how many LRB specific rendering pipelines are we gonna get?
 
Not sure how that's relevant to the potential for LRB scaling?
The range is so large that Intel may choose to focus on a sub-range.

Maybe we're being blind-sided, but the noises coming out of Intel hint that it won't compete with NVidia's $650 GPU - though there's always the multi-GPU approach...

Not forgetting, of course, that IGPs are a rising tide eating into the <$75 market.

Well that's precisely my point. Since all it's doing is rasterization, it may have a perf/watt and perf/mm^2 disadvantage compared to older architectures with dedicated hardware.
It seems Intel is turning off individual ALU lanes on predication, so power saving in Larrabee runs way way deeper than we're used to seeing with any GPU.

It'll be interesting to see which has the highest max framerates, conventional GPUs or Larrabee - it's notable that 3DMark2001 has re-emerged as a great way to test the maximum power of the latest GPUs (though not quite as good as FurMark, it seems).

Rasterisation will cost more energy in Larrabee, but z/stencil-testing, MSAA, blending etc. will cost dramatically less because these operations never go off die (only the final pixels do). So unless AMD and NVidia are about to go with tiled rendering etc., the power involved in thrashing data within an off-die render target is going to vastly outweigh rasterisation, per se.

Though it's worth pointing out that binned geometry does go off-die, so it's not completely free of off-die round-trips per frame/rendering-pass.
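
Just to illustrate the shape of that argument (not real numbers - the resolution, overdraw, blending and bin-size figures below are all assumptions I've picked for the example, and it only counts colour traffic):

Code:
# Rough shape of the off-die traffic comparison, with made-up parameters.
WIDTH, HEIGHT   = 1920, 1200
BYTES_PER_PIXEL = 4                  # assumed colour format
MSAA            = 4
OVERDRAW        = 3.0                # assumed average overdraw
RMW             = 2                  # read + write per blended sample

pixels = WIDTH * HEIGHT

# Immediate-mode style: every (over)drawn sample hits an off-die render target.
imr = pixels * MSAA * OVERDRAW * RMW * BYTES_PER_PIXEL

# Tile-based style: samples stay in the on-die L2; only resolved pixels leave
# the chip, plus the binned geometry (size assumed) mentioned above.
BIN_BYTES = 50 * 1024 * 1024
tbr = pixels * BYTES_PER_PIXEL + BIN_BYTES

print(f"immediate-mode colour traffic: ~{imr / 2**20:.0f} MiB/frame")
print(f"tiled colour traffic:          ~{tbr / 2**20:.0f} MiB/frame")

Whatever parameters you plug in, the immediate-mode side scales with samples times overdraw while the tiled side scales with final pixels plus bin size, which is the point being made.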

Perhaps, but how many LRB specific rendering pipelines are we gonna get?
Sorry, should have been more explicit in mentioning D3D11, since I'm suggesting that the flexibility and fluidity of Larrabee will give it an advantage on these new types of code written for D3D11, rather than Larrabee-specific code. The hardware is more like a prairie than a patchwork of fields separated by hedges, ditches, gates and grids.

Jawed
 
It seems Intel is turning off individual ALU lanes on predication, so power saving in Larrabee runs way way deeper than we're used to seeing with any GPU.
What statements indicate that?
Turning off a lane inside of an ALU seems overly complex, particularly as the next vector mask that hits it most likely will not match the one prior.

Turning off units, much less a lane inside one, usually involves injecting several cycles of latency to get the circuits back to an electrically stable active state.

Core2 does deactivate units opportunistically, but Intel has stated this does mean the units are not usable for several cycles afterwards.

That can be hidden in an out-of-order core, which usually has pretty low utilization, but on an in-order core's only vector unit?
 
From the same Forsyth presentation I linked in the other thread:

Predication Slide said:
Usually also disables individual ALU lanes to save power
Now it's up to you to interpret what "disables" means :)
 
It would disable the register write phase of the operation for that lane, which would save power.

Maybe it's doing something to the input side of the ALUs as well.
They most probably can't be electrically turned off. If they were kept "frozen" on some fixed input so that the circuit is electrically unchanged, the power consumption from switching transistors would be saved.
 
Yeah, I expect that's what the graphics guys have been doing. To be honest, so far I'm not seeing where LRB is going to shine on "current" graphics workloads, either in terms of efficiency or performance. It might be a steep uphill battle for Intel.
 
Turning off a lane inside of an ALU seems overly complex, particularly as the next vector mask that hits it most likely will not match the one prior.
It's in Intel's material. Anyway, clock-gating the ALU lanes independently, depending on the value of the predication register, is perfectly possible. It's not the same thing as turning them off (i.e. turning Vdd off) but it should still provide a nice reduction in active power.
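
For what it's worth, here's a toy functional model of what per-lane predication buys you (names and structure are mine, not anything from the Intel material): the visible effect is that masked-off lanes neither compute nor write back, and the power argument is that in silicon those lanes can have their clocks and/or operands gated so nothing toggles.

Code:
VECTOR_WIDTH = 16

def predicated_vadd(dst, a, b, mask):
    """dst[i] = a[i] + b[i] where mask[i] is set; masked-off lanes are untouched."""
    for lane in range(VECTOR_WIDTH):
        if mask[lane]:
            dst[lane] = a[lane] + b[lane]
        # else: no compute, no register write-back for this lane - and, in
        # hardware, ideally no switching activity thanks to clock/operand gating.
    return dst

a    = list(range(VECTOR_WIDTH))
b    = [10] * VECTOR_WIDTH
dst  = [0] * VECTOR_WIDTH
mask = [i % 2 == 0 for i in range(VECTOR_WIDTH)]   # only even lanes active
print(predicated_vadd(dst, a, b, mask))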
 