Larrabee at GDC 09

RoOoBo · May 10, 2009

Andrew Lauritzen said:
That paper references Pixel Planes, so I'm pretty sure they knew about it. They do call it out as being "expensive", although the relative tradeoffs in hardware have certainly shifted somewhat over time. Definitely can't claim ignorance of the literature though.

Which paper? I was referring to Abrash's article. At least in my understanding, the introduction clearly points to an initial lack of knowledge of parallel rasterization algorithms. Parallel or vectorized rasterization isn't 'challenging' or 'impossible' in any way if you have prior knowledge of the Pixel Planes algorithms.
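To illustrate the kind of formulation being referred to, here is a minimal sketch of a Pixel-Planes-style coverage test: each triangle edge becomes a linear function E(x, y) = A*x + B*y + C, and that function is evaluated for a whole block of pixels at once. The function names, the 4x4 block size and the pixel-centre sampling are assumptions made for illustration; this is not code from Abrash's article or from the Larrabee renderer.

```python
# Sketch only: linear edge functions evaluated over a block of pixels.
# A pixel is covered when all three edge functions are >= 0.

def edge(p, q):
    """Return (A, B, C) with E(x, y) = A*x + B*y + C >= 0 on the left of p->q.

    "Left" is the interior side for a counter-clockwise triangle (y up).
    """
    (x0, y0), (x1, y1) = p, q
    return y0 - y1, x1 - x0, x0 * y1 - x1 * y0

def coverage_mask(tri, origin_x, origin_y, size=4):
    """Coverage of a size x size pixel block, sampled at pixel centres."""
    edges = [edge(tri[i], tri[(i + 1) % 3]) for i in range(3)]
    # The "lanes" of the hypothetical vector unit, in row-major order.
    lanes = [(origin_x + i + 0.5, origin_y + j + 0.5)
             for j in range(size) for i in range(size)]
    mask = [True] * len(lanes)
    for a, b, c in edges:
        # One pass per edge over all lanes stands in for one vector operation.
        mask = [m and (a * x + b * y + c >= 0)
                for m, (x, y) in zip(mask, lanes)]
    return mask

triangle = [(0, 0), (4, 0), (0, 4)]
covered = coverage_mask(triangle, 0, 0)
```

The per-edge work is the same multiply-add in every lane, which is why the test maps naturally onto wide vector hardware.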

In the early days of the project, as the vector width and the core architecture were constantly changing, one of the hardware architects, Eric Sprangle, kept asking us whether it might be possible to rasterize efficiently in software using vector processing. We kept patiently explaining that it wasn't

I'm not faulting Abrash or his team. Nobody knows everything, well, maybe Jawed, who knows every graphics patent in the world.

In any case, why even bother with such details? The point of my post was to promote my simulator. I haven't posted here in years, and now that I'm working on it again (at least for a little while) people must be reminded of such a wonder.

Andrew Lauritzen · May 10, 2009

RoOoBo said:
Which paper? I was referring to Abrash's article. At least in my understanding, the introduction clearly points to an initial lack of knowledge of parallel rasterization algorithms. Parallel or vectorized rasterization isn't 'challenging' or 'impossible' in any way if you have prior knowledge of the Pixel Planes algorithms.

I was referring to the paper that you linked (and I quoted your link) by McCool et al.

trinibwoy · Jul 22, 2009

I'm sure this is a silly question, but how do tiling approaches like the one proposed for Larrabee, with limited buffers (L2), scale to multiple render targets? Is the available buffer simply allocated equally to the various targets, and what does this mean for scalability?

Jawed · Jul 22, 2009

In theory to generate the data for MRTs you're doing more work (mix of ALU and TU) which helps to hide the extra latencies caused by both writing more data to the MRTs and dealing with the significantly reduced number of qquads that each core can support.

In Seiler the comparison of binned rendering and immediate mode rendering shows a huge bandwidth saving. So this theoretically means Larrabee has significant (née monstrous) leeway due to the render back end running entirely out of cache.

In traditional GPUs the render target cache (colour buffer and z/stencil buffer caches) routinely thrashes, even with a single render target (though colour rate is often significantly less than max and multiple quads of pixels will be output per thrash). In Larrabee the cached-tile won't thrash, it's really functioning as a tile-buffer with only minimal latency.

Seiler talks about 32x32 being the typical smallest tile size, and explicitly talks about an RTset with many colour channels (i.e. MRTs) or high-precision formats. 32x32 is at most 64 qquads in flight. A 32x32 tile with 4 MRTs (1x 32-bit depth + 4x 32-bit colour = 20 bytes per pixel) would only use 20KB, or 80KB with 4xMSAA. You'd have 2 or 3 threads sharing those 64 qquads plus 1 or 2 threads doing rasterisation/resolve, I suppose. You have to balance the number of in-flight qquads against the per-strand state (register allocation, not forgetting allocation for moving data to/from TUs).

---

If a single render target is tiled as 128x128 per core, that consumes 128KB of L2. That's a maximum of 1024 qquads in flight - but each strand could only have 8 bytes of state in L2. So in reality there'll be substantially fewer qquads in flight.
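The numbers above can be reproduced with a small calculation. This is a sketch under stated assumptions: a 256KB L2 per core, 16 pixels per qquad, 8 bytes per pixel for a single 32-bit colour target plus 32-bit z/stencil, and everything left over in L2 shared evenly as strand state. The function name is made up for illustration.

```python
L2_BYTES = 256 * 1024      # assumed per-core L2 size
STRANDS_PER_QQUAD = 16     # a qquad covers 16 pixels

def qquad_budget(tile_w, tile_h, bytes_per_pixel, l2_bytes=L2_BYTES):
    """Return (tile bytes, max qquads in flight, L2 bytes left per strand)."""
    pixels = tile_w * tile_h
    tile_bytes = pixels * bytes_per_pixel
    max_qquads = pixels // STRANDS_PER_QQUAD
    # Upper bound on state per strand if every pixel had a strand in flight.
    state_per_strand = (l2_bytes - tile_bytes) / pixels
    return tile_bytes, max_qquads, state_per_strand

big_tile = qquad_budget(128, 128, 8)    # single render target
mrt_tile = qquad_budget(32, 32, 20)     # 4 MRTs sharing one z/stencil
```

For the 128x128 case this gives 131072 bytes, 1024 qquads and 8 bytes per strand; the 32x32 tile with 4 MRTs leaves far more room per strand, at the cost of fewer qquads.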

So 64 qquads for the 4x MRTs tile, with an implicit 16:1 ALU:TEX (since Larrabee is serial scalar, 4:1 in vec4 terms) is still a substantial amount of latency hiding. If a modern game's MRT-generating pass is using a substantially lower ALU:TEX then that's just sucky.

Put another way, GT200 is quite happy with 32 warps per multiprocessor, equivalent to 64 qquads, at what I guess are similar clockspeeds to those we'll see in Larrabee.

Jawed

cho · Jul 22, 2009

If you are using an fp16 render target, that will consume 256KB of L2.

trinibwoy · Jul 22, 2009

Jawed said:
Seiler talks about 32x32 being the typical smallest tile size, and explicitly talks about an RTset with many colour channels (i.e. MRTs) or high-precision formats. 32x32 is at most 64 qquads in flight. A 32x32 tile with 4 MRTs (1x 32-bit depth + 4x 32-bit colour = 20 bytes per pixel) would only use 20KB, or 80KB with 4xMSAA.

Ah thanks, guess I should've run the math to see that you can actually do a lot with 256KB of L2. Are you sure that 4 MRTs are only 20KB though? I get 80KB even without MSAA (20 KB for each).

If a single render target is tiled as 128x128 per core, that consumes 128KB of L2. That's a maximum of 1024 qquads in flight - but each strand could only have 8 bytes of state in L2. So in reality there'll be substantially fewer qquads in flight.

Is the Larrabee model strictly one strand per pixel or is it possible to have relatively fewer persistent qquads iterate over sub-tiles within the tile? I guess nothing is strict when it comes to LRB....

Put another way, GT200 is quite happy with 32 warps per multiprocessor, equivalent to 64 qquads, at what I guess are similar clockspeeds to those we'll see in Larrabee.

Good point. I still wonder whether it would make sense for shared memory to double as a tile buffer. State storage won't be a problem as the register file will presumably continue to exist. But I guess for that to work you'd want a bit more than the 32KB mandated by DX11.

Jawed · Jul 22, 2009

cho said:
If you are using an fp16 render target, that will consume 256KB of L2.

You'd use a smaller tile in that case. And yet smaller if doing MSAA.

Jawed
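A sketch of how such a tile-size choice might look: try square tiles from largest to smallest and take the first one that fits a byte budget. The 128KB budget (half of a 256KB L2), the candidate sizes and the function name are all assumptions for illustration, not anything stated by Seiler or Intel.

```python
def largest_square_tile(bytes_per_sample, msaa_samples=1,
                        budget_bytes=128 * 1024,
                        candidates=(128, 64, 32, 16)):
    """Return the largest candidate tile side whose buffers fit the budget.

    Returns None when even the smallest candidate does not fit.
    """
    for side in candidates:
        if side * side * bytes_per_sample * msaa_samples <= budget_bytes:
            return side
    return None

# 32-bit colour + 32-bit z/stencil = 8 bytes per sample.
rgba8_side = largest_square_tile(8)
# Assumed fp16 layout: 4 channels x 2 bytes + 4 bytes z/stencil = 12 bytes.
fp16_side = largest_square_tile(12)
fp16_msaa_side = largest_square_tile(12, msaa_samples=4)
```

Under these assumptions the tile shrinks from 128x128 to 64x64 for fp16, and to 32x32 once 4xMSAA is added.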

Jawed · Jul 22, 2009

trinibwoy said:
Ah thanks, guess I should've run the math to see that you can actually do a lot with 256KB of L2. Are you sure that 4 MRTs are only 20KB though? I get 80KB even without MSAA (20 KB for each).

A colour pixel in a normal render target is 4 bytes + z/stencil is 4 bytes. 4x MRTs share z/stencil, so 4*4+4=20 bytes per pixel for 32*32=1024 pixels = 20KB.
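The same arithmetic as a short sketch, with the shared z/stencil counted once per pixel rather than once per render target. The function and parameter names are made up for illustration; the MSAA factor simply multiplies every buffer by the sample count, which is an assumption about storage layout.

```python
def tile_footprint_bytes(tile_w, tile_h, colour_targets,
                         colour_bytes=4, depth_stencil_bytes=4,
                         msaa_samples=1):
    """Bytes needed to hold one tile of an RTset in cache."""
    # All colour targets share a single z/stencil value per sample.
    per_sample = colour_targets * colour_bytes + depth_stencil_bytes
    return tile_w * tile_h * per_sample * msaa_samples

four_mrts = tile_footprint_bytes(32, 32, colour_targets=4)
four_mrts_msaa = tile_footprint_bytes(32, 32, colour_targets=4, msaa_samples=4)
```

Here four_mrts is 20480 bytes (20KB) and four_mrts_msaa is 81920 bytes (80KB), matching the figures quoted earlier.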

Is the Larrabee model strictly one strand per pixel or is it possible to have relatively fewer persistent qquads iterate over sub-tiles within the tile? I guess nothing is strict when it comes to LRB....

There are four (hardware) threads per core, so each thread can support multiple qquads (fibres in the general sense, i.e. could be sets of vertices, etc.). So a strand is supporting a pixel from each of numerous qquads. Each qquad is, effectively, a different region of 16 pixels in the tile. So a thread with 8 qquads in flight, say, has 8 different pixels spread across the tile being processed by a single strand.
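A tiny sketch of that mapping, assuming each qquad is a 4x4 block of pixels and that lane i of the vector unit always handles the same position inside every qquad. The row-major lane order and the function name are illustrative assumptions; the real lane layout may differ.

```python
def pixels_for_strand(lane, qquad_origins, qquad_side=4):
    """Tile-space pixels handled by one strand (vector lane) of a thread.

    qquad_origins lists the top-left pixel of each qquad the thread has
    in flight.
    """
    dx, dy = lane % qquad_side, lane // qquad_side
    return [(ox + dx, oy + dy) for ox, oy in qquad_origins]

# Two qquads in flight at different places in the tile.
handled = pixels_for_strand(5, [(0, 0), (16, 8)])
```

With 8 origins in the list, the same call returns 8 pixels, one per qquad, which is the situation described above.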

qquads could easily iterate over the tile as you say, but each thread would normally have multiple qquads in flight at a given time, as four hardware threads, in their own right, aren't enough to hide most memory/texture latencies.

Good point. I still wonder whether it would make sense for shared memory to double as a tile buffer. State storage won't be a problem as the register file will presumably continue to exist. But I guess for that to work you'd want a bit more than the 32KB mandated by DX11.

I think shared memory, under graphics pipeline configuration (i.e. VS-GS-PS etc.), is used to hold triangle attributes ready for just-in-time interpolation by the pixel shader.

But more importantly, Larrabee bins geometry into screen-space tiles. Seiler describes a tiled forward renderer as Larrabee doesn't attempt to Z-sort/cull triangles like a tiled deferred renderer does, instead relying upon a Z-buffer and un-ordered triangle rasterisation/pixel-shading.
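As a rough sketch of what binning geometry into screen-space tiles means, the following assigns each triangle to every tile its bounding box touches. It is a conservative, simplified illustration that assumes non-negative screen coordinates and no clipping; the names and the bounding-box test are assumptions, not the scheme described by Seiler.

```python
from collections import defaultdict

def bin_triangles(triangles, tile_size=32):
    """Map (tile_x, tile_y) -> list of triangle indices overlapping that tile.

    Overlap is tested with the triangle's bounding box, so some bins may
    contain triangles that do not actually cover any pixel in the tile.
    """
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        tx0, tx1 = int(min(xs)) // tile_size, int(max(xs)) // tile_size
        ty0, ty1 = int(min(ys)) // tile_size, int(max(ys)) // tile_size
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                # Indices are appended in submission order in this sketch.
                bins[(tx, ty)].append(index)
    return bins

scene = [
    [(1, 1), (10, 1), (1, 10)],      # entirely inside tile (0, 0)
    [(30, 30), (40, 30), (30, 40)],  # straddles four tiles
]
bins = bin_triangles(scene)
```

Once the bins are built, a core can take one tile, walk its list, and keep that tile's colour and z/stencil data in cache for the whole pass.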

So once pixel shading of a tile is started it doesn't stop until all triangles in that tile have been rasterised and shaded. ATI and NVidia GPUs don't do any such binning, so putting a tile into shared memory would only last a short while before it's evicted for another tile. This exact process (thrashing tiles from the render target into on-die cache) is done by the ROPs. ATI functional diagrams contain blocks called "colour buffer cache" and "z/stencil buffer cache". Each of these is an independent cache dedicated to the named buffer.

Jawed

nAo · Jul 22, 2009

Keep in mind that as long as you can prefetch your frame buffer data from memory to L2 early enough, you can have almost arbitrarily big tiles. Obviously this is not an optimal solution, as it requires more memory bandwidth.

trinibwoy · Jul 24, 2009

Jawed said:
A colour pixel in a normal render target is 4 bytes + z/stencil is 4 bytes. 4x MRTs share z/stencil, so 4*4+4=20 bytes per pixel for 32*32=1024 pixels = 20KB.

Yes, of course. Excuse my fuzzy math.

Megadrive1988 · Jul 30, 2009

Gamasutra's feature interview with Intel's Mike Burrows on Larrabee:

http://www.gamasutra.com/php-bin/news_index.php?story=24638


MfA · Jul 30, 2009

That didn't have a very high information content.

repi · Jul 31, 2009

Mike is a great guy though!
