Larrabee at GDC 09

Do you know of any GPU that has multiple setup units? Multiple rasterization units, on the other hand, should be fairly easy to implement as long as they are independently assigned to different screen tiles.
There's a Lindholm patent that clearly points out the possibility, but I'm not aware of anything else.

Beyond that, I found it intriguing to see that some NVIDIA GPUs can scale their triangle setup performance downwards through redundancy; I think there was even an IGP which could go from 1/6th per clock to 1/12th. Can't help but wonder how that works internally, assuming it's not just artificial to differentiate SKUs...

What would be very interesting is if everyone took a very different approach here; e.g. Larrabee does it as just described, NV has multiple fixed-function units, and AMD reuses the shader core in an interesting way for it. Hmm!
 
What would be very interesting is if everyone took a very different approach here; e.g. Larrabee does it as just described, NV has multiple fixed-function units, and AMD reuses the shader core in an interesting way for it. Hmm!
Do they use shader cores to perform primitive setup?
 
Do they use shader cores to perform primitive setup?

That seems to be what Intel is doing on its current IGPs:
http://software.intel.com/en-us/articles/intel-gma-3000-and-x3000-developers-guide/
Strip/Fan Unit
The functions of this stage are performed in two parts. Initially the fixed function portion of the Strip/Fan Unit is responsible for:
Applying the viewport transform to place the incoming primitive into screen space
Culling the incoming primitive if it is back facing
This stage then performs primitive setup via use of spawned setup threads to do coefficient computation and vertex attribute interpolation.

Clipping is also partly done with unified shader threads btw.
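
In case "coefficient computation" sounds vague: it's the usual plane-equation setup, solving for a, b, c such that attr(x, y) = a*x + b*y + c from the three post-viewport vertices. A minimal scalar sketch of what such a spawned setup thread has to compute per attribute (my own illustration, ignoring perspective correction; nothing to do with Intel's actual code):

Code:
// One screen-space vertex with a single interpolated attribute, for brevity.
struct SetupVertex { float x, y, attr; };

// Plane-equation coefficients, so that attr(x, y) = a*x + b*y + c.
struct AttribPlane { float a, b, c; };

// Solve for a, b, c from the attribute values at the three vertices.
// Degenerate and back-facing triangles are assumed to have been culled already.
AttribPlane setupAttribute(const SetupVertex& v0, const SetupVertex& v1, const SetupVertex& v2)
{
    // Determinant of the 2x2 system == twice the signed screen-space area.
    float det = (v1.x - v0.x) * (v2.y - v0.y) - (v2.x - v0.x) * (v1.y - v0.y);
    float inv = 1.0f / det;

    AttribPlane p;
    p.a = ((v1.attr - v0.attr) * (v2.y - v0.y) - (v2.attr - v0.attr) * (v1.y - v0.y)) * inv;
    p.b = ((v2.attr - v0.attr) * (v1.x - v0.x) - (v1.attr - v0.attr) * (v2.x - v0.x)) * inv;
    p.c = v0.attr - p.a * v0.x - p.b * v0.y;
    return p;
}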
 
Do they use shader cores to perform primitive setup?
Oh gosh no, I meant in the future. Clearly everyone must have considered quite a few possibilities for the DX11 generation. And yeah, Intel IGPs already do that today, although it's worth pointing out that they likely don't need overly advanced scheduling/concurrency mechanisms here given the low performance level (fewer triangles in flight at a given time, and a required throughput lower than the full 1 triangle/clock of discrete GPUs!)
 
No idea if that number is close to reality or not, but keep in mind that we haven't seen any programmable rasterizer yet, while LRB should make practical a lot of stuff that right now only exists in graphics research papers.
Meh, anything which requires fine-grained scatter/gather is still going to be a dog.

I still think their cache design was a mistake; it was necessary for the snooping coherency, but I would much rather they had sacrificed that (not coherency altogether, just the snooping). For instance, the merging of triangles previously discussed would become trivial (just have an array for each pixel, scatter into those arrays, pipeline a bit, and read finished quads out of the arrays for shading). It only becomes hard because vector accesses have to be coalesced in software for efficiency's sake.
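
To make the array-per-pixel idea a bit more concrete (just my reading of it, names and sizes made up, and with the software-coalescing problem conveniently ignored):

Code:
#include <cstdint>

// One pending fragment, scattered into a per-pixel slot by the rasterizer.
struct Fragment { uint32_t triId; float z; /* interpolants, coverage, ... */ };

constexpr int kTileSize      = 64;   // one screen tile
constexpr int kSlotsPerPixel = 4;    // tiny per-pixel array to absorb overlap

struct PixelSlots {
    Fragment frags[kSlotsPerPixel];
    int count = 0;
};

struct TileBuffer {
    PixelSlots pixels[kTileSize][kTileSize];

    // The rasterizer "scatters" fragments from many (small) triangles here.
    void scatter(int x, int y, const Fragment& f) {
        PixelSlots& p = pixels[y][x];
        if (p.count < kSlotsPerPixel)
            p.frags[p.count++] = f;
        // else: flush this pixel's quad to shading and retry (omitted)
    }

    // Later, walk the tile in 2x2 steps and emit only complete quads for shading,
    // even when the four pixels came from four different triangles.
    // Ordering/blending issues are ignored here; it's only meant to show the idea.
    template <typename EmitQuad>
    void drainQuads(EmitQuad emit) {
        for (int y = 0; y < kTileSize; y += 2)
            for (int x = 0; x < kTileSize; x += 2) {
                PixelSlots* q[4] = { &pixels[y][x],     &pixels[y][x + 1],
                                     &pixels[y + 1][x], &pixels[y + 1][x + 1] };
                while (q[0]->count && q[1]->count && q[2]->count && q[3]->count)
                    emit(q[0]->frags[--q[0]->count], q[1]->frags[--q[1]->count],
                         q[2]->frags[--q[2]->count], q[3]->frags[--q[3]->count]);
            }
    }
};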
 
J. Pineda, "A Parallel Algorithm for Polygon Rasterization", Computer Graphics, Vol. 22, No. 4, 1988, pp. 17–20.

;)
 
:LOL: I'd forgotten about that thread.

So ... what is SwiftShader doing?...

Jawed

Haven't heard anything since TransGaming launched it, back in 2005 was it?
The TransGaming website doesn't seem to have been updated anyway: still version 2.0, still only DX9 SM2.0 capability.
 
So ... what is SwiftShader doing?...
One of the methods we've devised is described in this patent: General Purpose Parallel Task Engine.

Note that on CPUs there's a bit more freedom than on LRB. To achieve good performance on LRB you're pretty much forced to use the 16-wide SIMD units. On a CPU a serial algorithm is sometimes still faster than a parallel algorithm. On the other hand SSE is better at handling vectors of small integers, and out-of-order execution can give some unexpected performance characteristics...

The public SwiftShader demo uses an all-round algorithm, but we also have an implementation better suited for 2D casual games, and one for very high polygon counts.

Like I've said before, I won't be surprised at all if LRB undergoes some major (software) updates after the hardware has been available for a while. Although there's a little less freedom, the algorithm Abrash presented is just one way of doing it. So it's important to keep that in mind when trying to evaluate things a priori. The hardware is fixed, the software not in the slightest.
 
Note that on CPUs there's a bit more freedom than on LRB.
What options (functionality?) are closed to the developer on Larrabee?

Or are you merely asserting that high-performance options on CPU are not so desirable on Larrabee, because higher performance is available through the VPU?

Jawed
 
What options (functionality?) are closed to the developer on Larrabee?

Or are you merely asserting that high-performance options on CPU are not so desirable on Larrabee, because higher performance is available through the VPU?
Well, if a parallel algorithm exists then it's highly likely that it will perform better than a scalar algorithm on LRB, indeed because of the fairly wide VPU. For things that map extremely badly to parallel algorithms, like say compilation, the VPU is pretty useless and that's of course why the legacy x86 pipeline is still a key component. But you can't really use it for any of the heavy processing.

A modern CPU on the other hand still has extremely powerful scalar pipelines. So there's no guarantee that using SSE will yield higher performance. And often a 'mixed' algorithm performs best of all.

So I wouldn't say any options are "closed" on either LRB or the CPU. They just have a different balance between resources and this affects the choice of algorithm. So knowing what works well on a CPU doesn't necessarily tell you anything useful about LRB...
 
A modern CPU on the other hand still has extremely powerful scalar pipelines. So there's no guarantee that using SSE will yield higher performance. And often a 'mixed' algorithm performs best of all.

But in the not-so-distant future CPUs will get much wider SSE units as well, tipping the balance more towards Larrabee.

So I wouldn't say any options are "closed" on either LRB or the CPU. They just have a different balance between resources and this affects the choice of algorithm. So knowing what works well on a CPU doesn't necessarily tell you anything useful about LRB...

That's exactly what Michael Abrash's article was about. Re-thinking rasterization from the top down, to come up with something that makes better use of Larrabee than a conventional CPU approach. And as he said, this is just 'one possible' rasterizer.

I can't wait to see this stuff in action. I'd like to see what performance is like... not only in the absolute sense, but also in terms of the 'profile'... No doubt there are cases where Larrabee is far faster than a conventional GPU, and vice-versa.
 
I'd be curious to see how a code implementation would check for the very small case in a way that avoids too much overhead in the general case and avoids a branch mispredict.

As they sort the triangles into tiles anyway (for which they have to estimate the screen-space area), they can probably keep two or three lists of triangles per tile, pre-sorted by size (i.e. direct-rasterizable, one recursion step, two recursion steps). This would give you branch-free processing of each triangle class.
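
Something along these lines, say (thresholds and names are invented, just to show that the classification falls out of the binning pass and each class then gets its own branch-free loop):

Code:
#include <algorithm>
#include <vector>

struct Triangle { float minX, minY, maxX, maxY; /* edge equations, ... */ };

// Size classes from the post above: small enough to rasterize directly,
// one level of recursive subdivision, or two levels.
enum SizeClass { kDirect = 0, kOneStep, kTwoSteps, kNumClasses };

struct TileBins {
    std::vector<Triangle> byClass[kNumClasses];   // one pre-sorted list per class
};

// The classification happens during binning, where the screen-space extent is
// being estimated anyway, so it adds almost nothing to the general case.
inline SizeClass classify(const Triangle& t) {
    float extent = std::max(t.maxX - t.minX, t.maxY - t.minY);
    if (extent <= 4.0f)  return kDirect;          // thresholds are made up
    if (extent <= 16.0f) return kOneStep;
    return kTwoSteps;
}

inline void binTriangle(TileBins& tile, const Triangle& t) {
    tile.byClass[classify(t)].push_back(t);
}

// Each list is then processed by its own fixed-depth loop: no per-triangle
// branch on size, no mispredicts.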
 
J. Pineda, "A Parallel Algorithm for Polygon Rasterization", Computer Graphics, Vol. 22, No. 4, 1988, pp. 17–20.

;)

Well, maybe not so surprisingly, it seems no one at RAD knew about the parallel rasterization algorithms first used on Pixel Planes, which map perfectly to a vector or SIMD architecture. Same for other papers using the same method combined with recursive rasterization.

Incremental and hierarchical Hilbert order edge equation polygon rasterization

Or 'software' triangle setup:

Triangle scan conversion using 2D homogeneous coordinates
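
For anyone who hasn't read the Pineda paper: each edge gives a linear function E(x, y) = a*x + b*y + c that is non-negative on the interior side, and a pixel is inside the triangle iff all three edge functions are non-negative. Since E is linear, it can be evaluated for a whole stamp of pixels at once, which is exactly what makes it map so well to Pixel Planes or a 16-wide VPU. A scalar sketch of the core (assuming counter-clockwise winding in screen space; the parallel version just evaluates the same thing for 16 pixels per instruction):

Code:
struct Edge { float a, b, c; };          // E(x, y) = a*x + b*y + c, >= 0 on the inside

// Edge function for the directed edge (x0, y0) -> (x1, y1),
// assuming counter-clockwise winding in screen space.
inline Edge makeEdge(float x0, float y0, float x1, float y1) {
    Edge e;
    e.a = y0 - y1;
    e.b = x1 - x0;
    e.c = x0 * y1 - x1 * y0;
    return e;
}

// Test a single sample against all three edges. In the parallel version this
// exact test runs for a whole 4x4 or 8x8 stamp of pixels at once.
inline bool inside(const Edge e[3], float x, float y) {
    for (int i = 0; i < 3; ++i)
        if (e[i].a * x + e[i].b * y + e[i].c < 0.0f)
            return false;
    return true;
}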

Akeley likely put a reference to the Olano and Greer paper (which pointed to the old Pineda paper), along with the words 'recursive descent', in the 2001 Real-Time Graphics Architectures course (see Rasterization slide 34). So from that hint, so many years ago, I made the clueless decision that it should be the method used to implement the rasterizer in ATTILA. You can look at the code in emul/RasterizerEmulator.cpp (the code here, not the one on my old page; I should change the signature), but good luck understanding it: there are two versions of the algorithm (single-triangle and parallel triangle processing) and also, mixed in there, a typical 'scan line' (actually scan tiled) rasterizer based on one of the papers about Compaq's Neon graphics processor, which makes the file even less readable.

Of course, once it was implemented, I finally hit the wall of the long start-up cost of recursive rasterization. As it was just a simulator (and performance was never a problem, 1 hour per frame is really fast :) ), the solution was to 'cheat': adding more ALUs than I would have found reasonable at the time, processing more than one triangle and tile in parallel per cycle, and using bounding boxes to select the start tile size rather than starting at the framebuffer resolution. That brought the throughput of a single rasterizer using the method to a reasonable number (ATTILA has always worked by generating blocks of 8x8 fragments, which in the end is also a problem for other reasons).
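
To show what the bounding-box 'cheat' amounts to: instead of starting the recursive descent at a framebuffer-sized tile, start at the smallest power-of-two tile that covers the triangle's bounding box, so the useless top levels of the recursion are skipped. A stripped-down sketch (this is not the actual ATTILA code; the stamp is just printed instead of being handed on to fragment generation):

Code:
#include <algorithm>
#include <cstdio>

struct Edge { float a, b, c; };          // same edge functions as in the sketch above

// Conservative reject: evaluate each edge at its most favourable corner of the
// square block; if even that is negative, the block is fully outside that edge.
static bool blockMayOverlap(const Edge e[3], float x, float y, float size) {
    for (int i = 0; i < 3; ++i) {
        float cx = e[i].a >= 0.0f ? x + size : x;
        float cy = e[i].b >= 0.0f ? y + size : y;
        if (e[i].a * cx + e[i].b * cy + e[i].c < 0.0f)
            return false;
    }
    return true;
}

static void descend(const Edge e[3], float x, float y, float size, float stamp) {
    if (!blockMayOverlap(e, x, y, size))
        return;                                            // trivially rejected
    if (size <= stamp) {
        std::printf("stamp at %g, %g\n", x, y);            // would go to fragment generation
        return;
    }
    float h = size * 0.5f;                                 // 4-way subdivision
    descend(e, x,     y,     h, stamp);
    descend(e, x + h, y,     h, stamp);
    descend(e, x,     y + h, h, stamp);
    descend(e, x + h, y + h, h, stamp);
}

// The bounding-box 'cheat': start at the smallest power-of-two tile that covers
// the triangle instead of at framebuffer resolution.
void rasterize(const Edge e[3], float minX, float minY, float maxX, float maxY) {
    float size = 8.0f;                                     // 8x8 stamps, as in ATTILA
    while (size < std::max(maxX - minX, maxY - minY))
        size *= 2.0f;
    descend(e, minX, minY, size, 8.0f);
}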

The whole setup and rasterization process was perfectly mappable to SIMD (you can see 4-way SIMD cover functions in the code; the idea was that they could be mapped to ARB-like 4-way SIMD fragment shader instructions). In fact we presented triangle setup on the shader processor in a paper some years ago, but never got to move the rasterization code to the shader. We started working on other unrelated topics ... and the shader processor never implemented branches.

But it isn't that strange that something relatively well known in the graphics hardware community (actually, how many people here even know what Pixel Planes was?) wasn't known in the software community. They have been building mostly serial rasterizers for decades with good success (because they targeted serial CPUs). And I'm pretty sure they will come up with a far more clever implementation than the one that can be found in the ATTILA source code.

Disclaimer: as you can see, this post is merely cheap self-promotion of ATTILA :)
 
Well, maybe not so surprisingly, it seems no one at RAD knew about the parallel rasterization algorithms first used on Pixel Planes, which map perfectly to a vector or SIMD architecture. Same for other papers using the same method combined with recursive rasterization.

Incremental and hierarchical Hilbert order edge equation polygon rasterization
That paper references Pixel Planes, so I'm pretty sure they knew about it :) They do call it out as being "expensive", although the relative tradeoffs in hardware have certainly shifted somewhat over time. Definitely can't claim ignorance of the literature, though ;)
 