Rasterization / Scan Conversion

trinibwoy
Arun's comment about the possibility that things like scan conversion may be decentralized in upcoming architectures made me realize that I have no idea how scan conversion is actually accelerated!

Is it based on the approach outlined in this paper - http://www.cs.unc.edu/~olano/papers/2dh-tri/2dh-tri.pdf ?

The paper references Pineda's "A Parallel Algorithm for Polygon Rasterization", which is also referenced by this Nvidia patent.

The Olano paper outlines an algorithm that subdivides screen space and distributes those divisions to multiple SIMD processors, which evaluate edge functions for pixels in parallel. But how does scan conversion work in actual hardware, and is it an easy fit for execution on a programmable shader?
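For concreteness, here's a minimal sketch of the edge-function approach from the Pineda paper that Olano builds on - my own illustration, not anything lifted from actual hardware. A pixel is covered when it lies on the interior side of all three edge functions, and because each pixel's test is independent of every other pixel's, the work parallelises trivially across SIMD lanes or screen tiles:

#include <stdio.h>

typedef struct { float x, y; } vec2;

/* Pineda edge function: positive when (px, py) lies on the interior
   side of the directed edge a->b (counter-clockwise winding, y up). */
static float edge(vec2 a, vec2 b, float px, float py)
{
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

static void rasterise(vec2 v0, vec2 v1, vec2 v2, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float px = x + 0.5f, py = y + 0.5f;  /* sample pixel centre */
            /* Covered when inside all three half-planes; every pixel's
               test is independent, hence trivially parallel. */
            if (edge(v0, v1, px, py) >= 0.0f &&
                edge(v1, v2, px, py) >= 0.0f &&
                edge(v2, v0, px, py) >= 0.0f)
                printf("fragment (%d,%d)\n", x, y);
        }
    }
}

int main(void)
{
    vec2 v0 = {1, 1}, v1 = {7, 2}, v2 = {3, 6};  /* CCW triangle */
    rasterise(v0, v1, v2, 8, 8);
    return 0;
}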
 
The basic property of rasterisation is that you don't need much of it. You only need to rasterise as fast as you can do early-Z rejection and colour fillrate.

Colour fillrate in GPUs is typically 16 pixels per clock (HD4870) or 32 (GTX280). That's not a particularly demanding workload.

For an early-Z capable GPU you can argue that a higher rasterisation rate is required, because if a number of triangles are all fully rejected (i.e. over several successive clocks) then you could end up with clock cycles in which there are no pixels, i.e. colour fillrate goes to waste.

Similarly, the fragment shading pipeline is a greedy monster, e.g. in GTX280 it can theoretically suck in 240 fragments (pixels) per clock - far, far higher than colour fillrate.

So in GTX280, when a large triangle is rasterised, only the first 32 fragments are generated in the first clock. That's enough to keep one multiprocessor in one cluster happy. The others will just have to wait.

If GTX280 rasterised more quickly you would soon reach the point where shaded fragments leave the clusters faster than the ROPs can cope with (this presumes the pixel shader is very short, i.e. runs in fewer than ~7.5 core clocks in GTX280 - which is fewer than ~16 ALU clocks when accounting for the 602MHz core and 1296MHz ALU clocks).
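To spell that arithmetic out (using the numbers above, which are my reading of the specs): with 240 fragments resident against a 32 pixel/clock ROP rate, the shader has to hold each fragment for at least 240 / 32 = 7.5 core clocks to avoid outrunning the ROPs, and 7.5 core clocks is 7.5 × (1296 / 602) ≈ 16 ALU clocks.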

So the actual rasterisation rate in a GPU is a compromise between the high instantaneous demand of "start-up" situations, such as a big triangle or lots of triangles being rejected (back-facing or hidden by closer objects), and the steady rate the rest of the pipeline can actually consume.

So, in my view, distributing rasterisation isn't a particularly worthwhile idea if you are going to do fixed-function rasterisation. The average performance you're aiming at when building a partly fixed-function pipeline doesn't warrant it.

If, on the other hand, you have a software back-end (no ROPs) and software early-Z, then a scalable rasteriser, parallelisable across multiple screen-space tiles, is exactly what you want. It will run as fast or as slow as currently demanded by fragments in the GPU pipeline. Of course, it will run as software, too.

Larrabee :p

Jawed
 
The basic property of rasterisation is that you don't need much of it. You only need to rasterise as fast as you can do early-Z rejection and colour fillrate.

Colour fillrate in GPUs is typically 16 pixels per clock (HD4870) or 32 (GTX280). That's not a particularly demanding workload.

If GTX280 rasterised more quickly you would soon reach the point where shaded fragments leave the clusters faster than the ROPs can cope with (this presumes the pixel shader is very short, i.e. runs in fewer than ~7.5 core clocks in GTX280 - which is fewer than ~16 ALU clocks when accounting for the 602MHz core and 1296MHz ALU clocks).

Good point. I'd still like to know how the scan conversion process is actually implemented in current hardware though!!

So in GTX280, when a large triangle is rasterised, only the first 32 fragments are generated in the first clock. That's enough to keep one multiprocessor in one cluster happy. The others will just have to wait.

I was thinking more in terms of tessellation and/or deferred shading where there's a heavy geometry workload up front. At some point there will be enough small triangles that each one doesn't produce many pixels. So triangle throughput would have to increase to keep the shaders and ROPs fed...

When people refer to a GPU being capable of setting up 1 triangle per clock, what exactly are they referring to? I can't figure out how a GPU can rasterize an arbitrarily sized triangle in a single clock cycle. For what it's worth I could get up to 600MTri/s in Rightmark on my GTX 285 clocked at 700MHz, which is about 85% of the theoretical 1 triangle per clock (600MTri/s / 700MHz ≈ 0.86 triangles/clock) - not too bad. Throughput scales perfectly with core clock and doesn't respond at all to shader clock.


Certainly Nvidia (and AMD) are aiming to blow Larrabee off the face of the planet so I wouldn't be surprised.
 
Good point. I'd still like to know how the scan conversion process is actually implemented in current hardware though!!
Just find the right patents ;) What you found is basically a hierarchical rasteriser as far as I can tell. There's other similar stuff here:

Accellerated start tile search

Tile based precision rasterization in a graphics pipeline

These seem to be related specifically to screen-space tiling, which is, implicitly, a hierarchical rasterisation problem (first work out which tile(s) the triangle covers and then break rasterisation down tile by tile), as in the sketch below.
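To make the two-level idea concrete, here's a hedged sketch - my reading of these patent documents, not any vendor's actual logic. Because an edge function is linear over the screen, a tile whose four corners all fall outside the same edge can be rejected with a handful of tests, and only surviving tiles are rasterised pixel by pixel:

#include <stdbool.h>
#include <stdio.h>

#define TILE 8

typedef struct { float x, y; } vec2;

static float edge(vec2 a, vec2 b, float px, float py)
{
    return (b.x - a.x) * (py - a.y) - (b.y - a.y) * (px - a.x);
}

/* True if the whole tile lies outside the directed edge a->b.
   The edge function is linear, so if all four corners are outside,
   every point in the tile is outside. */
static bool tile_outside(vec2 a, vec2 b, int tx, int ty)
{
    for (int cy = 0; cy <= 1; cy++)
        for (int cx = 0; cx <= 1; cx++)
            if (edge(a, b, (tx + cx) * TILE, (ty + cy) * TILE) >= 0.0f)
                return false;  /* at least one corner inside */
    return true;
}

static void rasterise_tiled(vec2 v0, vec2 v1, vec2 v2, int w, int h)
{
    for (int ty = 0; ty < h / TILE; ty++) {
        for (int tx = 0; tx < w / TILE; tx++) {
            if (tile_outside(v0, v1, tx, ty) ||
                tile_outside(v1, v2, tx, ty) ||
                tile_outside(v2, v0, tx, ty))
                continue;  /* whole tile rejected in one go */
            printf("tile (%d,%d) may contain fragments\n", tx, ty);
            /* ...descend: per-pixel edge tests within this tile,
               exactly as in the earlier per-pixel sketch. */
        }
    }
}

int main(void)
{
    vec2 v0 = {5, 5}, v1 = {60, 12}, v2 = {25, 50};  /* CCW triangle */
    rasterise_tiled(v0, v1, v2, 64, 64);
    return 0;
}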

I'm not sure how much detail you want on the process of scan conversion. As I alluded to already, early-Z rejection requires a rasterisation that is likely to be separate from the rasterisation used to generate fragments. This is alluded to in the abstract here:

Rendering pipeline

which I haven't read.

Here's some other patent documents:

System, method and computer program product for geometrically transforming geometric objects

which is nice, lots of pix. This:

Method and system for a general instruction raster stage that generates programmable pixel packets

might be for handheld devices? The basic problem with this stuff is that we're forced to assemble a picture of the functionality from fragments scattered across multiple patent documents.

Some ATI stuff:

Optimized primitive filler
Optimal initial rasterization starting point

which look ancient and prolly have been optimised since. This is ATI's hierarchical tiler:

Method and apparatus for rasterizer interpolation

This seems to be rasterisation:

Rendering polygons

but it's ancient.

This is a nice overview of a real graphics pipeline:

http://ati.amd.com/products/radeonx800/RadeonX800ArchitectureWhitePaper.pdf

I was thinking more in terms of tessellation and/or deferred shading where there's a heavy geometry workload up front. At some point there will be enough small triangles that each one doesn't produce many pixels. So triangle throughput would have to increase to keep the shaders and ROPs fed...
Yep, which is why you see occasional moans about low setup rates, as that's more of a constraint than rasterisation it seems. NVidia GPUs seemed to have a half-triangle per clock setup rate for a long time.

Again, Larrabee for the win (eventually...). The only bits of Larrabee that are going to be fixed bottlenecks are the memory and texturing systems. Everything else is open-ended, based purely on workload. That's not to say it can't be wasteful (e.g. it's programmed to construct batches of fragments that are a minimum of 16 in size, but the program only packs a maximum of 4 small triangles into the batch).

When people refer to a GPU being capable of setting up 1 triangle per clock what exactly are they referring to? I can't figure out how a GPU can rasterize an arbitrarily sized triangle in a single clock cycle.
Triangle setup is essentially working out which vertices form which triangles. That proceeds, nominally, at 1 triangle per clock. A lot of the time you get one triangle per vertex coming out of the vertex shading pipeline (e.g. triangles from a triangle strip). Other times setup might be waiting for multiple vertices to be shaded to make a single triangle.

Once the setup engine has made the triangle it then commences rasterisation - but as I described earlier, it only needs to rasterise at a given rate, e.g. 16 or 32 fragments per clock. So, it doesn't matter if the triangle covers 16000 screen space pixels - it'll take its own sweet time.
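To illustrate why setup is a fixed cost per triangle regardless of how many pixels it covers - a sketch under my own assumptions, not a vendor's actual datapath - setup boils each edge down to a plane equation E(x,y) = Ax + By + C. That's a constant amount of arithmetic per triangle (hence the nominal 1 triangle per clock), after which the rasteriser can evaluate E incrementally at whatever fragments-per-clock rate it was built for:

#include <stdio.h>

typedef struct { float x, y; } vec2;
typedef struct { float A, B, C; } edge_eq;  /* E(x,y) = A*x + B*y + C */

/* Reduce directed edge a->b to plane-equation form; E >= 0 on the
   interior side for CCW winding. Stepping one pixel in x adds A and
   one pixel in y adds B, so the per-pixel cost is a couple of adds,
   no matter how big the triangle is. */
static edge_eq setup_edge(vec2 a, vec2 b)
{
    edge_eq e;
    e.A = a.y - b.y;              /* constant cost per edge... */
    e.B = b.x - a.x;
    e.C = a.x * b.y - a.y * b.x;  /* ...independent of pixel coverage */
    return e;
}

int main(void)
{
    vec2 v0 = {1, 1}, v1 = {7, 2}, v2 = {3, 6};  /* CCW triangle */
    edge_eq e01 = setup_edge(v0, v1);
    edge_eq e12 = setup_edge(v1, v2);
    edge_eq e20 = setup_edge(v2, v0);
    printf("E01: A=%g B=%g C=%g\n", e01.A, e01.B, e01.C);
    printf("E12: A=%g B=%g C=%g\n", e12.A, e12.B, e12.C);
    printf("E20: A=%g B=%g C=%g\n", e20.A, e20.B, e20.C);
    return 0;
}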

Obviously the opposite problem that you've alluded to is the small triangles, particularly the little fuckers that don't even cover a whole pixel (but might still need to be rendered). This is where things get woolly as to how ATI and NVidia tackle these - particularly as it's a pixel shader batch-packing problem too. I think one of the NVidia patent documents I whistled past talks about this - but to be honest I'm not madly keen on trying to decipher all this.

Jawed
 
There's other similar stuff here:

Thanks!

Triangle setup is essentially working out which vertices form which triangles. That proceeds, nominally, at 1 triangle per clock. A lot of the time you get one triangle per vertex coming out of the vertex shading pipeline (e.g. triangles from a triangle strip). Other times setup might be waiting for multiple vertices to be shaded to make a single triangle.

Ok, that makes complete sense.

Once the setup engine has made the triangle it then commences rasterisation - but as I described earlier, it only needs to rasterise at a given rate, e.g. 16 or 32 fragments per clock. So, it doesn't matter if the triangle covers 16000 screen space pixels - it'll take its own sweet time.

But this is where I'm completely lost. If setup is 1 triangle per clock but rasterization is only 16 or 32 fragments per clock how could setup ever possibly be a bottleneck!? Doesn't the average triangle nowadays cover more than 16-32 pixels?
 
Sometimes a GPU is only rendering to depth - it is literally just generating the geometry and effectively working out what's visible. There's no pixel shading involved. So, if the GPU isn't z-rate limited in the ROPs, then it'll be setup-limited. This presumes that the vertex shaders are short'n'sweet, of course - normally they'll be simplified, e.g. to remove calculations of vertex attributes that could only be used by a pixel shader.

Jawed
 
Brain-fade alert - that last post isn't right: rasterisation to generate zixels is still required. With MSAA turned on the z-fillrate can be doubled or quadrupled per clock for the same rasterisation rate.

So setup rate becomes an issue mostly with small triangles and when doing an early-Z pass or rendering depth-only for shadow buffering.
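As a worked example (the numbers are purely illustrative): at 1 triangle per clock, triangles averaging 8 covered pixels generate only ~8 zixels per clock against a z-only ROP rate of 32 or more per clock (several times that with MSAA), so setup and rasterisation become the wall rather than the ROPs.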

Jawed
 