Efficient software rasterization

3dcgi

Does anyone have any links to help me learn how an efficient software rasterizer would work?

I've already looked around a bit and read a couple articles, but I figure someone might have some good links saved.

I'm not looking to write a software rasterizer or become an expert, just to understand how an efficient one works.

I'm looking for things like instruction mix, and whether data is processed in small chunks so it stays in cache throughout the rendering process, or spilled to memory between stages (i.e. vertex processing -> rasterization and rasterization -> shading).

Thanks in advance for any help.
 
I'm looking for things like instruction mix...
What exactly do you need to know about it?
...whether data is processed in small chunks so it stays in cache throughout the rendering process, or spilled to memory between stages (i.e. vertex processing -> rasterization and rasterization -> shading).
ExtremeTech has a page on SwiftShader 2.0's architecture. Basically, yes, it will try to keep things in cache as much as possible.
 
What exactly do you need to know about it?
I was thinking about the percentage of scalar vs. vector instructions. This question came about from past theorizing that Larrabee will have separate scalar and vector units with hyperthreading, which got me wondering whether rasterization could run fast if it's mostly on the scalar unit while shading is vectorized.

Basically, yes, it will try to keep things in cache as much as possible.
Is SwiftShader aware of how much cache is available on the processor it's running on, or does it assume a certain amount is available?

Thanks.
 
Does anyone have any links to help me learn how an efficient software rasterizer would work?
It's hard to define 'efficient' (what hardware are you working on?), but to avoid going crazy trying to implement a robust clipper (it ain't easy, believe me :) ) I'd go for homogeneous rasterization.

It's relatively simple to implement, you can easily clip your triangles (and disable clipping when you don't need it, to make it faster), and throwing in extra clipping planes is a piece of cake. Moreover it maps reasonably well to modern CPUs and GPUs (especially Cell...).
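To make that concrete, here's a minimal sketch in the style of Olano & Greer's clipless setup, under my own assumptions (the vertex values, grid resolution, and names are all just for illustration):

```cpp
// Minimal sketch of homogeneous rasterization (Olano & Greer style).
// Vertices stay in clip space (x, y, w); there's no divide by w and no
// geometric clipper. Triangles crossing w = 0 need an extra sign test
// that this sketch omits. All names are illustrative.
#include <cstdio>

struct ClipVert { float x, y, w; };   // clip-space position (z omitted here)

struct Edge { float a, b, c; };       // E(px, py) = a*px + b*py + c

// Each edge function is the cross product of the other two vertices,
// treating (x, y, w) as a 3-vector.
static Edge edgeFrom(const ClipVert& p, const ClipVert& q) {
    return { p.y * q.w - p.w * q.y,
             p.w * q.x - p.x * q.w,
             p.x * q.y - p.y * q.x };
}

static float eval(const Edge& e, float px, float py) {
    return e.a * px + e.b * py + e.c;
}

int main() {
    ClipVert v0 = { -0.8f, -0.8f, 1.0f };
    ClipVert v1 = {  0.8f, -0.8f, 1.0f };
    ClipVert v2 = {  0.0f,  0.8f, 1.0f };

    Edge e0 = edgeFrom(v1, v2);   // edge opposite v0
    Edge e1 = edgeFrom(v2, v0);   // edge opposite v1
    Edge e2 = edgeFrom(v0, v1);   // edge opposite v2

    // The determinant equals e0 evaluated at v0; its sign gives the facing.
    float det = v0.x * e0.a + v0.y * e0.b + v0.w * e0.c;
    if (det <= 0.0f) return 0;    // back-facing or degenerate: culled

    // Walk a coarse grid over NDC [-1,1]^2 as a stand-in for pixel centers.
    for (int j = 15; j >= 0; --j) {
        for (int i = 0; i < 16; ++i) {
            float px = -1.0f + (i + 0.5f) / 8.0f;
            float py = -1.0f + (j + 0.5f) / 8.0f;
            bool inside = eval(e0, px, py) >= 0.0f &&
                          eval(e1, px, py) >= 0.0f &&
                          eval(e2, px, py) >= 0.0f;
            putchar(inside ? '#' : '.');
        }
        putchar('\n');
    }
    return 0;
}
```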
 
I was thinking about the percentage of scalar vs. vector instructions. This question came about from past theorizing that Larrabee will have separate scalar and vector units with hyperthreading, which got me wondering whether rasterization could run fast if it's mostly on the scalar unit while shading is vectorized.
Why would rasterization be less vectorized than shading?
Is SwiftShader aware of how much cache is available on the processor it's running on, or does it assume a certain amount is available?
The public demo is more or less tuned for Core 2, but it's not like performance collapses if you have less cache; Phenom is doing OK and appears to be limited more by other things than cache.

Things might be quite different for Larrabee though. Given the bandwidth and arithmetic throughput, it might try hard to keep certain data in cache while other data is allowed to spill to RAM.

As the software evolves I expect to see quite interesting changes in performance characteristics.
 
...but to avoid going crazy trying to implement a robust clipper (it ain't easy, believe me :) ) I'd go for homogeneous rasterization
Could you elaborate on the robustness issues you encountered when implementing a clipper? I find it quite straightforward.
...(and disable clipping when you don't need it, to make it faster), and throwing in extra clipping planes is a piece of cake.
That's no issue for a geometrical clipper either.
 
Could you elaborate on the robustness issues you encountered when implementing a clipper? I find it quite straightforward.
On PS2 I worked on a fast (clip-space) clipper and I had robustness issues due to the order certain floating-point operations were done in (clipping the edge A->B and clipping B->A didn't have the same outcome), and also with T-junctions (unfortunately I had to deal with those) when the longer partially shared edge was being clipped and the shorter one was not.
 
That's no issue for a geometrical clipper either.
Correct. Though with homogeneous rasterization clipping is a 'natural' operation: you basically don't need to write and maintain any extra code to handle it, which is neat.
 
I'm looking for things like instruction mix, and whether data is processed in small chunks so it stays in cache throughout the rendering process

From the article I linked to:

"It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it's hard to process pixels in parallel. It identifies filled scanlines, but this is only suited for processing one pixel at a time. A much more efficient approach is to process 2x2 pixels together. These are called quads. By sharing some setup cost per quad, and using advanced parallel instructions, this results in a significant speedup. Some of the current graphics hardware also uses quad pixel pipelines."
 
The funny thing with software rasterization is that at the point where triangles get pixel-sized, all this clipping and such becomes trivial; effectively all you need is a really fast z-buffered scatter (point drawing) and some post-processing to clean up the artifacts (assuming you've got the LOD problem solved, which is easy IMO).
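As a sketch, that scatter is little more than a depth-tested store (buffer layout invented for illustration):

```cpp
// Sketch of the z-buffered scatter: a pixel-sized triangle reduces to a
// single depth-tested point write. The buffer layout is made up.
inline void scatter(float* depth, unsigned* color, int pitch,
                    int x, int y, float z, unsigned rgba) {
    int i = y * pitch + x;
    if (z < depth[i]) {   // closer than what's already there?
        depth[i] = z;
        color[i] = rgba;
    }
}
```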
 
clipping the edge A->B and clipping B->A didn't have the same outcome
That's easy enough to solve by always clipping in the same direction (from the vertex on the back side of the plane to the one on the front side, or the reverse).
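In code the trick looks roughly like this; a minimal sketch assuming an OpenGL-style near plane, with all names invented:

```cpp
// Sketch: clip an edge against a plane in an order-independent way by
// always interpolating from the inside vertex toward the outside one.
struct Vert { float x, y, z, w; };

// Signed distance to the plane; e.g. the near plane z >= -w.
static float dist(const Vert& v) { return v.z + v.w; }

static Vert lerp(const Vert& a, const Vert& b, float t) {
    return { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y),
             a.z + t * (b.z - a.z), a.w + t * (b.w - a.w) };
}

// The caller sorts the endpoints so the inside vertex always comes first;
// the fixed operand order makes A->B and B->A bitwise identical.
static Vert clipEdge(const Vert& inside, const Vert& outside) {
    float di = dist(inside);
    float t = di / (di - dist(outside));
    return lerp(inside, outside, t);
}
```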
 
That's easy enough to solve by always clipping in the same direction (from the vertex on the back side of the plane to the one on the front side, or the reverse).
Yep, this is what I did to fix it.
 
It's hard to define 'efficient' (what hardware are you working on?), but to avoid going crazy trying to implement a robust clipper (it ain't easy, believe me :) ) I'd go for homogeneous rasterization.

It's relatively simple to implement, you can easily clip your triangles (and disable clipping when you don't need it, to make it faster), and throwing in extra clipping planes is a piece of cake. Moreover it maps reasonably well to modern CPUs and GPUs (especially Cell...).
Thanks for the links, everyone. I'm not actually writing a software rasterizer, just curious. By 'efficient' I really just meant a rasterizer that's written with some thought to performance, not one that's just written to work. The few articles I read a while back didn't focus much on performance, just functionality.
 
Why would rasterization be less vectorized than shading?
I don't know, but during past Larrabee discussions some people thought the rasterization threads would run on the scalar ALUs while shading runs on the vector units. Presumably they were thinking rasterization would require fewer FLOPs than shading. That seems to make more and more sense as time goes on.

Prior to those discussions I hadn't envisioned Larrabee working in this way, so it got me thinking about how a software rasterizer works and how it would be implemented on Larrabee or GPUs. One thought was: does it even make sense to have separate scalar units if rasterization is amenable to vector (SIMD) processing?

As I'm sure you know, GPUs typically rasterize blocks of pixels and shading happens in parallel on distinct hardware. With a single-threaded software rasterizer I envision it needs to rasterize a number of pixels, store the data in cache, shade the pixels/fragments, and perform the other operations like depth tests and blending, then repeat for other groups of pixels.

Other options I thought of are performing each step as a distinct pass and storing the data in RAM between passes, and finally using multiple threads to keep each stage running somewhat in parallel. That requires synchronization mechanisms between stages.
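Roughly what I have in mind for the first option, as a sketch (every type and function here is an invented stand-in, not any real renderer):

```cpp
// Sketch of the batch-at-a-time idea: rasterize a small group of
// fragments, shade them while they're still hot in cache, then
// depth-test and write, and repeat for the next group.
#include <cstddef>

struct Fragment { int x, y; float z; unsigned rgba; };
struct Framebuffer { float* depth; unsigned* color; int pitch; };

constexpr std::size_t kBatch = 64;   // sized so one batch fits in L1

// Hypothetical stages; a real rasterizer would fill these in.
std::size_t nextFragments(Fragment* out, std::size_t max);  // 0 when done
void shade(Fragment* frags, std::size_t n);                 // fills rgba

void drawTriangle(Framebuffer& fb) {
    Fragment batch[kBatch];
    std::size_t n;
    while ((n = nextFragments(batch, kBatch)) != 0) {
        shade(batch, n);   // batch is still cache-resident from rasterization
        for (std::size_t i = 0; i < n; ++i) {   // depth test, then write
            std::size_t idx = batch[i].y * fb.pitch + batch[i].x;
            if (batch[i].z < fb.depth[idx]) {
                fb.depth[idx] = batch[i].z;
                fb.color[idx] = batch[i].rgba;
            }
        }
    }
}
```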
 
I don't know, but during past Larrabee discussions some people thought the rasterization threads would run on the scalar ALUs while shading runs on the vector units. Presumably they were thinking rasterization would require fewer FLOPs than shading.

Prior to those discussions I hadn't envisioned Larrabee working in this way, so it got me thinking about how a software rasterizer works and how it would be implemented on Larrabee or GPUs. One thought was: does it even make sense to have separate scalar units if rasterization is amenable to vector (SIMD) processing?
Rasterization indeed only takes a fraction of the computing power compared to shading, but I don't think it makes sense to have scalar units just for that. My expectation is that it just runs as separately scheduled tasks using the entire core, including SIMD, whenever possible. However, there's definitely a generic need for scalar integer handling; I expect that to be a separate instruction port with SMT support.
As I'm sure you know, GPUs typically rasterize blocks of pixels and shading happens in parallel on distinct hardware. With a single-threaded software rasterizer I envision it needs to rasterize a number of pixels, store the data in cache, shade the pixels/fragments, and perform the other operations like depth tests and blending, then repeat for other groups of pixels.

Other options I thought of are performing each step as a distinct pass and storing the data in RAM between passes, and finally using multiple threads to keep each stage running somewhat in parallel. That requires synchronization mechanisms between stages.
Yes, those are the general approaches. The interesting thing about software rendering is that you don't have to pin yourself down to one approach. There are many different ways to horizontally and vertically split up the graphics pipeline. Also note that the cache hierarchy automatically adapts to the access pattern: during the same task you might have things that stick around in the cache while other data spills to RAM.

So I don't think we can say much about Larrabee at this time. And as I've mentioned before the approach can change dramatically from one driver generation to the next. The software could even adapt itself to the application's behavior...
 
For a more modern, SIMD approach, the only reference I am aware of is Nick Capen's description of half-space rasterization
Ah, so that's Nick's last name. I thought he was like Bono and didn't need a second name. :D
Actually it's Capens with the 's' attached, and my first name is Nicolas in real life. But I should start considering officially changing it to just Nick. ;)
 
The most efficient rasterizer depends on what it is that you want to rasterize.
The only post that relates to this so far seems to be the one on pixel-sized triangles.
Indeed, triangle size can influence the strategy to take when rendering, as can the expected overdraw. Nearly all 3D hardware uses a z-buffer, but in some cases it's better to eliminate overdraw at a higher level than per-pixel.

Another point is the desired quality and performance, and of course the properties of the target platform.
If I were to write a rasterizer for a Game Boy Advance, I'd take a completely different strategy than for a regular desktop PC, for example.
It all depends on what the hardware is good at.
Technically, you start by designing the content to suit the hardware.
 