Efficient software rasterization

3dcgi

Does anyone have any links to help me learn how an efficient software rasterizer would work?

I've already looked around a bit and read a couple articles, but I figure someone might have some good links saved.

I'm not looking to write a software rasterizer or become an expert, just to understand how an efficient one works.

I'm looking for things like instruction mix, and whether data is processed in small chunks so it stays in cache throughout the rendering process, or spilled to memory between stages (i.e. vertex processing -> rasterization and rasterization -> shading).

Thanks in advance for any help.
 
I'm looking for things like instruction mix...
What exactly do you need to know about it?
...whether data is processed in small chunks so it stays in cache throughout the rendering process, or spilled to memory between stages (i.e. vertex processing -> rasterization and rasterization -> shading).
ExtremeTech has a page on SwiftShader 2.0's architecture. Basically, yes, it will try to keep things in cache as much as possible.
 
What exactly do you need to know about it?
I was thinking about the percentage of scalar vs. vector instructions. This question came about from past theorizing that Larrabee will have separate scalar and vector units with hyperthreading, which got me wondering whether rasterization could run fast if it's mostly on the scalar unit while shading is vectorized.

Basically, yes, it will try to keep things in cache as much as possible.
Is SwiftShader aware of how much cache is available on the processor it's running on, or does it assume a certain amount is available?

Thanks.
 
Does anyone have any links to help me learn how an efficient software rasterizer would work?
It's hard to define 'efficient' (what hardware are you working on?), but to avoid going crazy trying to implement a robust clipper (it ain't easy, believe me :) ) I'd go for homogeneous rasterization.

It's relatively simple to implement, you can easily clip your triangles (and disable clipping when you don't need it, to make it faster), and throwing in extra clipping planes is a piece of cake. Moreover it maps reasonably well to modern CPUs and GPUs (especially Cell...).
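To make that concrete, here's a minimal sketch in the style of Olano & Greer's clipless setup, under my own assumptions (the vertex values, grid resolution, and names are all just for illustration):

```cpp
// Minimal sketch of homogeneous rasterization (Olano & Greer style).
// Vertices stay in clip space (x, y, w); there's no divide by w and no
// geometric clipper. Triangles crossing w = 0 need an extra sign test
// that this sketch omits. All names are illustrative.
#include <cstdio>

struct ClipVert { float x, y, w; };   // clip-space position (z omitted here)

struct Edge { float a, b, c; };       // E(px, py) = a*px + b*py + c

// Each edge function is the cross product of the other two vertices,
// treating (x, y, w) as a 3-vector.
static Edge edgeFrom(const ClipVert& p, const ClipVert& q) {
    return { p.y * q.w - p.w * q.y,
             p.w * q.x - p.x * q.w,
             p.x * q.y - p.y * q.x };
}

static float eval(const Edge& e, float px, float py) {
    return e.a * px + e.b * py + e.c;
}

int main() {
    ClipVert v0 = { -0.8f, -0.8f, 1.0f };
    ClipVert v1 = {  0.8f, -0.8f, 1.0f };
    ClipVert v2 = {  0.0f,  0.8f, 1.0f };

    Edge e0 = edgeFrom(v1, v2);   // edge opposite v0
    Edge e1 = edgeFrom(v2, v0);   // edge opposite v1
    Edge e2 = edgeFrom(v0, v1);   // edge opposite v2

    // The determinant equals e0 evaluated at v0; its sign gives the facing.
    float det = v0.x * e0.a + v0.y * e0.b + v0.w * e0.c;
    if (det <= 0.0f) return 0;    // back-facing or degenerate: culled

    // Walk a coarse grid over NDC [-1,1]^2 as a stand-in for pixel centers.
    for (int j = 15; j >= 0; --j) {
        for (int i = 0; i < 16; ++i) {
            float px = -1.0f + (i + 0.5f) / 8.0f;
            float py = -1.0f + (j + 0.5f) / 8.0f;
            bool inside = eval(e0, px, py) >= 0.0f &&
                          eval(e1, px, py) >= 0.0f &&
                          eval(e2, px, py) >= 0.0f;
            putchar(inside ? '#' : '.');
        }
        putchar('\n');
    }
    return 0;
}
```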
 
I was thinking about the percentage of scalar vs. vector instructions. This question came about from past theorizing that Larrabee will have separate scalar and vector units with hyperthreading, which got me wondering whether rasterization could run fast if it's mostly on the scalar unit while shading is vectorized.
Why would rasterization be less vectorized than shading?
Is SwiftShader aware of how much cache is available on the processor it's running on, or does it assume a certain amount is available?
The public demo is more or less tuned for Core 2, but it's not like performance collapses if you have less cache; Phenom is doing OK and appears to be limited more by other things than cache.

Things might be quite different for Larrabee though. Given the bandwidth and arithmetic throughput, it might try hard to keep certain data in cache while other data is allowed to spill to RAM.

As the software evolves I expect to see quite interesting changes in performance characteristics.
 
...but to avoid going crazy trying to implement a robust clipper (it ain't easy, believe me :) ) I'd go for homogeneous rasterization
Could you elaborate on the robustness issues you encountered when implementing a clipper? I find it quite straightforward.
...(and disable clipping when you don't need it, to make it faster), and throwing in extra clipping planes is a piece of cake.
That's no issue for a geometrical clipper either.
 
Could you elaborate on the robustness issues you encountered when implementing a clipper? I find it quite straightforward.
On PS2 I worked on a fast (clip-space) clipper and I had robustness issues due to the order certain floating-point operations were done in (clipping the edge A->B and clipping B->A didn't have the same outcome), and also with T-junctions (unfortunately I had to deal with those) when the longer partially shared edge was being clipped and the shorter one was not.
 
That's no issue for a geometrical clipper either.
Correct. Though with homogeneous rasterization clipping is a 'natural' operation: you basically don't need to write and maintain any extra code to handle it, which is neat.
 
I'm looking for things like instruction mix, and whether data is processed in small chunks so it stays in cache throughout the rendering process

From the article I linked to:

"It is advanced in the sense that it has many nice properties the classical scanline conversion algorithm does not have. The main problem with the old algorithm is that it's hard to process pixels in parallel. It identifies filled scanlines, but this is only suited for processing one pixel at a time. A much more efficient approach is to process 2x2 pixels together. These are called quads. By sharing some setup cost per quad, and using advanced parallel instructions, this results in a significant speedup. Some of the current graphics hardware also uses quad pixel pipelines."
 
The funny thing with software rasterization is that at the point where triangles get pixel-sized, all this clipping and such becomes trivial; effectively all you need is a really fast z-buffered scatter (point drawing) and some post-processing to clean up the artifacts (assuming you've got the LOD problem solved, which is easy IMO).
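As a sketch, that scatter is little more than a depth-tested store (buffer layout invented for illustration):

```cpp
// Sketch of the z-buffered scatter: a pixel-sized triangle reduces to a
// single depth-tested point write. The buffer layout is made up.
inline void scatter(float* depth, unsigned* color, int pitch,
                    int x, int y, float z, unsigned rgba) {
    int i = y * pitch + x;
    if (z < depth[i]) {   // closer than what's already there?
        depth[i] = z;
        color[i] = rgba;
    }
}
```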
 
clipping the edge A->B and clipping B->A didn't have the same outcome
That's easy enough to solve by always clipping in the same direction (from the vertex on the back side of the plane to the one on the front side, or the reverse).
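In code the trick looks roughly like this; a minimal sketch assuming an OpenGL-style near plane, with all names invented:

```cpp
// Sketch: clip an edge against a plane in an order-independent way by
// always interpolating from the inside vertex toward the outside one.
struct Vert { float x, y, z, w; };

// Signed distance to the plane; e.g. the near plane z >= -w.
static float dist(const Vert& v) { return v.z + v.w; }

static Vert lerp(const Vert& a, const Vert& b, float t) {
    return { a.x + t * (b.x - a.x), a.y + t * (b.y - a.y),
             a.z + t * (b.z - a.z), a.w + t * (b.w - a.w) };
}

// The caller sorts the endpoints so the inside vertex always comes first;
// the fixed operand order makes A->B and B->A bitwise identical.
static Vert clipEdge(const Vert& inside, const Vert& outside) {
    float di = dist(inside);
    float t = di / (di - dist(outside));
    return lerp(inside, outside, t);
}
```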
 
That's easy enough to solve by always clipping in the same direction (from the vertex on the back side of the plane to the one on the front side, or the reverse).
Yep, this is what I did to fix it.
 
It's hard to define 'efficient' (what hardware are you working on?), but to avoid going crazy trying to implement a robust clipper (it ain't easy, believe me :) ) I'd go for homogeneous rasterization.

It's relatively simple to implement, you can easily clip your triangles (and disable clipping when you don't need it, to make it faster), and throwing in extra clipping planes is a piece of cake. Moreover it maps reasonably well to modern CPUs and GPUs (especially Cell...).
Thanks for the links, everyone. I'm not actually writing a software rasterizer, just curious. By 'efficient' I really just meant a rasterizer that's written with some thought to performance, not one that's just written to work. The few articles I read a while back didn't focus much on performance, just functionality.
 
Why would rasterization be less vectorized than shading?
I don't know, but during past Larrabee discussions some people thought the rasterization threads would run on the scalar ALUs while shading runs on the vector units. Presumably they were thinking rasterization would require fewer FLOPs than shading. That seems to make more and more sense as time goes on.

Prior to those discussions I hadn't envisioned Larrabee working in this way, so it got me thinking about how a software rasterizer works and how it would be implemented on Larrabee or GPUs. One thought was: does it even make sense to have separate scalar units if rasterization is amenable to vector (SIMD) processing?

As I'm sure you know, GPUs typically rasterize blocks of pixels and shading happens in parallel on distinct hardware. With a single-threaded software rasterizer I envision it needs to rasterize a number of pixels, store the data in cache, shade the pixels/fragments, and perform the other operations like depth tests and blending, then repeat for other groups of pixels.

Other options I thought of are performing each step as a distinct pass and storing the data in RAM between passes, and finally using multiple threads to keep each stage running somewhat in parallel. That requires synchronization mechanisms between stages.
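Roughly what I have in mind for the first option, as a sketch (every type and function here is an invented stand-in, not any real renderer):

```cpp
// Sketch of the batch-at-a-time idea: rasterize a small group of
// fragments, shade them while they're still hot in cache, then
// depth-test and write, and repeat for the next group.
#include <cstddef>

struct Fragment { int x, y; float z; unsigned rgba; };
struct Framebuffer { float* depth; unsigned* color; int pitch; };

constexpr std::size_t kBatch = 64;   // sized so one batch fits in L1

// Hypothetical stages; a real rasterizer would fill these in.
std::size_t nextFragments(Fragment* out, std::size_t max);  // 0 when done
void shade(Fragment* frags, std::size_t n);                 // fills rgba

void drawTriangle(Framebuffer& fb) {
    Fragment batch[kBatch];
    std::size_t n;
    while ((n = nextFragments(batch, kBatch)) != 0) {
        shade(batch, n);   // batch is still cache-resident from rasterization
        for (std::size_t i = 0; i < n; ++i) {   // depth test, then write
            std::size_t idx = batch[i].y * fb.pitch + batch[i].x;
            if (batch[i].z < fb.depth[idx]) {
                fb.depth[idx] = batch[i].z;
                fb.color[idx] = batch[i].rgba;
            }
        }
    }
}
```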
 
I don't know, but during past Larrabee discussions some people thought the rasterization threads would run on the scalar ALUs while shading runs on the vector units. Presumably they were thinking rasterization would require fewer FLOPs than shading.

Prior to those discussions I hadn't envisioned Larrabee working in this way, so it got me thinking about how a software rasterizer works and how it would be implemented on Larrabee or GPUs. One thought was: does it even make sense to have separate scalar units if rasterization is amenable to vector (SIMD) processing?
Rasterization indeed only takes a fraction of the computing power compared to shading, but I don't think it makes sense to have scalar units just for that. My expectation is that it just runs as separately scheduled tasks using the entire core, including SIMD, whenever possible. However, there's definitely a generic need for scalar integer handling; I expect that to be a separate instruction port with SMT support.
As I'm sure you know, GPUs typically rasterize blocks of pixels and shading happens in parallel on distinct hardware. With a single-threaded software rasterizer I envision it needs to rasterize a number of pixels, store the data in cache, shade the pixels/fragments, and perform the other operations like depth tests and blending, then repeat for other groups of pixels.

Other options I thought of are performing each step as a distinct pass and storing the data in RAM between passes, and finally using multiple threads to keep each stage running somewhat in parallel. That requires synchronization mechanisms between stages.
Yes, those are the general approaches. The interesting thing about software rendering is that you don't have to pin yourself down to one approach. There are many different ways to horizontally and vertically split up the graphics pipeline. Also note that the cache hierarchy automatically adapts to the access pattern: during the same task you might have things that stick around in the cache while other data spills to RAM.

So I don't think we can say much about Larrabee at this time. And as I've mentioned before the approach can change dramatically from one driver generation to the next. The software could even adapt itself to the application's behavior...
 
For a more modern, SIMD approach, the only reference I am aware of is Nick Capen's description of half-space rasterization
Ah, so that's Nick's last name. I thought he was like Bono and didn't need a second name. :D
Actually it's Capens with the 's' attached, and my first name is Nicolas in real life. But I should start considering officially changing it to just Nick. ;)
 
The most efficient rasterizer depends on what it is that you want to rasterize.
The only post that relates to this so far seems to be the one on pixel-sized triangles.
Indeed, triangle size can influence the strategy to take when rendering, as can the expected overdraw. Nearly all 3D hardware uses a z-buffer, but in some cases it's better to eliminate overdraw at a higher level than per-pixel.

Another point is the desired quality and performance, and of course the properties of the target platform.
If I were to write a rasterizer for a Game Boy Advance, I'd take a completely different strategy than for a regular desktop PC, for example.
It all depends on what the hardware is good at.
Technically, you start by designing the content to suit the hardware.
 