Advanced Rasterization

Nick

I have implemented the algorithm from Triangle Scan Conversion using 2D Homogeneous Coordinates in software, but the performance is quite disappointing. I process 2x2 pixels at a time, using SSE assembly, but it's still almost twice as slow as a C implementation of a regular scan-line converter.

The problem is that it evaluates three half-space functions for all pixels in the bounding rectangle of a triangle. As illustrated in the article, that's often a lot more pixels than just the ones that have to be filled.
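To make the discussion concrete, here's a scalar C sketch of the brute-force approach I mean (the names are mine, and it only counts covered pixels instead of shading them):

```c
/* Edge function: positive when (px,py) lies on the inside of the
 * directed edge (ax,ay)->(bx,by), for counter-clockwise triangles. */
static int edge(int ax, int ay, int bx, int by, int px, int py)
{
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

/* Brute-force half-space rasterizer: evaluate all three edge
 * functions for every pixel in the bounding rectangle. Here it only
 * counts the covered pixels; a real version would shade them. */
int raster_bbox_count(int x0, int y0, int x1, int y1, int x2, int y2)
{
    int minx = x0 < x1 ? (x0 < x2 ? x0 : x2) : (x1 < x2 ? x1 : x2);
    int maxx = x0 > x1 ? (x0 > x2 ? x0 : x2) : (x1 > x2 ? x1 : x2);
    int miny = y0 < y1 ? (y0 < y2 ? y0 : y2) : (y1 < y2 ? y1 : y2);
    int maxy = y0 > y1 ? (y0 > y2 ? y0 : y2) : (y1 > y2 ? y1 : y2);
    int x, y, count = 0;

    for (y = miny; y <= maxy; y++)
        for (x = minx; x <= maxx; x++)
            if (edge(x0, y0, x1, y1, x, y) >= 0 &&
                edge(x1, y1, x2, y2, x, y) >= 0 &&
                edge(x2, y2, x0, y0, x, y) >= 0)
                count++;
    return count;
}
```

For a thin diagonal triangle the bounding rectangle contains several times more pixels than the triangle covers, and every one of them pays for three edge evaluations.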

So, I was wondering if anyone knows how this is handled in modern hardware. Do they use extra tricks to quickly determine which quads don't lie inside the triangle? Or do they just use raw processing power? Any ideas that might be of use to me?

Thanks.
 
Nick said:
So, I was wondering if anyone knows how this is handled in modern hardware. Do they use extra tricks to quickly determine which quads don't lie inside the triangle? Or do they just use raw processing power? Any ideas that might be of use to me?
You've got to realise that what is the "best" way to do something in SW is not necessarily the best way in HW and vice versa.
 
Simon F said:
You've got to realise that what is the "best" way to do something in SW is not necessarily the best way in HW and vice versa.

I gave that advice to ole Nicolas many times, but he just won't listen ;)
 
Simon F said:
You've got to realise that what is the "best" way to do something in SW is not necessarily the best way in HW and vice versa.
Don't worry, I fully realize that.

But when people call an algorithm 'efficient' for hardware implementation, they mostly mean that the operations are done in parallel and the total time to complete a task is minimal in average or best-case situations. What they often do not mention is that extra parallelism requires extra silicon, and that worst-case situations are not handled optimally. So, supposing that hardware rasterization just uses brute processing power and scans the whole bounding rectangle, this implies that sufficient die area is spent on this stage to keep the pipelines full. If, on the other hand, a more sophisticated algorithm is used, it is quite possible that fewer transistors are required, so the rasterization is just overall more efficient.

So, while you're absolutely right that many algorithms suited for hardware are not optimal for software, there truly are many situations where there's a strong relationship. When a new software algorithm requires half as many operations as the previous one, chances are good that a hardware implementation will either double performance or halve the required die space, or some combination of the two. Vice versa, I've experienced many times that when I learned more about hardware implementations, applying those methods to my software resulted in a speedup.

Although I might be out of luck this time, I was hoping someone knew about such optimizations of this algorithm. All ideas welcome...
 
Scali said:
I gave that advice to ole Nicolas many times, but he just won't listen ;)
That's quite ironic, Mr. Scali, because this topic is very much related to something you convinced me to do... ;)

I want to try processing 2x2 pixels, quads, totally in parallel with SSE/MMX. Although this complicates the rasterization and raster operation stages, shaders could benefit from this a lot (hoping that L1 caches make up for the lack of registers). It might also make texture filtering significantly faster.

So, what I need is 2x2 masks telling me which pixels are covered. The algorithm lends itself perfectly to this, since the SSE compare instructions (testing which half-space a pixel lies in) produce masks which are easy to AND together to get the total coverage mask. When there's no coverage, it's easy to skip the pixel pipeline. And very luckily, MMX has instructions to selectively write pixels based on the mask. So all this is implemented pretty much as efficiently as it gets. Unfortunately, simply filling the polygon with a constant color is still twice as slow on average as using the classical edge-stepping algorithm, because the latter does no extra computations per pixel: it just finds the endpoints of a scanline and then fills it with a loop that only has to write the color.
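In scalar C the per-quad mask computation amounts to something like this (a sketch with made-up names; the real version does the compares and ANDs four pixels at a time in SSE registers and then extracts the sign bits as a mask):

```c
/* Coverage mask for one 2x2 quad. e0..e2 hold the three edge-function
 * values at the quad's four pixels; a set bit means that pixel is
 * covered. An empty mask lets the whole quad skip the pixel pipeline. */
unsigned quad_mask(const int e0[4], const int e1[4], const int e2[4])
{
    unsigned mask = 0;
    int i;
    for (i = 0; i < 4; i++)
        if (e0[i] >= 0 && e1[i] >= 0 && e2[i] >= 0)  /* inside all three half-spaces */
            mask |= 1u << i;
    return mask;
}
```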

Obviously, I don't want to end up with a pixel pipeline that is maybe twice as fast and a rasterizer that immediately halves my fillrate. Ideally, what I need is a way to identify non-covered and fully covered quads much more quickly, so that computing the mask with SSE is only done at the edges of a polygon.

I'd be very grateful if you had any ideas to solve this problem. You're not limited to this algorithm of course, though it seems like a good starting point to me. Thanks.
 
Nick said:
That's quite ironic, Mr. Scali, because this topic is very much related to something you convinced me to do...

How is that?
I recall advising you not to shoehorn a software renderer into a D3D9 API framework, yet you still seem bent on compiling D3D9 shaders? :)

So what exactly is it that you are trying to do, and how exactly did I convince you?

As for your problem, I am not too familiar with the alternative homogeneous rasterizing method, but it seems to me that it introduces more problems with the generation of this mask than regular scanline conversion does?
With regular scanline conversion it's quite simple I guess, you can interpolate 2 scanlines at the same time, and both start at 'outside'... the moment they are both 'inside', you can assume 2x2 blocks to be inside, until the right edge comes near (at all times you know exactly how far each edge is from your current rendering position).

Oh, as for the hardware solution... Of course the write mask for the quad is very simple there. Skipping entire quads is often done via early-z and a hierarchical z-buffer of some sort (or I believe some people call it z-compression).
The basic idea is to keep the minimum z-value of all pixels in the incoming quad, and the maximum z-value of the same pixels in the z-buffer. If the first is greater than the second, then all pixels can be rejected, since every one of them would fail a less-than depth test.
This can be applied at multiple levels, so even larger parts than just single quads could be rejected from a triangle early.

Obviously in hardware this solution is good, because it's very easy to make the circuitry that finds these z values, and insert them early in the pipeline so there is no additional cost. In software this solution makes very little sense, because getting the min/max z-values is not 'free'.
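Still, to make the scheme concrete, here's a scalar C sketch (the tile granularity, buffer dimensions and less-than depth test are my assumptions; hardware does all this in dedicated circuitry):

```c
#define HIZ_W 64  /* one entry per 2x2 quad; a 128x128 target, assumed */
#define HIZ_H 64

/* Coarse z-buffer: the maximum (farthest) depth of each quad. */
typedef struct { float zmax[HIZ_H][HIZ_W]; } HiZ;

void hiz_clear(HiZ *h)
{
    int x, y;
    for (y = 0; y < HIZ_H; y++)
        for (x = 0; x < HIZ_W; x++)
            h->zmax[y][x] = 1.0f;  /* far plane */
}

/* Reject a quad when even its nearest incoming pixel lies behind the
 * farthest depth already stored: every pixel would fail a less-than
 * depth test, so the shading work can be skipped entirely. */
int hiz_reject(const HiZ *h, int tx, int ty, float tri_zmin)
{
    return tri_zmin > h->zmax[ty][tx];
}

/* After writing a quad, pull the stored maximum down. */
void hiz_update(HiZ *h, int tx, int ty, float quad_zmax)
{
    if (quad_zmax < h->zmax[ty][tx])
        h->zmax[ty][tx] = quad_zmax;
}
```

The same reject test can be layered over larger tiles to throw away bigger regions at once.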
 
Scali said:
I recall advising you not to shoehorn a software renderer into a D3D9 API framework, yet you still seem bent on compiling D3D9 shaders? :)
Yes, that was the last piece of 'advice' you gave me in our very lengthy e-mail conversations. ;) But I also told you that:

- People generally don't want to take the effort to learn a new interface.
- My own interface is still there if you want, rasterization and shaders are orthogonal to it.
- I can provide extensions to the DirectX interface to access functions useful to software rendering.
- It's a hobby. It has taught me things I would never have learned otherwise.
As for your problem, I am not too familiar with the alternative homogenous rasterizing method, but it seems to me like it introduces more problems with the generation of this mask than the regular scanline conversion?
It's quite simple actually. For every edge there's a function that is positive on one side and negative on the other. When all three half-space functions are positive, the point is inside the triangle. The functions are linear, so it's easy to interpolate them over a rectangle covering the triangle.
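Concretely, because an edge function E(x,y) = A*x + B*y + C is linear, stepping one pixel right adds A and one row down adds B, so after setup each pixel costs one addition per edge. A small C sketch (names are mine):

```c
typedef struct { int A, B, C; } Edge;  /* E(x,y) = A*x + B*y + C */

/* Direct evaluation: two multiplies and two adds. */
int edge_eval(const Edge *e, int x, int y)
{
    return e->A * x + e->B * y + e->C;
}

/* Fill one row of edge values incrementally: evaluate once at the
 * left end, then add A per pixel instead of redoing the multiplies. */
void edge_row(const Edge *e, int x, int y, int n, int out[])
{
    int i, E = edge_eval(e, x, y);
    for (i = 0; i < n; i++) {
        out[i] = E;
        E += e->A;
    }
}
```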
With regular scanline conversion it's quite simple I guess, you can interpolate 2 scanlines at the same time, and both start at 'outside'... the moment they are both 'inside', you can assume 2x2 blocks to be inside, until the right edge comes near (at all times you know exactly how far each edge is from your current rendering position).
That works of course, but it adds operations to the classical algorithm. Getting the mask out of scanline limits just ain't trivial. This might sound demanding, but there's no way I'm going to sacrifice fillrate. Every stage of my new renderer has to be at least as fast as the previous one (average case).

There's one new idea I have come up with lately: cutting corners, literally. Since the algorithm is about twice as slow as the previous one, it would already help a lot to avoid non-covered quads. On average a triangle doesn't even cover half the quads in its bounding rectangle. This only requires new endpoints for the inner loop (in the horizontal direction). These could be computed by very coarsely tracing the triangle's edges using integer math. It needs a tiny bit of setup to start on the 'outermost scanline', but then the cost is just two additions per two scanlines. For small triangles there is no benefit, so the setup can be skipped. For very big triangles it might make sense to compute coarse inner edge limits as well, to fill completely covered quads rapidly.

Sigh, I'm always hoping someone knows the magical algorithm, but I always end up doing things myself. :rolleyes: Thanks anyway, I'll keep you guys updated on the progress. In the meantime new ideas are still very welcome. :!:
 
I coded something like this a while ago, and the optimization I worked through, but never got around to implementing, was as follows:

1. Snap the vertices onto a low-resolution grid.
2. Rasterize on the low-res grid, where each block is 2x2, 4x4 or 8x8 pixels.
3. Apply the half-space algorithm to the entire block.
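The block test in step 3 can be done conservatively with two corner evaluations per edge, since a linear edge function takes its extremes at opposite corners of a block. A sketch (my naming; this is the idea, not my original code):

```c
typedef enum { BLOCK_OUT, BLOCK_IN, BLOCK_PARTIAL } BlockClass;

/* Classify a size x size block at (x,y) against one edge function
 * E(x,y) = A*x + B*y + C. The corner that maximizes E tells us if
 * the block is entirely outside; the corner that minimizes E tells
 * us if it is entirely inside. Only PARTIAL blocks need the
 * per-pixel half-space tests. */
BlockClass classify_block(int A, int B, int C, int x, int y, int size)
{
    int s = size - 1;
    /* corner offsets picked by the signs of A and B */
    int maxE = A * (x + (A > 0 ? s : 0)) + B * (y + (B > 0 ? s : 0)) + C;
    int minE = A * (x + (A > 0 ? 0 : s)) + B * (y + (B > 0 ? 0 : s)) + C;

    if (maxE < 0)  return BLOCK_OUT;   /* no pixel can be inside */
    if (minE >= 0) return BLOCK_IN;    /* every pixel is inside  */
    return BLOCK_PARTIAL;              /* edge crosses the block */
}
```

A full triangle classifier ANDs this over the three edges: OUT on any edge rejects the block, IN on all three accepts it wholesale.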

-Evan
 
Nick said:
If, on the other hand, a more sophisticated algorithm is used, it is quite possible that less transistors are required so the rasterization is just overall more efficient.

Still, the cost of operations in cycles on a general-purpose processor correlates very poorly with the die area they take. Compare the cost of a multiplication versus a comparison on the processor, and then in hardware.
 
Nick said:
Sigh, I'm always hoping someone knows the magical algorithm, but I always end up doing things myself.

I think the magical 'algorithm' is hardware :)
As the examples I made showed you (and now Doom3 as well, of course), even with ancient non-shader video cards or onboard GPUs you can do per-pixel lighting much faster than entirely on the latest CPU.
Which is probably the reason why software rasterization hasn't really developed much since Quake 1.

I believe that the Quake 1 design may still be the most efficient today, apart from adding MMX/SSE optimizations of course.
The problem today is that we use way too many polys for any CPU to handle anyway. Look how much Doom3 is struggling with its skinning and shadow-volume generation on the CPU. If it also had to rasterize all the triangles, it would die completely.
CPUs are never going to catch up with GPUs anymore.
So, my advice would still be to try and do something different, if you're not using a GPU. Either non-realtime photorealistic stuff, or alternative means of rendering.
For example, I believe that raytracing is more efficient for a CPU if you want per-pixel light and shadows (perhaps with a hybrid approach, where the first pass is rasterized). Perhaps some clever subdivision scheme would also help software rendering, by keeping the polycount as low as possible.

But software rasterizing for games is dead. As your own Q3 viewer shows, you can barely run a Q3 level in 640x480 at a decent framerate on modern CPUs, with the quality of first generation 3d hardware at best.
Who would want that? Any onboard GPU can do that better, and it wouldn't require a fast CPU either, if a GPU was used.

So I would suggest taking your vertex processing pipeline and coupling it with fixed-function hardware, much like my examples did, or like Doom3 does on low-end hardware (and high-end, but that's just silly design).
Or if you want to go completely software, do something different than rasterizing, because rasterizing is not going to work.
Surely you must realize by now that there is no 'magical algorithm' that will make your rasterizer 10 times faster, so it can actually be used for modern games. Your rasterizer as it is right now, is already quite efficient, it is not going to get much better.
 
ehart said:
I coded something like this a while ago, and the optimization I worked through, but never got around to implementing, was as follows:

1. Snap the vertices onto a low-resolution grid.
2. Rasterize on the low-res grid, where each block is 2x2, 4x4 or 8x8 pixels.
3. Apply the half-space algorithm to the entire block.
That sounds like an interesting approach.

I'm not sure how you would implement it elegantly, though. Snapping onto a low-resolution grid seems non-trivial. A small triangle might cover a few pixels in high-res but miss the pixel centers in low-res. Or all vertices could be snapped to one center; that's not the intention either.

Do you have the details on this method? Thanks!
 
Dear Scali,

We've been through this discussion before. Software rendering still has its uses, to me. If you don't share that opinion that's perfectly fine, but please don't try to convince me to do something else. To sum up my arguments:

- There's nothing like my software on the market.
- I already had one contract to integrate it into a game engine. Other people show interest as well.
- I've written two articles about it. One for ShaderX2 and one for ShaderX3. Wolfgang is always excited to hear about my advancements.
- It has gotten me into nVIDIA. In fact I'm here right now, I'll give you my e-mail and phone number in a PM if you want.
- Games are not everything. Think about internet, educational software, CAD. They might want to use advanced shaders without requiring high framerate or resolution.
- Amateur developers are enthusiastic about being able to test shaders without expensive hardware. swShader is free and will remain free for educational purposes.
- I don't expect to sell dozens of licenses or turn it into a real business. It remains a fun hobby that helps pay for my education.
- Everybody works with hardware these days. There's little I can do there that hasn't been done before. I like sticking out.
- I'm still a student, so educating and challenging myself is useful no matter how little use the project has that I'm working on.
- Pixomatic only has DirectX 7-like features. Pixomatic sells for $10,000.00
- I don't wish to compete with modern hardware, but I'm beating the crap out of the reference rasterizer.
- You haven't seen half of the things I'm working on now.
- There's nothing like my software on the market.

I would almost start believing that you're trying to chase me away from this tiny but existent market. :rolleyes:

Anyway, no offense, but I would really very much appreciate it if you stopped telling me what to do unless you have something constructive and on-topic to say. I know you have the intelligence to say something useful and give me new ideas, even if it's not the magic algorithm. And I'm still very thankful for starting that ATI car demo.

Take care,

The Chosen One
 
Nick said:
We've been through this discussion before. Software rendering still has its uses, to me. If you don't share that opinion that's perfectly fine, but please don't try to convince me to do something else.

I never said you shouldn't do software rendering, I merely gave some ideas that would be more useful than what you are doing now.

- There's nothing like my software on the market.

Whether you interpret that as good or bad is up to you, I suppose ;)

- It has gotten me into nVIDIA. In fact I'm here right now, I'll give you my e-mail and phone number in a PM if you want.

Okay, what do you do there, and can you get me a job there as well, while you're at it? My current project is just about finished anyway, and I could use a new one :)

- Games are not everything. Think about internet, educational software, CAD. They might want to use advanced shaders without requiring high framerate or resolution.

Isn't the raytracing-thing I suggested pretty much a non-game thing?

- Amateur developers are enthusiastic about being able to test shaders without expensive hardware. swShader is free and will remain free for educational purposes.

Well, you can get SM2.0 hardware for about 60 euro these days, which is still considerably faster than your swShader on the fastest CPU today.
Makes me think that if these amateurs can afford a CPU that allows realtime swShader stuff, they can also afford an SM2.0 card :)

- Everybody works with hardware these days. There's little I can do there that hasn't been done before. I like sticking out.

But there is, as I said before.

- I'm still a student, so educating and challenging myself is useful no matter how little use the project has that I'm working on.

Oh yea, because of your attitude I forgot that this is still actually only your first renderer :)

- Pixomatic only has DirectX 7-like features. Pixomatic sells for $10,000.00

Pixomatic has the Abrash-name attached to it. We both know it's a pile of shit, but Abrash sells.

- I don't wish to compete with modern hardware, but I'm beating the crap out of the reference rasterizer.

My Java engine beats the crap out of the reference rasterizer, what's your point? :)
Refrast has entirely different goals, as I most probably have said before.

- You haven't seen half of the things I'm working on now.

So show me something then... Last thing I saw were the shaders I coded for you. Did you get cubemapping implemented yet?

- There's nothing like my software on the market.

Uhuh :)

I would almost start believing that you're trying to chase me away from this tiny but existent market.

Indeed, better try and get me on your team then ;)

Anyway, no offense, but I would really very much appreciate it if you stopped telling me what to do unless you have something constructive and on topic to say.

I do think my criticism was constructive.

I know you have the intelligence to to say something useful and give me new idea's, even if it's not the magic algorithm.

As I said before, the magic algorithm you seek doesn't exist, stop looking for it and put your energy into more constructive things.
 
Your post looks an awful lot like our e-mail conversations. ;)
Scali said:
I never said you shouldn't do software rendering, I merely gave some ideas that would be more useful than what you are doing now.
I appreciate your concern, but I think it's still up to me to decide what I'm feeling most happy with. It's like... someone telling me I should do another job that pays more, while I'm really happy with what I got now and I don't care about the extra money.
Whether you interpret that as good or bad is up to you, I suppose ;)
Ah, see, you start getting it. ;)
Okay, what do you do there, and can you get me a job there aswell, while you're at it? My current project is just about finished anyway, and I could use a new one :)
I'm on a two-month internship. I can't tell much about the project I'm working on, but some skills I got from software rendering are quite useful here. I could mention your name to some managers, if you tell me your true name. :rolleyes:
Isn't the raytracing-thing I suggested pretty much a non-game thing?
Yes, but internet, educational applications and CAD are not asking for a new raytracer. Well, I'm sure there's a market for it of course, but I can't possibly compete against professional tools, that already have run-time SIMD compilation technology. If I did it wouldn't be a hobby project any more. But I'll keep it in mind for the future...
Well you can get SM2.0 hardware for about 60e these days, which is still considerably faster than your swShader on the fastest CPU today.
Makes me think that if these amateurs can afford a CPU that allows realtime swShader stuff, they can also afford an SM2.0 card :)
Sure, but the mere availability of these cheap graphics cards doesn't mean people immediately 'upgrade'. Or would you replace a GeForce 4 Ti with a GeForce FX 5200? Furthermore, I'll be SM 4.0 compatible long before the hardware is affordable. Not only amateurs love that...
Oh yea, because of your attitude I forgot that this is still actually only your first renderer :)
What are you trying to say? This is my fourth generation renderer.
Pixomatic has the Abrash-name attached to it. We both know it's a pile of shit, but Abrash sells.
Either way it sells. They didn't include it in Unreal Tournament for nothing. And I would be more than happy to sell it for a fraction of their price.
My Java engine beats the crap out of the reference rasterizer, what's your point? :)
Refrast has entirely different goals, as I most probably have said before.
Refrast and your Java renderer, although respectable, both fail to use the CPU optimally. Neither is an option for a fallback. People who design their software to run on all x86 systems won't choose refrast.
So show me something then... Last thing I saw were the shaders I coded for you. Did you get cubemapping implemented yet?
Hints can be found above. Demos will be ready in a couple of months.
I do think my criticism was constructive.
Well, I certainly understand the way you look at it. In fact, if I hadn't continued to work on my renderer and started working with hardware, I would probably think the same way. So don't feel offended if I ignore your criticism, and I'd like to refer again to my first answer in this post. I'm happy with what I do. :D
As I said before, the magic algorithm you seek doesn't exist, stop looking for it and put your energy into more constructive things.
You would have given up long ago, wouldn't you? :LOL: Yesterday evening I managed to get a solid 30% performance boost, and I'm not even halfway through my list of ideas to try. The magic algorithm might not exist, but I'm getting close, and the results will be satisfying, to me...
 
Here's my magic algorithm:

Basically, I want to skip a row of empty quads quickly. So let's just do that. But simply skipping ~8 pixels at a time and testing for coverage on a big grid doesn't work, because we'd easily miss a part of the polygon. But we can't miss half-space changes: there's simply a positive side of the edge and a negative side. So first imagine we're rasterizing a vertical strip. We start on the left side, where the left edge function is negative and the right edge function is positive. When we step from one pixel to the next, we detect that we're in the strip when both are positive. When skipping pixels, however, we can detect that we missed the strip when the left edge function is positive and the right edge function is negative. When that happens, we just jump back, and we've got a good starting point for stepping one pixel (or actually one quad) at a time.

So, to implement this for triangles, I simply use three tight loops, one per edge function. Each of them skips ~8 pixels, stopping as soon as the corresponding edge function has become positive. When that happens, skip back to the previous position. Once the real scanline loop has started, we can jump out of it as soon as the coverage mask becomes empty (no need to test the bounding rectangle's edge explicitly).
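For one edge, the coarse skipping loop could look something like this (a scalar sketch with hypothetical names; 'step' would be ~8 pixels, and A is the edge function's per-pixel delta, positive for an edge we cross going inward):

```c
/* Skip along a scanline in coarse steps while the edge function E
 * stays negative (outside). Stop just before the step that would
 * cross to the positive side, so the per-pixel (or per-quad) loop
 * that follows cannot miss the real crossing. */
int coarse_left_edge(int E, int A, int x, int x_end, int step)
{
    while (E + A * step < 0 && x + step < x_end) {
        E += A * step;
        x += step;
    }
    return x;  /* the crossing lies within the next 'step' pixels */
}
```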

So the end result should be that for medium and large polygons I only scan roughly half of the pixels in the bounding rectangle, which, together with some optimizations I found in the setup, should give me my doubled fillrate. 8) Huge and tiny polygons could take different paths, although I hope to put it all in one elegant one-for-all function.
 
My method was to alter the rasterization rules used in the low-res scan conversion, so that all blocks touched were considered inside the triangle.

Your solution sounds good. It makes me want to start playing with SW rendering again. Have you considered calculating a scanline intersection, rather than stepping by eights?

-Evan
 
ehart said:
My method was to alter the rasterization rules used in the low-res scan conversion, so that all blocks touched were considered inside the triangle.
I've been thinking about that as well, but I can't seem to find a fast way to detect when a block is 'touched'. A variant of my new idea might do it, though: think of the vertical strip as a column of low-res blocks. I have to give that more thought...
Your solution sounds good. It makes me want to start playing with SW rendering again. Have you considered calculating a scanline intersection, rather than stepping by eights?
I certainly have considered that, but it requires divisions and conversions, which aren't exactly cheap. For the average polygon I'm hoping my method is a good compromise. Big polygons require more shading, so rasterization isn't that important (but still important enough for raw fillrate, of course), and small triangles have small bounding rectangles anyway (so I shouldn't spend a lot of effort trying to avoid empty quads). Anyway, there are still many ideas to try, and one of them has to be the magic algorithm. ;)

Thanks!
 
Ok, it's all starting to make sense now. 8)

I think checking which 8x8 blocks are touched is in a way equivalent to my per-scanline algorithm, except that it also works vertically. Checking the edge functions at the intersections of an 8x8 grid should reveal whether there is, or was, a polygon in between. The advantage is that it skips more pixels at once, but a polygon that only 'touches' an 8x8 block would cause a lot of empty quads to be scanned. So it's just a matter of granularity, although blocks somehow seem better than scanlines to me. 4x4 blocks might beat 8 pixels per scanline.

Either way, I got some nice parameters to tweak. ;)

Thanks again!
 