Advanced Rasterization

Scali said:
I don't think you understood it. What I said was the same as what ehart said, but when ehart said it, you acted like it was the first time you heard it.
I'm sorry, but what ehart said was certainly not the same. It was a lot more interesting:

Your 'idea' was to use the classical rasterization algorithm, but do two scanlines at once and extract the quad masks from that. While this would work perfectly well, it has no performance advantages. If you had to fill a triangle with a solid color, the classical algorithm would still be faster, simply because your algorithm is the same plus extra operations.

What ehart suggested was to somehow snap vertices to a low-res grid and determine coverage for big blocks from that, suited for the half-space algorithm. This approach was not only completely different from the classical algorithm, it also had the potential to be faster. And with a few modifications and tweaks, this eventually turned out to be the case.
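For reference, the half-space (or edge) functions discussed in this thread are just signed area tests that tell on which side of a triangle edge a point lies; a pixel is inside the triangle when it is on the inside of all three edges. A minimal scalar sketch (illustrative names only, assuming counter-clockwise winding):

    inline int edgeFunction(int x1, int y1, int x2, int y2, int px, int py)
    {
        // Positive on one side of the edge (x1,y1)-(x2,y2), negative on the other.
        return (x2 - x1) * (py - y1) - (y2 - y1) * (px - x1);
    }

    inline bool insideTriangle(int x1, int y1, int x2, int y2, int x3, int y3,
                               int px, int py)
    {
        // Inside the triangle means on the inside of all three edges.
        return edgeFunction(x1, y1, x2, y2, px, py) >= 0 &&
               edgeFunction(x2, y2, x3, y3, px, py) >= 0 &&
               edgeFunction(x3, y3, x1, y1, px, py) >= 0;
    }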
 
While this would work perfectly well, it has no performance advantages.

It does: once you're 'inside' you no longer have to test, because you know the entire quad will be inside; and since you start from the edges, you will never have a quad that is completely outside.

If you had to fill a triangle with a solid color, the classical algorithm would still be faster, simply because your algorithm is the same plus extra operations.

No, it eliminates operations once the quad is completely inside. You no longer need to test.
Effectively you get the same as you get now... A list of quads that are covered by the triangle, some of which are completely covered, some of which are partly covered.

What ehart suggested was to somehow snap vertices to a low-res grid

This was exactly what I meant. You do two scanlines at a time, both snapped to the same pixels.
But I suppose you didn't want to understand anyway.
Obviously I wouldn't suggest anything if it wasn't going to be any faster. What do you take me for?

But don't worry. I think your arrogant behaviour is disgusting, especially considering your ignorance, so I don't feel like helping you again. You managed that nicely.
 
Scali said:

I think that particular website got the specs mixed up with the original Extreme Graphics.
It also doesn't mention the bicubic texture-filtering, for example.

This site compares the 2 generations: http://www.rojakpot.com/default.aspx?location=3&var1=98&var2=2

That would seem more logical. The other specs would indicate that there is no difference between generation 1 and 2.
Sadly Intel doesn't seem to publish the specs. They only boast about Zone Rendering 2.

You should not take information from websites for granted without any investigation. Look here: http://www.xbitlabs.com/articles/chipsets/display/i915g_4.html

This synthetic fillrate benchmark clearly shows that the Extreme Graphics 2 is only 1x2.
 
This synthetic fillrate benchmark clearly shows that the Extreme Graphics 2 is only 1x2.

I believe it only shows that it is n x 2.
Also, this test seems to be with z-writes? If it is with z-writes, it could still get 533 Mpix/s with z-only or colour-only. And that may be what the other websites report.
I find it strange that multiple websites report the 533 Mpix/s spec anyway. And I also find it strange that the Extreme I already had 266 Mpix/s.
I would like to see a z-only and colour-only test on Extreme I and II.
Or some official Intel documentation on the chip specs. Since it is an unconventional design, perhaps the basic rules of fillrate and pipelines do not apply.
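For reference, a back-of-the-envelope check of these figures, assuming the usual relation fillrate = pixels per clock x core clock and a nominal 266 MHz core (an assumption, since Intel doesn't publish the specs):

    // Assumed relation: fillrate = pixels per clock * core clock.
    constexpr double coreClockMHz      = 266.67;               // assumed nominal core clock
    constexpr double onePixelPerClock  = coreClockMHz * 1.0;   // ~266 Mpix/s
    constexpr double twoPixelsPerClock = coreClockMHz * 2.0;   // ~533 Mpix/s, e.g. a z-only or colour-only fast path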

Anyway, it doesn't really matter that much. Even at 266 Mpix/s I doubt that Nicolas' software renderer can beat it. It will be interesting to do some benchmarks.
 
Scali said:
It does: once you're 'inside' you no longer have to test, because you know the entire quad will be inside; and since you start from the edges, you will never have a quad that is completely outside.
That's true, but the entire cost of the classical algorithm is in its setup and tracing the edges. With your method, you add setup cost and complexity for tracing edges. You have to deal with situations where edges are close to horizontal, and with quads split over the upper half and lower half of a triangle. For every row of quads, you would need three loops. One where the quads are partially covered, one where they're totally inside, and one where they're partially covered again. Not to mention that you would have to correctly deal with the situation of a vertical 'spike', where you're not really in any of these cases. And as you know, it's not really beneficial to have a lot of tests and jumps. For big polygons, yes, this is not a bad approach: the setup and tracing cost is low compared to the polygon's area.

The new algorithm, however, is efficient for small as well as big polygons. Setup cost of the half-space functions is really small, just a handful of SSE instructions. There aren't even any divisions involved. So for small polygons, where the extended classical algorithm wastes so much time because every quad becomes a special case, the new algorithm has no problem: just evaluate the half-space functions in one go, and the quad masks are ready. Big polygons are fast as well. Testing for coverage of 8x8 pixels is nearly as fast as computing a mask for one quad. So 64 pixels outside of it get skipped very fast, and 64 pixels totally inside of it get accepted immediately.
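To make that more concrete, a minimal sketch of the quad-mask evaluation (illustrative code with made-up names, not the actual implementation), assuming each edge has been set up as a linear function E(x,y) = A*x + B*y + C:

    #include <xmmintrin.h>  // SSE

    // Evaluate one half-space function for the four pixels of a 2x2 quad at once
    // and return a 4-bit mask with one bit per pixel that lies on the inside.
    inline int quadEdgeMask(float A, float B, float C, float x, float y)
    {
        const __m128 px = _mm_setr_ps(x, x + 1.0f, x,        x + 1.0f);
        const __m128 py = _mm_setr_ps(y, y,        y + 1.0f, y + 1.0f);

        __m128 e = _mm_add_ps(_mm_add_ps(_mm_mul_ps(_mm_set1_ps(A), px),
                                         _mm_mul_ps(_mm_set1_ps(B), py)),
                              _mm_set1_ps(C));

        return _mm_movemask_ps(_mm_cmpge_ps(e, _mm_setzero_ps()));
    }

    // The quad's coverage mask is simply the AND of the three edge masks:
    //   int coverage = quadEdgeMask(A0, B0, C0, x, y)
    //                & quadEdgeMask(A1, B1, C1, x, y)
    //                & quadEdgeMask(A2, B2, C2, x, y);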
This was exactly what I meant. You do two scanlines at a time, both snapped to the same pixels.
Skipping blocks is always better than skipping lines, if it can be done efficiently. And the half-space functions allow this. There is no mention of blocks in your approach, and you don't use the functions to rapidly and accurately get coverage info for a large region. So I find it really hard to believe that "this was exactly what you meant". Sorry.
But don't worry. I think your arrogant behaviour is disgusting, especially considering your ignorance, so I don't feel like helping you again. You managed that nicely.
Easy man. What are you getting so hung up about? You didn't reply to it when I first explained to you that it's not the kind of thing I was looking for, and now it's suddenly the central topic and I'm the one who 'ignored' it. I hate to tell you, but it's the most trivial-looking algorithm, an extension of the classical one. But although the idea itself is trivial, it's not that easy nor that efficient to implement. So it's not a brilliant and elegant algorithm at all. And please don't take offense at that; it was the first thing I came up with myself.

So who's really arrogant and ignorant? You present one idea and immediately expect it to be the best. If I was like that I wouldn't have asked any questions on this forum in the first place. True knowledge exists in knowing that you know nothing.
 
That's true, but the entire cost of the classical algorithm is in its setup and tracing the edges.

Don't you need to calc gradients anyway? The edges would only add two extra gradients, which should be relatively cheap.

For every row of quads, you would need three loops. One where the quads are partially covered, one where they're totally inside, and one where they're partially covered again.

You need to have code that handles partially covered quads anyway.

Setup cost of the half-space functions is really small, just a handful of SSE instructions.

But you said there is more per-pixel overhead.

Testing for coverage of 8x8 pixels is nearly as fast as computing a mask for one quad. So 64 pixels outside of it get skipped very fast, and 64 pixels totally inside of it get accepted immediately.

With the method I described, you can choose to snap to as large a grid as you like.

There is no mention of blocks in your approach, and you don't use the functions to rapidly and accurately get coverage info for a large region. So I find it really hard to believe that "this was exactly what you meant". Sorry.

Obviously the two scanlines correspond to the 2x2 blocks you were interested in. And obviously you could do larger blocks by either doing more scanlines at a time, or by doing the upper and lower scanline in the larger block first (if both are inside, everything between them has to be inside as well, per trapezoid).

You didn't reply to it when I first explained to you that it's not the kind of thing I was looking for, and now it's suddenly the central topic and I'm the one who 'ignored' it.

Since ehart mentioned something very similar, I didn't care. But then you start ranting about how I didn't contribute anything to the thread, which obviously I did.

But although the idea itself is trivial, it's not that easy nor that efficient to implement. So it's not a brilliant and elegant algorithm at all.

You obviously lack the skill and comprehension to implement it in an efficient and elegant way. But that has been my point in this thread the whole time.

You present one idea and immediately expect it to be the best.

Not at all, but the idea I presented was the same as the one ehart presented; the main difference is that the way ehart presented it somehow made you think about implementing it with the half-space algorithm (of which I am still not convinced anyway). The idea is still the same though: snap to a larger grid and work out coverage from there, because the coverage is always a single continuous block.
So don't go around saying that I didn't contribute or that my idea wasn't helpful while ehart's was. The only thing is that you didn't understand my idea. And then you are arrogant enough to try and explain to me that my idea is different from ehart's. You just think it's different for some reason.
Then again, you think a lot of things. Like that the average P4 2.66 GHz has ~20 GB/s bandwidth. Or that you can implement an efficient raytracer without acceleration structures. Don't act like you are the expert, because obviously you aren't. And don't go around patronizing people who are actually trying to help you, just because you didn't understand what they said. That's just pathetic.
 
Scali said:
Don't you need to calc gradients anyway? The edges would only add two extra gradients, which should be relatively cheap.
Gradients like for color and texture coordinates? Yes, I have to calculate those anyway, but they're not part of the rasterization process in the narrow sense; it's mostly independent of them. Calculating gradients for edge stepping is different. I use an integer DDA because it's exact and eliminates the need for prestepping per scanline. But it still requires conversions to fixed-point and back, six divisions, and the DDA steps. For small triangles, that's a lot of setup.
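To illustrate where that setup cost comes from, a rough sketch of per-edge DDA setup in 16.16 fixed point (illustrative names, not the actual renderer code):

    struct EdgeDDA
    {
        int x;      // current x in 16.16 fixed point
        int xStep;  // dx/dy in 16.16 fixed point
    };

    // Assumes yBottom != yTop (degenerate/horizontal edges are handled elsewhere).
    inline EdgeDDA setupEdge(float xTop, float yTop, float xBottom, float yBottom)
    {
        EdgeDDA edge;
        float dxdy = (xBottom - xTop) / (yBottom - yTop);  // one division per edge
        edge.xStep = (int)(dxdy * 65536.0f);               // float -> 16.16 fixed point
        edge.x     = (int)(xTop * 65536.0f);               // starting x, prestepped once
        return edge;
    }
    // Per scanline the edge is then advanced with just: edge.x += edge.xStep;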

Using the half-space functions avoids this. Pixel coverage for a block is calculated in one fast go. No divisions, no jumps, no float to integer conversion, no prestepping.
You need to have code that handles partially covered quads anyway.
Not really. The coverage masks per quad are literally masks. The pixel pipeline always shades four pixels completely in parallel. The MMX maskmovq instruction then very conveniently writes only the pixels that are in the mask.
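As a rough illustration of such a masked write (assuming 32-bit pixels, two per quad row; illustrative names, not the actual code):

    #include <mmintrin.h>   // MMX
    #include <xmmintrin.h>  // _mm_maskmove_si64 (maskmovq)

    // Write a 2x2 quad of 32-bit pixels through a per-pixel coverage mask.
    // 'mask*' holds 0xFFFFFFFF for covered pixels and 0 for uncovered ones.
    inline void writeQuadMasked(char *dst, int pitch,
                                __m64 colorTop, __m64 colorBottom,
                                __m64 maskTop, __m64 maskBottom)
    {
        _mm_maskmove_si64(colorTop,    maskTop,    dst);          // top row: 2 pixels
        _mm_maskmove_si64(colorBottom, maskBottom, dst + pitch);  // bottom row: 2 pixels
        _mm_empty();  // leave the MMX state clean for any following x87 code
    }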
But you said there is more per-pixel overhead.
Yes, a little, for partially covered blocks. Non-covered or completely covered blocks are close to free. That's what this new algorithm is all about.
Obviously the two scanlines correspond to the 2x2 blocks you were interested in. And obviously you could do larger blocks by either doing more scanlines at a time, or by doing the upper and lower scanline in the larger block first (if both are inside, everything between them has to be inside as well, per trapezoid).
Yes, that's entirely true of course. You'd still be tracing edges though, "doing more scanlines at a time". The coverage test using half-space functions doesn't require that. And you still have that setup overhead. The only advantage would be to skip the 'inside' quicker, but with the half-space functions that's nearly for free, without tracing, without much setup, and without checking for special cases.
Since ehart mentioned something very similar, I didn't care. But then you start ranting about how I didn't contribute anything to the thread, which obviously I did.
Between all the noise, yes, you contributed something. But it was the most straightforward algorithm, with few interesting properties, and it wasn't what ehart suggested. And seriously, it's not a bad thing to try and fail, but please admit it when ideas with more potential are being presented.
You obviously lack the skill and comprehension to implement it in an efficient and elegant way. But that has been my point in this thread the whole time.
Nobody can implement this efficiently and elegantly. It's just not suited for it. There are so many special cases that it's always going to be messy if you want to keep it reasonably efficient, or it's going to be slow if you want to keep it clean.

Just the fact that you say this makes it clear that you lack the skill and comprehension to know when you have to look for other algorithms. I started looking for new algorithms when I started this thread...
Not at all, but the idea I presented was the same as the one ehart presented; the main difference is that the way ehart presented it somehow made you think about implementing it with the half-space algorithm (of which I am still not convinced anyway). The idea is still the same though: snap to a larger grid and work out coverage from there, because the coverage is always a single continuous block.
Scali, I knew the direction I didn't want to go in. What you described I had already considered. So although ehart's posts weren't very elaborate, they did spark the ideas of determining coverage of whole blocks using the half-space functions, and of computing masks for partially covered blocks in one go. Your method was and still is unrelated to quickly determining block coverage using the half-space functions.
So don't go around saying that I didn't contribute or that my idea wasn't helpful while ehart's was.
I can't change that. That's the way things went and it hardly could have been different. Just accept it.
Then again, you think a lot of things. Like that the average P4 2.66 GHz has ~20 GB/s bandwidth. Or that you can implement an efficient raytracer without acceleration structures.
I never did think that.
Don't act like you are the expert, because obviously you aren't.
Now where would I "act" like "the expert"? I weigh technical arguments, I run tests, I experiment, I endure and I succeed. You stay stuck at arguing.

Or were you implying that you are "the expert"? So you've tried all of this before? I didn't think so.
And don't go around patronizing people who are actually trying to help you, just because you didn't understand what they said. That's just pathetic.
With "people" I assume you mean just you? Because ehart seemed to be really interested to hear about my advancements. Well if this is the way -you- try to help me, then I seriously suggest trying less hard. Really, if you feel patronized and think I didn't understand what you said, then prove me wrong. And I do mean prove. But no, the best thing for you to do is to just try less hard.
 
You still think you're right, don't you?
God, you disgust me.

I would have more to add, but I am not going to bother. You wouldn't understand anyway, as you've proven before.

Then again, you think a lot of things. Like that the average P4 2.66 GHz has ~20 GB/s bandwidth. Or that you can implement an efficient raytracer without acceleration structures.

I never did think that.

Proof:

With SSE I can do 4 z-tests in a little over 4 clock cycles. On a 2.66 GHz CPU that's 2.66 Gpixels/s. Even though software rendering obviously still has a lot of extra things to compute, this particular stage is ten times faster than hardware rendering, raw power.

Right, and memory bandwidth wouldn't have anything to do with that, I suppose? If we assume a 24- or 32-bit z-buffer, we need 4 bytes per pixel; also, you will need to read and write each pixel in many cases. That means we would need 2.66*4*2 = ~20 GB/s of memory bandwidth.
You will not get near this kind of bandwidth in practice. 1/10th of that bandwidth would be more realistic, which brings us back to the level of the Intel Extreme II. So, perhaps you can beat it with z-only (if rendering a lot front-to-back), but the Intel Extreme II can render colour as well, at very little extra cost, which you cannot. The Intel Extreme II can also do z-buffering while the CPU does the geometry setup for the next triangles, which you cannot. So with a reasonably high polycount, the Intel Extreme II will probably still be lots faster, because it processes in parallel.
Besides, I would want to see you beat it before I believe it. In no way is it 10 times faster than an Intel Extreme II in practice, I am sure.
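Spelling out that arithmetic (same assumptions: 4 z-tests per ~4 cycles at 2.66 GHz, a 4-byte z value, and one read plus one write per pixel):

    constexpr double gigapixelsPerSecond  = 2.66;  // 4 z-tests per ~4 cycles at 2.66 GHz
    constexpr double bytesPerZValue       = 4.0;   // 24- or 32-bit z-buffer
    constexpr double accessesPerPixel     = 2.0;   // one read + one write
    constexpr double requiredBandwidthGBs = gigapixelsPerSecond * bytesPerZValue * accessesPerPixel;
    // ~21 GB/s, i.e. the "~20 GB/s" figure above.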

And:

An extreme example: Imagine a raytracer using the D3D9 interface.

This isn't "extreme" to imagine. After all, you just pass geometry, material and light information. That's all you need to start raytracing. If I'm not mistaken there's a project that uses the OpenGL interface for ray-tracing.

If you want to raytrace on raw triangle meshes without pregenerated BSP-trees or the like, you are more naive than I thought.
This demonstrates my point as painfully as it can be: You get a raw set of triangles... Two choices:
1) Generate the BSP-tree on-the-fly, which could take QUITE a while.
2) Skip the BSP-tree and bruteforce the ray against all triangles in the mesh, which could take QUITE a while.

While you could implement the raytracer with O(log N) complexity, you now have to implement it with O(N) complexity because the interface doesn't give enough information.
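To make the complexity argument concrete, a deliberately naive sketch of option 2 (illustrative names; IntersectFn stands for any standard ray/triangle test such as Moller-Trumbore):

    #include <cfloat>
    #include <vector>

    struct Ray      { float origin[3]; float direction[3]; };
    struct Triangle { float v0[3], v1[3], v2[3]; };
    struct Hit      { float t = FLT_MAX; const Triangle *triangle = nullptr; };

    // Option 2 above: no spatial structure, so every ray is tested against every triangle.
    template <typename IntersectFn>
    Hit traceBruteForce(const Ray &ray, const std::vector<Triangle> &mesh, IntersectFn intersect)
    {
        Hit nearest;
        for (const Triangle &tri : mesh)  // O(N): touches every triangle for every ray
        {
            float t;
            if (intersect(ray, tri, t) && t < nearest.t)
            {
                nearest.t = t;
                nearest.triangle = &tri;
            }
        }
        return nearest;
    }
    // A pregenerated BSP/kd-tree lets the same query visit only the nodes the ray actually
    // passes through, which is what brings the per-ray cost down to roughly O(log N).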

Of course, you could argue that as long as the vertex data is static, you can re-use the same tree, so you only generate it when the data changes... But that doesn't help when you are using vertex shaders, does it?

And how would you make use of the advantages of raytracing, like shadows or reflections?

Also, if you mean OpenRT... that's a project with an OpenGL-like interface. It's not the same as OpenGL, but it bears a resemblance to it.

So why don't you just admit that you are in over your head, just shut up when people who DO know what they are talking about speak, and just deliver that renderer that can beat an Intel Extreme, so everybody, including you, can see that you were blatantly wrong in your assumptions, and eat some humble pie.
 
I was hoping this thread was dead, but since it keeps going, I'd like to add a call for civility. The ongoing discussion really has nothing to do with the topic, and I would hate to see admins locking threads or warning people over a tit-for-tat discussion like this.

For the record, Scali, neither Nick nor I seemed to understand the suggestion you made as being the same as the suggestion I made. I understood your suggestion to be one in which you changed the inner loop to compute coverage on the odd row, then on the even row, then continue processing. Nick and I both dismissed your suggestion because this would have been too expensive. I am not trying to gang up on you here, just to let you know that a lot of what seems to be upsetting you is related to most people in the thread misunderstanding your post.

Finally, even though I work for a company that makes fast programmable raster HW, I still see uses for writing SW rasterizers. There are good reasons/uses, not the least of which is learning new/different ways of implementing pieces of the process. When the Olano paper on the half-spaces was published, fast raster HW already existed. Nick obviously feels the same way I do; he is just working for the wrong company right now. ;)

-Evan
 
I am not trying to gang up on you here, just to let you know that a lot of what seems to be upsetting you is related to most people in the thread misunderstanding your post.

To be exact, what upsets me is that instead of asking what I mean exactly, or asking me to explain in more detail how something could be implemented and gain performance, or anything, it is dismissed without understanding, and then someone has the arrogance to say that I didn't try to contribute anything.

Finally, even though I work for a company that makes fast programmable raster HW, I still see uses for writing SW rasterizers. There are good reasons/uses, not the least of which is learning new/different ways of implementing pieces of the process.

This is again a misunderstanding. I am not against sw rendering... In fact, the ultimate renderer, RenderMan, is still software. I just think that Nick's approach is suboptimal for a sw renderer, and a different type of renderer would do the idea of dynamic shader compilation more justice.
For example, RenderMan is neither a triangle rasterizer nor does it use the DirectX 9 API.
I also think that Nick often thinks too lightly about situations, as demonstrated by the two cases above: the P4 bandwidth problem and a raytracer with an immediate-mode rendering interface. It would do him good to step away from his project and look around at other types of rendering, most of which are still done in software, with success.
And of course study processor and chipset architectures and their limitations, and study general algorithms and data structures, so he can better understand other people's suggestions and better predict the performance or efficiency of an algorithm.
 
I think if everyone looked at the reasons why not to do something, rather than why to do it, nothing interesting or important would ever get done. I can't offer any technical help with your sw renderer as I kinda stopped working with graphics back when I too coded in mode 13h and did a rotating cube with Gouraud shading (and a very primitive transparency using a get-pixel routine, hah!), but maybe some positive points?

I really believe that if you had the renderer working with SM4 before the hardware comes out, it would be infinitely useful. I know someone who could even use a good SM3 and 2.0b renderer now, just to test shaders and features.

Also, I sold my 9800 Pro recently to upgrade to a new card and was left for a couple of days without a graphics card (I don't call an ATI Rage II a graphics card ;p), and it was a breath of fresh air when I realised that UT2004 was still playable in software, even though it was kinda crappy :)
 
Thanks Evan and MrBored! I much appreciate your encouragement.

Scali, let's keep it at this, ok? This thread stops here. It's not helping anyone to continue the discussion. Demos in a couple of months, promised, so you don't have to wait long, and then you will be on topic, if you don't get personal again...
 
Nick said:
Thanks Evan and MrBored! I much appreciate your encouragement.

Scali, let's keep it at this, ok? This thread stops here. It's not helping anyone to continue the discussion. Demos in a couple of months, promised, so you don't have to wait long, and then you will be on topic, if you don't get personal again...

I don't think I got personal; I just discussed technical disadvantages of your approach. Perhaps not entirely on-topic, but you weren't exactly putting in effort to be on-topic either.
I think it's more a case of you taking criticism personally.
And then you got personal by saying I didn't even contribute to this thread, while I did. Both on-topic and off-topic, I have added technical background information and such, which I consider a contribution, whether you value it or not.
 
Scali said:
I don't think I got personal
I was hoping that you and Nick would have taken my hint and simply stopped, but the above comment made me gape in disbelief.

Scali said:
God, you disgust me.
That was a comment unbefitting any form of technical argument and clearly shows that you were taking this personally.

May I request that both you and Nick stop the tit-for-tat bickering, please?
 
That was a comment unbefitting any form of technical argument and clearly shows that you were taking this personally.

Yes, but that was after the initial technical arguments, when Nick started talking about how I didn't even contribute and all that, which I obviously did take personally. And when someone makes personal comments like that, I find it only natural that any responses would also be of a personal nature.
 
Scali said:
Nick said:
That's true, but the entire cost of the classical algorithm is in its setup and tracing the edges.

Don't you need to calc gradients anyway? The edges would only add two extra gradients, which should be relatively cheap.

Nick said:
Gradients like for color and texture coordinates? Yes, I have to calculate those anyway, but they're not part of the rasterization process in the narrow sense; it's mostly independent of them. Calculating gradients for edge stepping is different. I use an integer DDA because it's exact and eliminates the need for prestepping per scanline. But it still requires conversions to fixed-point and back, six divisions, and the DDA steps. For small triangles, that's a lot of setup.

Nick, what's with the DDA? Shouldn't the half-space functions allow you not to care about edge-stepping precision anyway? You just have to roughly follow the edge in x/y coords, and that's for efficiency purposes only; the half-space functions and the triangle gradients would take care of the rest. Or did I misunderstand something?
 
darkblu said:
Nick, what's with the DDA? Shouldn't the half-space functions allow you not to care about edge-stepping precision anyway? You just have to roughly follow the edge in x/y coords, and that's for efficiency purposes only; the half-space functions and the triangle gradients would take care of the rest. Or did I misunderstand something?
Hi darkblu!

I think you understood that part perfectly, but the relation with performance is a bit more complicated. There's a fine balance between setup cost and per-pixel cost. And in the case of prestepping, it's often even a per-scanline cost. Allow me to summarize:

When tracing edges, there are two main options, and a third for processing quads. Either we trace edges exactly, using a DDA, and then we only have to do prestepping once (at the start). Or we compute the starting value of the interpolants once every scanline. I found the former to be slightly faster. The third option is just an extension of this: combine two scanlines and process quads. Per two scanlines we have more work, to handle all the special cases at the edges so we get the coverage masks for the quads (no less than 16 different situations). Once inside the polygon though, quads are skipped quickly. So it only benefits big polygons.

Using the half-space functions is quite different. There are again three options. The first is to scan rectangles. This can be relatively fast and only requires prestepping once, but for whole bounding rectangles it's wasteful. The improvement using 8x8 blocks presented here solves this, at the very small cost of determining coverage and prestepping per block. It re-uses the half-space functions, so there's no real extra setup. The last option (which I believe is what you mention) would be to 'roughly' trace the polygon edges, while making sure not to miss any pixels of course...

Unfortunately, this requires setup to trace edges, the setup for the half-space functions, prestepping per scanline (unless using a 'rough' DDA), and actually tracing the edges and evaluating the half-space functions. So that's almost the sum of all the costs together. The only thing it does avoid is evaluating the half-space functions outside the polygon. With even more setup, and handling many special cases, it can also avoid evaluating the half-space functions inside the polygon. So for small and average-sized polygons, that's more setup than the avoided half-space evaluations are worth. Only for big polygons does the extra work start paying off, and even then it's hardly faster than the classical algorithm.

Using 8x8 blocks ended up being a really nice compromise, and the implementation is very elegant. Small polygons require little setup, and evaluating the half-space functions in an unrolled loop is really fast. Bigger polygons benefit from quickly rejecting or accepting blocks totally outside or inside the polygon. The new algorithm also has other nice side-effects. Texture sampling locality is improved, so the cache is used more efficiently. And since I generate rasterization functions dynamically, it's good that the implementation is compact. It's also scalable: I'm planning on making the block size a configurable parameter, so every type of application can choose the optimal parameters.
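To illustrate the accept/reject part, a simplified scalar sketch (illustrative names, not the actual implementation): because a half-space function E(x,y) = A*x + B*y + C is linear, its sign over an axis-aligned block can be decided from the block's corners alone.

    // Classify an aligned block of 'size' pixels against one edge by evaluating the
    // linear function E(x,y) = A*x + B*y + C at its four corners:
    //  - all corners >= 0 -> the whole block is inside this half-space
    //  - all corners <  0 -> the whole block is outside and can be rejected immediately
    //  - otherwise        -> the block is partially covered and needs per-quad masks
    enum BlockCoverage { Outside, Inside, Partial };

    inline BlockCoverage classifyBlock(float A, float B, float C, int x, int y, int size)
    {
        float e00 = A * x          + B * y          + C;
        float e10 = A * (x + size) + B * y          + C;
        float e01 = A * x          + B * (y + size) + C;
        float e11 = A * (x + size) + B * (y + size) + C;

        bool allIn  = e00 >= 0 && e10 >= 0 && e01 >= 0 && e11 >= 0;
        bool allOut = e00 <  0 && e10 <  0 && e01 <  0 && e11 <  0;

        if (allOut) return Outside;
        if (allIn)  return Inside;
        return Partial;
    }
    // For the full triangle: a block is rejected if it is Outside any edge, accepted if it
    // is Inside all three, and only the Partial blocks fall through to per-quad masks.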

Anyway, I tested and/or experimented with each of these six options. If I completely missed an option or you know a way to make another one efficient to implement, please let me know. For now, I'm really excited with the performance and additional benefits of the new algorithm. It simply 'makes sense', if you know what I mean.

Cheers,

Nick
 
I have to second Nick here on the benefit of the simplicity of the code. I tried it myself this weekend, and the simplicity is just great. I really like it from the standpoint of things like multisample coverage.

I am wondering about a statement by Nick earlier in the thread though. Have you considered tuning any texture/framebuffer tiling to more effectively match the new raster pattern you get from this algorithm?

-Evan
 
Nick said:
Anyway, I tested and/or experimented with each of these six options. If I completely missed an option or you know a way to make another one efficient to implement, please let me know. For now, I'm really excited with the performance and additional benefits of the new algorithm. It simply 'makes sense', if you know what I mean.

Thanks for the algo 'unrolment', Nick. I believe I see now why you went for the 'macro-ization' scheme.
And yes - this algo does make sense indeed; it's elegant and has potential. I too have pondered along these lines of thought but never got to an implementation (I have this thing for scanlines and prestepping : ). Kudos for your breakthrough work; I'd be more than curious to check out your code once I have more time on my hands (as of now I have almost frozen all home activities, and I'm yet to finish those cube maps for thurp... doh, even reading up on B3D is beyond my limits now)

Never mind - yay to sw rasterizers! 8)

PS: Thinking of a better 'coverage determination' scheme for your algo, thoughts of barycentric coords circle in my head for some reason... I'll go collect them some time later on.
 
ehart said:
I have to second Nick here on the benefit of the simplicity of the code. I tried it myself this weekend, and the simplicity is just great. I really like it from the standpoint of things like multisample coverage.
That's another advantage. I hadn't really thought about that. Some assembly tricks might make random sample positions a possibility. 8)
I am wondering about a statement by Nick earlier in the thread though. Have you considered tuning any texture/framebuffer tiling to more effectively match the new raster pattern you get from this algorithm?
I'm not sure. Last time I tried it, the average performance remained the same. It only improved things when the texture's primary axis was orthogonal to the screen's primary axis. But de-swizzling the addresses was not for free. It might be different now on the current generation of processors, with high frequencies but also high penalties for cache misses. Then again hardware prefetching is really advanced. Anyway, it's worth trying...
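For context, a generic sketch of what address swizzling means here (Morton / Z-order tiling; not necessarily the exact layout that was tried): the bits of x and y are interleaved so that texels that are close in 2D also end up close in memory.

    #include <cstdint>

    // Spread the lower 16 bits of v so that they occupy the even bit positions.
    inline uint32_t interleaveBits(uint32_t v)
    {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    // Texel index in Morton order; multiply by bytes per texel for a byte offset.
    inline uint32_t swizzledOffset(uint32_t x, uint32_t y)
    {
        return interleaveBits(x) | (interleaveBits(y) << 1);
    }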

Thanks!
 