Nick said:
Let me guess: the sub-pixel accuracy and fill convention stuff?
Yeah. Up to now I had only written a basic floating-point rasterizer and hadn't tested it much, because I was still writing and testing the infrastructure, so I hadn't noticed the 'missed pixels' problem. I'll probably switch entirely to your method. The only issue is that I'll have to tweak some of my code, since pure integer code might be slower than the current one on my hardware (dual 1 GHz G4). Thanks to the ability to issue floating-point vector operations alongside integer ones and permutes, I had a loop for the most basic situation (depth test on with any depth function, depth buffer and color buffer updates) with in/out polygon testing that rasterized two pixels every three cycles. I'm gonna miss it. That's a good trade anyway, since on all other platforms (older G4s, G3s, 970s and all x86 processors) pure integer math will be faster.
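(For illustration only: a minimal sketch of the kind of integer half-space rasterizer with a top-left fill convention and 28.4 fixed-point sub-pixel precision being discussed here. It follows the well-known half-space approach; the names, the 28.4 format, the per-pixel edge evaluation and the winding assumption are all illustrative, not anyone's actual code.)

```cpp
#include <algorithm>
#include <cstdint>

// 28.4 fixed-point vertex position (x_fixed = lround(x * 16.0f)).
struct Vertex { int32_t x, y; };

// Rasterize a triangle into a 32-bit color buffer. Assumes one consistent
// winding order (backface-culled, already clipped/guard-banded input);
// flip the sign tests for the opposite winding.
void RasterizeTriangle(const Vertex& v1, const Vertex& v2, const Vertex& v3,
                       uint32_t* colorBuffer, int pitch, uint32_t color)
{
    // Edge deltas in 28.4 fixed point.
    const int32_t dx12 = v1.x - v2.x, dy12 = v1.y - v2.y;
    const int32_t dx23 = v2.x - v3.x, dy23 = v2.y - v3.y;
    const int32_t dx31 = v3.x - v1.x, dy31 = v3.y - v1.y;

    // Bounding box, rounded up to whole pixels.
    const int minX = (std::min({v1.x, v2.x, v3.x}) + 0xF) >> 4;
    const int maxX = (std::max({v1.x, v2.x, v3.x}) + 0xF) >> 4;
    const int minY = (std::min({v1.y, v2.y, v3.y}) + 0xF) >> 4;
    const int maxY = (std::max({v1.y, v2.y, v3.y}) + 0xF) >> 4;

    // Half-edge constants, kept in 64 bits so the sub-pixel bits of the
    // products are preserved and influence coverage.
    int64_t c1 = int64_t(dy12) * v1.x - int64_t(dx12) * v1.y;
    int64_t c2 = int64_t(dy23) * v2.x - int64_t(dx23) * v2.y;
    int64_t c3 = int64_t(dy31) * v3.x - int64_t(dx31) * v3.y;

    // Top-left fill convention: pixels exactly on a left or top edge belong
    // to this triangle, pixels on a right or bottom edge do not, so shared
    // edges never produce missed or double-filled pixels.
    if (dy12 < 0 || (dy12 == 0 && dx12 > 0)) c1++;
    if (dy23 < 0 || (dy23 == 0 && dx23 > 0)) c2++;
    if (dy31 < 0 || (dy31 == 0 && dx31 > 0)) c3++;

    uint32_t* row = colorBuffer + minY * pitch;
    for (int y = minY; y < maxY; y++, row += pitch)
    {
        for (int x = minX; x < maxX; x++)
        {
            // Evaluate the three edge functions at pixel (x, y); a real
            // implementation would step these incrementally per pixel/row.
            const int64_t e1 = c1 + int64_t(dx12) * (y << 4) - int64_t(dy12) * (x << 4);
            const int64_t e2 = c2 + int64_t(dx23) * (y << 4) - int64_t(dy23) * (x << 4);
            const int64_t e3 = c3 + int64_t(dx31) * (y << 4) - int64_t(dy31) * (x << 4);

            if (e1 > 0 && e2 > 0 && e3 > 0)
                row[x] = color;
        }
    }
}
```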
Doesn't that become very complex when there are many polygons and render states?
No. Every polygon is binned with a header of 64 bytes, plus 12 bytes per 'varying' and 8 bytes per tile it falls into, which doesn't require much bandwidth unless you're throwing a huge number of polygons at the rasterizer, in which case you'll probably hit other limits first. The state is kept very small, and constant data is shared among all the polygons that use it rather than duplicated.
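(For illustration, a hypothetical record layout matching the stated budget of a 64-byte header, 12 bytes per varying and 8 bytes per covered tile. Only those sizes come from the discussion; every field name and its contents are guesses.)

```cpp
#include <cstdint>

// Hypothetical binned-polygon layout. Only the sizes (64 / 12 / 8 bytes)
// come from the discussion; the fields themselves are assumptions.
struct BinnedPolygonHeader      // 64-byte header per binned polygon
{
    uint32_t stateId;           // index into shared, non-duplicated state
    uint32_t varyingCount;      // number of PlaneEquation records that follow
    float    zPlane[3];         // depth plane equation (a, b, c)
    float    edge[3][3];        // three edge equations for rasterization
    uint32_t pad[2];            // padding up to 64 bytes
};
static_assert(sizeof(BinnedPolygonHeader) == 64, "header must be 64 bytes");

struct PlaneEquation            // 12 bytes, one per 'varying'
{
    float a, b, c;              // value = a*x + b*y + c across the polygon
};
static_assert(sizeof(PlaneEquation) == 12, "one varying is 12 bytes");

struct TileReference            // 8 bytes, one per tile the polygon overlaps
{
    uint32_t tileId;            // which tile bin the polygon was added to
    uint32_t polygonOffset;     // offset of the polygon record in the bin stream
};
static_assert(sizeof(TileReference) == 8, "one tile reference is 8 bytes");
```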
It also seems like a lot of work gets done multiple times: transformation, lighting, clipping, rasterization setup, etc. Only polygons that fall entirely within one tile don't get clipped into multiple parts, and at low resolution that seems rare.
I don't do transformation, as mine is a 'pure' rasterizer: I take already-projected polygons, and I'll plug the geometry pipeline into it later. As for clipping, it happens only once per polygon, during setup (which is also once per polygon). Tile binning takes something like 4 instructions per tile, which is very fast since most polygons fall into just 1 to 4 tiles. Uh, BTW, I forgot to mention that setup, rasterization and shading can be spread over any number of parallel threads. A very nice side effect of tile-based methods is that they are inherently parallel.
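(For illustration, a sketch of the usual bounding-box tile binning step: the polygon's screen-space bounding box is converted to tile coordinates once during setup, then a small reference is appended to each covered tile's bin. The tile size and the bin container are assumptions.)

```cpp
#include <cstdint>
#include <vector>

// Sketch of bounding-box tile binning. Tile size and the bin container are
// assumptions; the discussion only says most polygons land in 1 to 4 tiles
// and that the per-tile cost is a handful of instructions.
constexpr int kTileSize = 64;   // pixels per tile side (assumed)

struct TileGrid
{
    int tilesX;                               // tiles per row
    std::vector<std::vector<uint32_t>> bins;  // tilesX * tilesY bins of polygon indices

    // Bin one polygon given its screen-space bounding box, assumed to be
    // already clipped to the screen during setup.
    void Bin(uint32_t polygonIndex,
             float minX, float minY, float maxX, float maxY)
    {
        // Convert the bounding box to tile coordinates once.
        const int tx0 = static_cast<int>(minX) / kTileSize;
        const int ty0 = static_cast<int>(minY) / kTileSize;
        const int tx1 = static_cast<int>(maxX) / kTileSize;
        const int ty1 = static_cast<int>(maxY) / kTileSize;

        // Append the polygon to every covered tile; the inner body is just
        // an index computation and an append, so the per-tile cost stays tiny.
        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                bins[ty * tilesX + tx].push_back(polygonIndex);
    }
};
```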
I like the idea of an index buffer for the polygons! It immediately shows which polygons are visible, without rasterization and z-tests.
Maybe I didn't explain it very well: the index buffer is generated by the rasterization process (including depth testing). After all the polygons have been processed, I've got a buffer telling me which fragment belongs to which polygon, and I use that as the input for the shaders.
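(For illustration, a sketch of that two-pass flow: rasterization with depth testing writes only a depth and a polygon index per pixel; once all polygons are processed, a second pass shades each pixel from its polygon index. All structures and names here are assumptions.)

```cpp
#include <cstdint>
#include <vector>

// Per-tile buffers: a polygon index per pixel is filled in by the
// depth-tested rasterization pass and consumed by the shading pass.
constexpr uint32_t kNoPolygon = 0xFFFFFFFFu;

struct Tile
{
    int width, height;
    std::vector<float>    depth;      // per-pixel depth
    std::vector<uint32_t> polyIndex;  // per-pixel index of the visible polygon
};

// Pass 1: for each rasterized fragment, keep only depth and polygon index.
inline void WriteFragment(Tile& tile, int x, int y, float z, uint32_t polygon)
{
    const int i = y * tile.width + x;
    if (z < tile.depth[i])            // 'less' depth function assumed
    {
        tile.depth[i] = z;
        tile.polyIndex[i] = polygon;
    }
}

// Pass 2: after every polygon has been processed, shade each covered pixel
// exactly once, using its polygon index to fetch the right state/varyings.
template <typename ShadeFn>
void ShadeTile(const Tile& tile, uint32_t* colorOut, ShadeFn shade)
{
    for (int i = 0; i < tile.width * tile.height; i++)
        if (tile.polyIndex[i] != kNoPolygon)
            colorOut[i] = shade(tile.polyIndex[i], i);
}
```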
I considered this at one point too, but when I computed the memory requirements they were gigantic. When you do it in tiles it's probably very advantageous, though?
It is. It requires only a little extra bandwidth (and probably mostly from the caches, as the tile buffers hardly ever exceed 128 KB). Obviously per-pixel OIT is significantly slower than immediate mode or per-polygon OIT; naturally I'll optimize it later, as it's not a priority right now.
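(For a rough sense of scale, under assumed numbers that are not from the discussion: a 128×128 tile holding a 4-byte depth and a 4-byte polygon index per pixel works out to exactly 128 KB, small enough to stay largely cache-resident.)

```cpp
// Back-of-the-envelope tile working set. Tile size and per-pixel storage are
// assumptions chosen only to show how the quoted ~128 KB figure can arise.
constexpr int kTileWidth     = 128;
constexpr int kTileHeight    = 128;
constexpr int kBytesPerPixel = 4 + 4;  // depth + polygon index
static_assert(kTileWidth * kTileHeight * kBytesPerPixel == 128 * 1024,
              "128x128 pixels * 8 bytes = 128 KB per tile");
```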
I have some 'mixed feelings' about Kribi. It can't render a simple scene at a decent framerate, yet it can render a seriously complex scene at several frames per second. So it seems like he has a really powerful visibility determination algorithm, but one that's too 'heavy' for simple scenes.
I don't know much about its internals, though I've chatted with Eric from time to time (you can usually spot him on Ace's Hardware), but he seems very skilled.