Fast software renderer

Voxilla

Here is a rather fast software rendering engine demo called FQuake.
It renders levels from the original Quake game.

http://users.skynet.be/fquake/


This demo engine is highly optimized, making use of multiple threads and SSE code.
Perspective-correct, bilinear texture mapping runs at 650 Mpix/s on a 3.2 GHz Core 2 Quad.

The engine makes use of an algorithm that ensures zero overdraw.
At 2560x1600 resolution the engine runs at between 120 and 160 frames per second, depending only slightly on scene complexity; this corresponds to a texture and pixel fill rate of between 500 and 650 Mpix/s.

The mapped texture consists of two layers: a material map and a light map. Normally you would do this with multitexturing, but the engine does it with single texturing. To make this possible, an LRU texture cache is maintained, with composited material/light texture maps built on the fly as needed.
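
To illustrate the idea, here is a minimal sketch of such a cache, assuming per-surface keys and same-sized material/light maps; all names are hypothetical and this is not the engine's actual code.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

struct Texture {
    int width = 0, height = 0;
    std::vector<uint32_t> texels;  // 32-bit RGBA
};

// Modulate each material texel by the light map (assumed same-sized here).
static Texture compose(const Texture& mat, const Texture& light) {
    Texture out{mat.width, mat.height,
                std::vector<uint32_t>(mat.texels.size())};
    for (size_t i = 0; i < mat.texels.size(); ++i) {
        uint32_t m = mat.texels[i], l = light.texels[i], r = 0xFF000000u;
        for (int c = 0; c < 24; c += 8)  // R, G, B channels
            r |= (((m >> c) & 0xFF) * ((l >> c) & 0xFF) / 255) << c;
        out.texels[i] = r;
    }
    return out;
}

class TextureCache {
public:
    explicit TextureCache(size_t capacity) : capacity_(capacity) {}

    // Return the composited texture for a surface, building it on a miss
    // and evicting the least recently used entry when the cache is full.
    const Texture& get(int surfaceId, const Texture& mat, const Texture& light) {
        auto it = map_.find(surfaceId);
        if (it != map_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // mark as fresh
            return it->second->second;
        }
        if (map_.size() >= capacity_) {
            map_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(surfaceId, compose(mat, light));
        map_[surfaceId] = lru_.begin();
        return lru_.front().second;
    }

private:
    using Entry = std::pair<int, Texture>;
    size_t capacity_;
    std::list<Entry> lru_;
    std::unordered_map<int, std::list<Entry>::iterator> map_;
};
```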

To make optimal use of all CPU cores, the screen is split up according to the number of cores. The split positions are continuously moved to adapt to the scene complexity, so that all cores stay maximally loaded.
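
A sketch of one possible balancing scheme is below (hypothetical, not necessarily the engine's actual heuristic): each frame, nudge every band boundary a few lines toward the slower neighbor, based on the measured per-band render times.

```cpp
#include <cmath>
#include <vector>

struct Bands {
    std::vector<int> splitY;  // numCores-1 boundaries between horizontal bands

    // times[i] = milliseconds core i spent rendering its band last frame.
    void rebalance(const std::vector<double>& times, int screenHeight) {
        for (size_t i = 0; i < splitY.size(); ++i) {
            double diff = times[i] - times[i + 1];
            if (std::fabs(diff) < 0.1) continue;  // close enough to balanced
            // The band above boundary i was slower: move the boundary up a
            // few lines to shrink it (and vice versa). Small steps damp
            // oscillation from frame to frame.
            splitY[i] += diff > 0.0 ? -4 : 4;
            int lo = (i == 0) ? 1 : splitY[i - 1] + 1;
            int hi = (i + 1 == splitY.size()) ? screenHeight - 1
                                              : splitY[i + 1] - 1;
            if (splitY[i] < lo) splitY[i] = lo;
            if (splitY[i] > hi) splitY[i] = hi;
        }
    }
};
```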

You can fly through the scene by clicking the left and right mouse buttons for forward and backward movement. Holding the middle button together with left/right gives quadruple speed.
You can also switch between bilinear and point sampling by pressing the space bar.
 
There was an issue preventing statistics from showing up in the title bar; this is now fixed.

I may also make some DirectX versions to see how much faster a GPU would be or to compare with DirectX software renderers such as SwiftShader or WARP.
 
Is that about all you could do with that level of performance and complexity on an x86 CPU? It seems rather primitive, though I haven't thought about software renderers since, well, Unreal 1 days really.
 
Is that about all you could do with that level of performance and complexity on an x86 CPU? It seems rather primitive, though I haven't thought about software renderers since, well, Unreal 1 days really.

Actually I first made this renderer over ten years ago; this version now runs 200 times faster. So it is ancient technology revamped with modern techniques.
 
Is that about all you could do with that level of performance and complexity on an x86 CPU? It seems rather primitive, though I haven't thought about software renderers since, well, Unreal 1 days really.

That was a needlessly harsh comment. Can you do any better?
At least try to be constructive when giving negative feedback.

CC
 
I guess I'm missing something, but I thought Quake was originally software rendered already? Cool to see this kind of thing though. It's really interesting to see how far CPUs have come over the years in relation to rendering graphics.
 
Have you checked how fast the Quake 1 rasterizer works on the same test machine? This would be an interesting comparison ...
 
I guess I'm missing something, but I thought Quake was originally software rendered already?
Yes, but it didn't do bilinear filtering. So Jan's renderer looks more like what Quake looked like with hardware rendering in its day, but at far higher resolution and framerate. :)
Anteru said:
Have you checked how fast the Quake 1 rasterizer works on the same test machine? This would be an interesting comparison ...
It only lets me select 1280x1024 resolution and point filtering, but it caps at 60 FPS (vsync). Anyway, the original Quake spends only a few clock cycles on per-pixel work, so it's constantly bandwidth limited. You can do bilinear filtering and such 'for free' before becoming compute limited, and by using SSE and multi-core you get even more possibilities.
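
To give an idea of the per-pixel work involved, here is a minimal SSE2 sketch of a bilinear fetch from a 32-bit RGBA texture, with 8-bit fixed-point weights and wrap addressing. It is illustrative only; a real inner loop (FQuake's or SwiftShader's) would mask power-of-two sizes and process several pixels at once.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// u, v are texture coordinates in 24.8 fixed point; pitch is in texels.
uint32_t bilinear(const uint32_t* tex, int pitch, int w, int h, int u, int v) {
    int x0 = (u >> 8) % w, y0 = (v >> 8) % h;   // wrap addressing (a real
    int x1 = (x0 + 1) % w, y1 = (y0 + 1) % h;   // loop would mask instead)
    int fx = u & 0xFF, fy = v & 0xFF;           // fractional weights, 0..255

    // Load the 2x2 footprint and widen each byte to a 16-bit lane.
    __m128i z = _mm_setzero_si128();
    __m128i t00 = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)tex[y0 * pitch + x0]), z);
    __m128i t10 = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)tex[y0 * pitch + x1]), z);
    __m128i t01 = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)tex[y1 * pitch + x0]), z);
    __m128i t11 = _mm_unpacklo_epi8(_mm_cvtsi32_si128((int)tex[y1 * pitch + x1]), z);

    // Horizontal lerp per row: (a*(256-fx) + b*fx) >> 8 stays within 16 bits.
    __m128i wx1 = _mm_set1_epi16((short)fx), wx0 = _mm_set1_epi16((short)(256 - fx));
    __m128i r0 = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(t00, wx0),
                                              _mm_mullo_epi16(t10, wx1)), 8);
    __m128i r1 = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(t01, wx0),
                                              _mm_mullo_epi16(t11, wx1)), 8);

    // Vertical lerp, then pack the 16-bit channels back to bytes.
    __m128i wy1 = _mm_set1_epi16((short)fy), wy0 = _mm_set1_epi16((short)(256 - fy));
    __m128i r = _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(r0, wy0),
                                             _mm_mullo_epi16(r1, wy1)), 8);
    return (uint32_t)_mm_cvtsi128_si32(_mm_packus_epi16(r, r));
}
```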
 
It only lets me select 1280x1024 resolution and point filtering, but it caps at 60 FPS (vsync).


Hmm, on my dual dual-core Xeon at 3.0 GHz, I get about 186 fps @ 1280x1024 with FQuake with point filtering -- and 120 with bilinear. Which is by no means bad, but I thought that a rasterizer on a CPU could be much faster than that. Especially after I played around a bit with DX10 WARP, I'm actually pretty impressed how fast a SW rasterizer can be.

Is this due to being limited by bandwidth? Thread synchronization? Other issues? If so, then maybe a different test scene would be interesting, as it's hard to judge how fast this really is.

@Nick: Do you have any peak numbers for SwiftShader2 -- that is, how fast you can render, say, two textured triangles spanning the whole screen? Just to get an idea of where the bottlenecks are.
 
Have you checked how fast the Quake 1 rasterizer works on the same test machine? This would be an interesting comparison ...

I just checked. Nick, by typing timerefresh it can run at full speed. On my system, the original Quake runs at about 200 frames per second at 1280x1024, at a spot in the middle of the first hall.
At the same place, with my renderer and point sampling, I'm getting 250 frames per second at 2560x1600. So I'm about 4 times faster. In fact I seem to be limited by the 4 GB/s PCIe bus (2560x1600 pixels x 4 bytes x 250 fps ≈ 4.1 GB/s). Another difference is that the original does everything with 8 bits per pixel/texel, where I'm doing 32-bit RGBA.
 
Hmm, on my dual dual-core Xeon at 3.0 GHz, I get about 186 fps @ 1280x1024 with FQuake with point filtering -- and 120 with bilinear. Which is by no means bad, but I thought that a rasterizer on a CPU could be much faster than that. Especially after I played around a bit with DX10 WARP, I'm actually pretty impressed how fast a SW rasterizer can be.

Is this due to being limited by bandwidth? Thread synchronization? Other issues? If so, then maybe a different test scene would be interesting, as it's hard to judge how fast this really is.

@Nick: Do you have any peak numbers for SwiftShader2 -- that is, how fast you can render, say, two textured triangles spanning the whole screen? Just to get an idea of where the bottlenecks are.

Are these Xeons P4 or Core based? The Core ones are much faster. With two screens it is slow for the moment; you have to disable one of them. If you have Vista without Aero, it can also be much slower. To get optimal speed you also need PCIe gen 2.

Give me some time and I'll make a DX10 version, so we can see how WARP would do.
 
@Nick: Do you have any peak numbers for SwiftShader2 -- that is, how fast you can render, say, two textured triangles spanning the whole screen? Just to get an idea of where the bottlenecks are.
SwiftShader's bottlenecks are a bit different. To start with, it renders pixel quads to support mipmapping of indirect texturing, so a fair bit of time is spent transposing data. It's mostly compute limited.

Also, texture sampling/filtering is no longer the biggest bottleneck in modern games. Switching to point filtering typically only wins you 5-10%. So I actually use game benchmarks much more often than synthetic benchmarks to find the real bottlenecks...
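
To illustrate why quad rendering matters for mipmapping (a simplified sketch, not SwiftShader's actual code): with a 2x2 pixel quad you can take finite differences of the texel-space coordinates to estimate the pixel footprint and pick a mip level.

```cpp
#include <algorithm>
#include <cmath>

// u[4], v[4]: texel-space coordinates of a 2x2 pixel quad laid out as
//   0 1
//   2 3
float mipLevel(const float u[4], const float v[4], int maxLevel) {
    float dudx = u[1] - u[0], dvdx = v[1] - v[0];  // right minus left pixel
    float dudy = u[2] - u[0], dvdy = v[2] - v[0];  // bottom minus top pixel
    // rho: how many texels one pixel step covers, in the worst direction.
    float rho = std::sqrt(std::max(dudx * dudx + dvdx * dvdx,
                                   dudy * dudy + dvdy * dvdy));
    float lod = std::log2(std::max(rho, 1.0f));    // clamp at the base level
    return std::min(lod, (float)maxLevel);
}
```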
 
Are these Xeons P4 or Core based? The Core ones are much faster. With two screens it is slow for the moment; you have to disable one of them. If you have Vista without Aero, it can also be much slower. To get optimal speed you also need PCIe gen 2.

These are Core 2 based Xeons, and I had my second screen disabled (otherwise it wouldn't even start). The graphics card is a GTX 280, so it's surely PCIe gen 2? Running on XP; I will try on Vista.

Regarding 4 times faster: well, if your rasterizer is multithreaded and you have 4 cores, I would expect at least 4 times the performance of the Quake rasterizer. However, I thought that even on a single core we can do better these days using SSE and other optimizations -- surely the Quake 1 code does not take the best advantage of current-gen CPUs.
 
Yea, if you want to render Quake I as quickly as possible, you take an entirely different approach than when you want to render a modern game as quickly as possible.

So my question is: what is the goal of this engine? Do you just want to have a super-fast Quake engine, or do you want more modern shading as well?

Also, can you tell a bit more about the internals of the engine? You say it has zero overdraw, is this done with a spanbuffer approach (another choice which depends a lot on the type of scenes you want to render)?

And the screen split-up, how is this done? With just horizontal 'bands' for each core, or with interleaved scanlines, checkerboard patterns... etc?
Does this only apply to the rasterizing stage, or do you also do something smart with multiple cores during transform and lighting? Etc... :)
 
Funny that point filtering and bilinear look so similar. Crisp and a bit noisy both ways; that must be quite a low texture LOD.
I like it! It may be better than the "3dfx blur" I was expecting.

On a single core (Sempron @ ~2.3 GHz) I get roughly 30 to 45 megapixels/s with bilinear and 60 to 100 megapixels/s point filtered (1024x768 res.).

I would love your engine to run a Counter-Strike map (I did the quick, stupid job of copying and renaming cs_italy.bsp, but it doesn't work). I still long for an open source clone of the original Half-Life / Counter-Strike 1.5.
I can sense a niche for your engine: the Chinese Loongson 3 processor ;).
 
Yea, if you want to render Quake I as quickly as possible, you take an entirely different approach than when you want to render a modern game as quickly as possible.

With a GPU the approach would be the same. As the rasterizer is fixed function there is not much choice anyway. With CPU rendering there are plenty of possibilities.


So my question is: what is the goal of this engine? Do you just want to have a super-fast Quake engine, or do you want more modern shading as well?

The goal was just to update an age-old engine I wrote 12 years ago and see how far CPUs can be pushed nowadays.
I currently have no intention of turning it into something more. Applying it to more recent versions of Quake would be interesting though.


Also, can you tell a bit more about the internals of the engine? You say it has zero overdraw, is this done with a spanbuffer approach (another choice which depends a lot on the type of scenes you want to render)?

It uses a (parallelized) scanline algorithm, which has numerous advantages for CPU rendering.
For huge amounts of small triangles it has relatively higher overhead, but there might be ways to stretch it further than one would think.
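
For readers unfamiliar with the approach, here is a hypothetical sketch of one way a span buffer achieves zero overdraw: traverse the BSP front to back and, per scanline, draw each new polygon span only where the line is still uncovered. The engine's actual data structure is no doubt more tuned than this std::vector version.

```cpp
#include <algorithm>
#include <vector>

struct Gap { int x0, x1; };  // half-open [x0, x1) of still-empty pixels

struct SpanBuffer {
    std::vector<Gap> gaps;   // disjoint and sorted by x0

    void reset(int width) { gaps.assign(1, {0, width}); }

    // Clip the polygon span [x0, x1) against the remaining gaps; 'draw' is
    // called once per visible sub-span, so every pixel is shaded exactly once.
    template <class DrawFn>
    void insert(int x0, int x1, DrawFn draw) {
        std::vector<Gap> next;
        next.reserve(gaps.size() + 1);
        for (const Gap& g : gaps) {
            int a = std::max(g.x0, x0), b = std::min(g.x1, x1);
            if (a < b) {                 // overlap: draw it, keep the leftovers
                draw(a, b);
                if (g.x0 < a) next.push_back({g.x0, a});
                if (b < g.x1) next.push_back({b, g.x1});
            } else {
                next.push_back(g);
            }
        }
        gaps.swap(next);
    }
};
```

One such buffer would be kept per scanline and reset each frame; once its gap list is empty, the scanline is finished and further spans can be rejected early.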


And the screen split-up, how is this done? With just horizontal 'bands' for each core, or with interleaved scanlines, checkerboard patterns... etc?

The split-up uses horizontal bands. Other schemes like tiles are possible too, with little overhead. First I tried small, fixed-size interleaved bands. I thought this would work best, but it didn't. Having large, adaptively sized bands turned out to be faster. My take is that this works better for the CPU caches: this way, textures used by one core rarely end up in the cache of another core that doesn't need them.


Does this only apply to the rasterizing stage, or do you also do something smart with multiple cores during transform and lighting? Etc... :)

As the amount of geometry is fairly light, this is done single-threaded.
 
With a GPU the approach would be the same. As the rasterizer is fixed function there is not much choice anyway. With CPU rendering there are plenty of possibilities.

Well, GPUs have used various approaches over the years, from actual fixed-function single-textured rendering in the days of Quake 1 to today's unified shader architectures (which obviously can still run Quake 1).
For GPUs the tradeoff is just different.

The goal was just to update an age-old engine I wrote 12 years ago and see how far CPUs can be pushed nowadays.
I currently have no intention of turning it into something more. Applying it to more recent versions of Quake would be interesting though.

Quake II has a very similar engine, so that shouldn't be too much of a problem, I guess. I'm not sure what Quake III does exactly, and Quake IV is totally different of course; you'd get into Nick's territory there, with SwiftShader :)

I recall back at university when Ewald and I were playing around with Quake 1 rendering as well. One thing Ewald had done was dynamic RGB lighting, which was pretty cool actually. It would dynamically update the lightmaps by projecting the light source into them, and still use the Quake-style pre-cached light*texture rendering.

It uses a (parallelized) scanline algorithm, which has numerous advantages for CPU rendering.
For huge amounts of small triangles it has relatively higher overhead, but there might be ways to stretch it further than one would think.

Yea, ever seen the Amiga demo "When We Ride On Our Enemies" by Skarla?
It uses an engine very similar to the Quake 1 engine, but on Amiga. The Amiga had VERY low memory bandwidth, so the spanbuffer + BSP approach was an excellent solution there.

But these days, with far higher polycounts, a simple z-buffer may be better, because it has no extra per-triangle overhead. I'm not sure where the exact transition point would lie though.

As the amount of geometry is fairly light, this is done single-threaded.

Yea, it probably wouldn't make much of a difference for a Quake level.
I've been thinking about how best to multithread a software renderer, but I suppose the best efficiency would come from alternate frame rendering, like SLI/CrossFire.

Trying to divide up the workload on a level lower than a complete frame will require extra overhead, no matter how you slice it. And you'll also run into more complex load balancing issues.
 
Well, GPUs have used various approaches over the years, from actual fixed-function single-textured rendering in the days of Quake 1 to today's unified shader architectures (which obviously can still run Quake 1).
For GPUs the tradeoff is just different.

By 'rasterizer' I mean the thing that turns geometry into pixels; the thing that tells you which pixels to shade.
For the last decade I've been doing GPU rendering, so I know how they have evolved.
 
By 'rasterizer' I mean the thing that turns geometry into pixels; the thing that tells you which pixels to shade.

Well, I'm used to the term 'rasterizing' being used either for the rasterizing process itself, or for rasterizing plus the actual outputting of the pixels.
In the context of the quote you responded to, I naturally assumed you meant the latter, since I was clearly talking about rendering/shading/multitexturing.

Can you rephrase what you meant to say? I'm not quite sure what to make of it. Clearly hardware rasterizers are not the same as software rasterizers... and even then, rasterizers aren't something a programmer has control over... so what do you mean?
 
Yea, it probably wouldn't make much of a difference for a Quake level.
I've been thinking about how best to multithread a software renderer, but I suppose the best efficiency would come from alternate frame rendering, like SLI/CrossFire.

Trying to divide up the workload on a level lower than a complete frame will require extra overhead, no matter how you slice it. And you'll also run into more complex load balancing issues.

You can do it the 3dfx way (as long as the rendering is simple enough): give every other scanline to a different thread, or divide into bands of pixels as done on the VSA-100 (the number of pixels was configurable; it was maybe 8 by default).
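
A minimal sketch of that kind of interleaved split (hypothetical names, and assuming scanlines can be rendered independently): with bandHeight = 1 each thread takes every Nth scanline, 3dfx SLI style, while larger values give VSA-100-like bands.

```cpp
#include <thread>
#include <vector>

// Rasterize scanlines in interleaved bands: thread t takes every line y
// with (y / bandHeight) % numThreads == t. Only valid when scanlines can
// be rendered independently (no cross-scanline state).
void renderInterleaved(int height, int numThreads, int bandHeight,
                       void (*renderScanline)(int y)) {
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t)
        pool.emplace_back([=] {
            for (int y = 0; y < height; ++y)
                if ((y / bandHeight) % numThreads == t)
                    renderScanline(y);
        });
    for (auto& th : pool) th.join();
}
```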

The scaling was great, and a Voodoo5 could take on UT2003 with a good CPU. (It was maybe limited by triangle setup when there was much geometry.)

I wonder which techniques prevent that method from working with more modern rendering. (Shaders that want to be aware of the whole scene, or of a pixel in another band?)
 