Fast software renderer

You can do it the 3dfx way (as long as the rendering is simple enough): interleave the scanlines across the threads, or divide the screen into bands of pixels as done on the VSA-100 (the band width was configurable; it was maybe 8 pixels by default).
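
A minimal sketch of that interleaved split, assuming a scanline-based software rasterizer (renderScanline() is a hypothetical stand-in for the real inner loop):

#include <thread>
#include <vector>

void renderScanline(int y, int width);  // hypothetical: rasterize one line

// Thread t takes scanlines t, t+N, t+2N, ... so the work stays balanced,
// as long as no scanline depends on the result of another.
void renderFrame(int width, int height, int numThreads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t)
        workers.emplace_back([=] {
            for (int y = t; y < height; y += numThreads)
                renderScanline(y, width);
        });
    for (auto& w : workers) w.join();
}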

The scaling was great, and a Voodoo5 could take on UT2003 with a good CPU (it was maybe limited by triangle setup when there was a lot of geometry).

I wonder which techniques prevent that method from working with more modern rendering. (shaders that want to be aware of the whole scene or of a pixel in another band?)

It worked for Voodoo cards because you did all T&L on the CPU (which was still single-core at the time).
You could basically just send the same setup info to both cards, with one doing the even scanlines and the other doing the odd scanlines.
It didn't exploit any parallelism in the T&L part of the rendering process.
 
Can you rephrase what you meant to say? Because I'm not quite sure what to make of it. Clearly hardware rasterizers are not the same as software rasterizers... and even then, rasterizers aren't something that a programmer has control over... so what do you mean?

I just meant to say that with GPUs it would not be possible to do scanline rendering as you can do on a CPU, because GPUs render in rasterized triangle order.
 
I just meant to say that with GPUs it would not be possible to do scanline rendering as you can do on a CPU, because GPUs render in rasterized triangle order.

Well, PowerVR had slightly different ideas on rasterizing than most GPU makers :)
 
Well, PowerVR had slightly different ideas on rasterizing than most GPU makers :)

Per tile it still does triangle order rasterization, with deferred shading if I'm not mistaken.
The Nintendo DS is the only recent 'GPU' doing true scanline rendering as far as I know.
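
Roughly, the difference in traversal order looks like this (schematic C++ only; the span helpers are hypothetical stand-ins for a real rasterizer's internals):

#include <vector>

struct Triangle {};              // placeholder: vertices, attributes...
struct Span { int y, x0, x1; };  // a horizontal run of pixels

// Hypothetical stubs standing in for a real rasterizer's internals:
std::vector<Span> coverageOf(const Triangle&) { return {}; }
std::vector<Span> visibleSpansOnLine(const std::vector<Triangle>&, int) { return {}; }
void shadeSpan(const Span&) {}

// Triangle-order rasterization: outer loop over triangles, so the same
// pixel may be shaded several times per frame (overdraw).
void triangleOrder(const std::vector<Triangle>& tris) {
    for (const Triangle& t : tris)
        for (const Span& s : coverageOf(t))
            shadeSpan(s);
}

// Scanline rendering: outer loop over screen lines; visibility along each
// line is resolved first, so each pixel is shaded at most once.
void scanlineOrder(const std::vector<Triangle>& tris, int height) {
    for (int y = 0; y < height; ++y)
        for (const Span& s : visibleSpansOnLine(tris, y))
            shadeSpan(s);
}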
 
Per tile it still does triangle order rasterization, with deferred shading if I'm not mistaken.
The Nintendo DS is the only recent 'GPU' doing true scanline rendering as far as I know.

What do you mean by 'scanline rendering' in this context? Or by 'triangle order rasterization', for that matter?
I think we need to define what we're discussing first.
 
Let's not stumble over terminology. Here's a reference about scanline rendering:

http://en.wikipedia.org/wiki/Scanline_rendering

I know what scanline rendering is, I just didn't get what you were trying to say (yeah, modern GPUs don't use scanline rendering... we know that... but why bring it up in the first place? Doesn't make sense).
I suppose you didn't understand that I was talking about more than just rasterizing, and I didn't understand that you seemed to ignore everything BUT rasterizing.

I was talking about rendering different types of games in software. Although it could partially be extended to hardware, because hardware has evolved to deal with the demands of games, and vice versa.
E.g., the Quake I engine was nice for software T&L, but the approach didn't extend to hardware T&L very well. So that has nothing to do with rasterizing in the strictest sense. There are many more factors, obviously.
 
I was talking about rendering different types of games in software. Although it could partially be extended to hardware, because hardware has evolved to deal with the demands of games, and vice versa.
E.g., the Quake I engine was nice for software T&L, but the approach didn't extend to hardware T&L very well. So that has nothing to do with rasterizing in the strictest sense. There are many more factors, obviously.

For different types of games, or parts of games, you usually would make an engine, or sub-engine, that is tuned to the environment. For example: indoor Quake-style, big outdoor environments, flight simulators, space games, etc. But this type of engine is in my opinion more dictated by the environment you want to render than by whether you want to render it in hardware or software.

For Quake I, rasterizing is the dominant factor, because there really is not much lighting going on; it's basically just texture mapping.
For more recent games the focus has for sure shifted to nicer pixels with all kinds of shader techniques. Next the focus will probably shift to more detailed geometry with tessellation and displacement mapping (see my xvox demo for that matter: http://users.belgacom.net/xvox/).
Would you want to do this with scanline rendering? Probably not (unless you want to do raytracing, as this is also a form of scanline rendering).
Volume rendering also has interesting possibilities.
 
For different types of games, or parts of games, you usually would make an engine, or sub-engine, that is tuned to the environment. For example: indoor Quake-style, big outdoor environments, flight simulators, space games, etc. But this type of engine is in my opinion more dictated by the environment you want to render than by whether you want to render it in hardware or software.

I'm not sure if I agree with that. With hardware you don't have too much of a choice... basically you can either use deferred rendering or immediate rendering... but other than that you don't have too many options. So most of your engine is more about determining visible sets and sending them to the hardware efficiently.

With software however, there are many rendering methods to choose from, and some of them can be made much more efficient when you design your engine around them.

So I'm not sure if the order should be seen as: environment -> engine -> renderer
Or perhaps: environment -> renderer -> engine
I think it's more like: environment -> (renderer+engine)
It's not always entirely clear where the distinction between renderer and engine is, in a software approach.

For Quake I, rasterizing is the dominant factor, because there really is not much lighting going on; it's basically just texture mapping.

Yes, but I think that is a result of the hardware they had to work with, not a result of the type of environment/game they chose.
They just knew that Quake had to run on a Pentium in software, so there wasn't a lot of room for high-poly models or fancy lighting. Low-poly single-texturing was pretty much the only thing the Pentium could do in realtime. Hence they designed their environment around it.
Early 3D accelerators were not capable of much more either, but I doubt that they put much thought into running on hardware accelerators during the development of Quake.

So I think Quake 1 is predominantly rasterizer-bound because it was designed that way, not because of the type of game/environment in general. A sign of the times.
 
I've just included a 64-bit version.
This runs 15% faster than the 32-bit version, reaching 750 Mpix/s now :)

That's pretty cool...
So where did the 15% come from?
Did you write the SSE code with intrinsics and just recompile it? Or did you have to do some hand-optimization to get the 15% gain in 64-bit?
 
That's pretty cool...
So where did the 15% come from?
Did you write the SSE code with intrinsics and just recompile it? Or did you have to do some hand-optimization to get the 15% gain in 64-bit?

Indeed the code is written with intrinsics. So basically a recompile is all that was needed.
In 64-bit mode you have twice as many registers and this helps nicely to reduce the number of instructions and memory references.
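
To illustrate the kind of kernel involved (not the actual FQuake code): with SSE intrinsics the source is identical between the 32-bit and 64-bit builds, but x64's sixteen XMM registers let the compiler keep more values live without spilling.

#include <emmintrin.h>  // SSE/SSE2 intrinsics

// Hypothetical example: linear blend of two rows of packed floats
// (n assumed to be a multiple of 4). In 64-bit mode the blend factors
// and row data can all stay in XMM registers across the loop; with
// only eight registers in 32-bit mode the compiler is more likely to
// spill intermediates to memory.
void blendRows(const float* a, const float* b, float* out, int n, float t)
{
    __m128 vt  = _mm_set1_ps(t);
    __m128 vit = _mm_set1_ps(1.0f - t);
    for (int i = 0; i < n; i += 4) {          // 4 floats per iteration
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 r  = _mm_add_ps(_mm_mul_ps(va, vit),
                               _mm_mul_ps(vb, vt));
        _mm_storeu_ps(out + i, r);
    }
}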
 
Actually I first made this renderer over ten years ago; now this version runs 200 times faster. So it is ancient technology revamped with modern techniques.

Ahh thanks. So it's old rendering technology, essentially? Interesting stuff!

That was a needlessly harsh comment. Can you do any better?
Try at least to be constructive when giving negative feedback.


I was actually referring to the speed of the x86 processors. I'll be nicer to Intel next time.
 
Indeed the code is written with intrinsics. So basically a recompile is all that was needed.
In 64-bit mode you have twice as many registers and this helps nicely to reduce the number of instructions and memory references.

That's nice. Which compiler did you use?
I recall back in the early days of SSE that I tried the intrinsics of both MSVC and the Intel C/C++ compiler, but neither were very good at the time.
You still had to do a lot of manual unrolling and pre-optimization to get the expected results.
So I just continued to use handwritten assembly.
But I haven't done any SSE in recent years, so I don't know if the situation with intrinsics has improved much. I do use 64-bit mode these days, and I've seen similar speedups on some of my code as well. It could also have been related to SSE code, since my algorithm used quite a bit of floating-point arithmetic, just not of the SIMD kind. But the compiler will still generate SSE code for it.
 
That's nice. Which compiler did you use?
I recall back in the early days of SSE that I tried the intrinsics of both MSVC and the Intel C/C++ compiler, but neither were very good at the time.
You still had to do a lot of manual unrolling and pre-optimization to get the expected results.

The 64-bit version is compiled with VS2008; the 32-bit version is done with VS2005. For my application both generate code that runs at the same speed.
I switched to VS2008 as it allows me to experiment with SSE4; the current versions use SSE2. My first experiments seem to indicate that SSE4 would not result in big improvements.

The generated code via intrinsics looks quite good; at least there are no unnecessary moves. I don't know if the order of instructions is optimal, but with out-of-order execution it probably doesn't matter much.
 
I've just extended the renderer with a DX10 version. A DX10 WARP software rendering version is also included.
On my GTX 280, the DX10 GPU version runs about 6 times faster than the native FQuake software renderer. The DX10 WARP software renderer runs about 5 times slower than my renderer. I even made it a little more WARP-friendly by using a float Z buffer; a normal 24-bit Z buffer would be 25% slower.
For a fair comparison, the same single-texture-layer texturing technique is used for all renderers.
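
For reference, the WARP-friendlier depth format would be chosen in the D3D10 texture description, roughly like this (assumed setup; the rest of the device code is omitted):

#include <d3d10.h>

// A 32-bit float depth buffer spares a software rasterizer like WARP
// the 24-bit unpack/repack on every depth test.
D3D10_TEXTURE2D_DESC makeDepthDesc(UINT width, UINT height)
{
    D3D10_TEXTURE2D_DESC d = {};
    d.Width            = width;
    d.Height           = height;
    d.MipLevels        = 1;
    d.ArraySize        = 1;
    d.Format           = DXGI_FORMAT_D32_FLOAT;  // instead of DXGI_FORMAT_D24_UNORM_S8_UINT
    d.SampleDesc.Count = 1;
    d.Usage            = D3D10_USAGE_DEFAULT;
    d.BindFlags        = D3D10_BIND_DEPTH_STENCIL;
    return d;
}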
 