Fast software renderer

Well, that's interesting...
I suppose the GTX280 isn't quite memory-limited...
Your PC will have a memory bandwidth of what... close to 10 GB/s? The GTX280 has far more than 60 GB/s (about 142 GB/s), but I suppose the single-texturing nature means that it can't get anywhere near its theoretical limits. And the REAL power of the GTX280, its ALUs, will just be sitting idle :)

I suppose the WARP renderer also isn't really aimed at maximizing performance on single-texturing, because that's a rare case in modern DX9/DX10 applications.

I suppose it mainly proves that your renderer is custom-made for Quake, while WARP and the GTX280 aren't :)
 
Well, that's interesting...
I suppose the GTX280 isn't quite memory-limited...
Your PC will have a memory bandwidth of what... close to 10 GB/s? The GTX280 has far more than 60 GB/s (about 142 GB/s), but I suppose the single-texturing nature means that it can't get anywhere near its theoretical limits. And the REAL power of the GTX280, its ALUs, will just be sitting idle :)

You are right, the GPU rendering should go faster than it does. After some investigation it appears that rendering to a window slows it down. When rendering full screen, the rendering suddenly goes twice as fast. I'll make some adaptations. This would make the GPU 12 times faster than my renderer and 60 times faster than WARP. For WARP the rendering speed remains unchanged in full screen.


I suppose the WARP renderer also isn't really aimed at maximizing performance on single-texturing, because that's a rare case in modern DX9/DX10 applications.

I'm quite sure WARP's rendering speed would instantly halve with dual texture mapping. Filtered texture sampling is very demanding for a CPU. On a GPU it would not make much difference.
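Just to illustrate how much work is involved: below is a minimal, hypothetical sketch of a single scalar bilinear fetch, the kind of thing a software renderer has to do for every texture sample of every pixel. It is not WARP's code or mine; the 16.16 fixed-point coordinates, ARGB8 layout and wrap addressing are just assumptions for the example.

```cpp
#include <cstdint>

struct Texture {
    const uint32_t* texels;  // ARGB8 texels, row-major
    int width, height;       // assumed to be powers of two
};

// u, v are texture coordinates in 16.16 fixed point; wrap addressing.
uint32_t SampleBilinear(const Texture& t, int32_t u, int32_t v)
{
    // Addresses of the four neighbouring texels
    int x0 = (u >> 16) & (t.width  - 1);
    int y0 = (v >> 16) & (t.height - 1);
    int x1 = (x0 + 1)  & (t.width  - 1);
    int y1 = (y0 + 1)  & (t.height - 1);
    int fx = (u >> 8) & 0xFF;   // 8-bit fractional weights
    int fy = (v >> 8) & 0xFF;

    uint32_t c00 = t.texels[y0 * t.width + x0];
    uint32_t c10 = t.texels[y0 * t.width + x1];
    uint32_t c01 = t.texels[y1 * t.width + x0];
    uint32_t c11 = t.texels[y1 * t.width + x1];

    // Blend each 8-bit channel separately
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int a = (c00 >> shift) & 0xFF, b = (c10 >> shift) & 0xFF;
        int c = (c01 >> shift) & 0xFF, d = (c11 >> shift) & 0xFF;
        int top    = a + (((b - a) * fx) >> 8);
        int bottom = c + (((d - c) * fx) >> 8);
        result |= uint32_t(top + (((bottom - top) * fy) >> 8)) << shift;
    }
    return result;
}
```

Four texel loads plus a dozen multiplies per sample, before mipmapping or perspective correction even enters the picture; adding a second texture doubles all of it.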
 
I'm quite sure WARP's rendering speed would instantly halve with dual texture mapping. Filtered texture sampling is very demanding for a CPU. On a GPU it would not make much difference.

Depends on how they handled parallelism. Perhaps they structured the code generation to be able to filter multiple textures in parallel (which would generally be a common case in DX9/DX10 code).
Otherwise I don't really see why it would be so much slower than your solution. If it was optimized for single-texturing, it should have gotten pretty close, I would think.
 
Depends on how they handled parallelism. Perhaps they structured the code generation to be able to filter multiple textures in parallel (which would generally be a common case in DX9/DX10 code).
Otherwise I don't really see why it would be so much slower than your solution. If it was optimized for single-texturing, it should have gotten pretty close, I would think.

I just did a small test by adapting the pixel shader to sample the same texture at coordinates uv and 2*uv, and then outputting the average of the two samples.
The frame rate drops from 24 to 14 fps. And this is still sampling the same texture at similar coordinates.
My first version was also about three times slower than my last version; with careful tuning and some smart tricks, miracles are possible, as you will know :smile:
 
With the current DX10 versions it is actually already possible to switch to full screen and back by pressing ALT+Enter. To measure the frame rate in full screen, a tool like Fraps is needed. So you can see that full screen is a bit faster on the GPU compared to windowed.
 
My first version was also about three times slower than my last version; with careful tuning and some smart tricks, miracles are possible, as you will know :smile:

The same goes for hardware, though :)
The original BSP approach where you'd get a list of polygons from back to front was nice for software T&L solutions... but hardware T&L wanted larger batches to get better efficiency... so a 'leafy' BSP worked better there, since it was based around static batches of geometry, rather than rebuilding a polygon list every frame.
 
I just did a small test by adapting the pixel shader to sample the same texture at coordinates uv and 2*uv, and then outputting the average of the two samples.
The frame rate drops from 24 to 14 fps. And this is still sampling the same texture at similar coordinates.
I'm afraid that's really not representative of the characteristics of modern games. Texture sampling is still an expensive operation to perform on a CPU, but it's no longer dominating performance. The average number of texture sample operations per pixel only went up slightly, while the arithmetic work in a shader increased massively. Also, because of high-polygon environments and characters, the time spent on vertex shading, gradient setup and rasterization has become very substantial.

So while you've created one heck of a fast Quake renderer, it would be unfair to compare it directly to other software renderers capable of running much more advanced games. Their architecture differs significantly in order to support things like mipmapping on indirect textures, z-buffer based occlusion culling, massive texture sizes, various color formats, high-accuracy perspective correction and filtering, various addressing modes, etc. You have to keep software design and management restrictions in mind too. Supporting countless combinations of advanced features means you have to make a few compromises to the raw performance of plain texture sampling, or you'll end up with unmanageable complexity and have a hard time releasing a quality product.

Anyway, I love your work. It proves that CPUs have made major leaps in performance and that you just have to make good use of SIMD and multi-core. It also shows that when you're not bound by an API you can make things run significantly faster. This is also true for future GPUs as they become more and more like generic compute devices. It's still inevitable that at some point they'll merge entirely with the CPU. The only point of debate seems to be what the texture samplers will look like. In my humble opinion, scatter/gather units can perfectly fulfill that role and speed up many other parts of the classic 3D rendering pipeline while also being fully generic for any other uses...
 
Big déjà vu with that post, Nick :)
A few years ago we had pretty much the same discussion, where my Java renderer was also not bound by an API (although it was still far closer to D3D and far more flexible than a Quake renderer), unlike SwiftShader... so I could pull some tricks to get the same or even better performance in some cases, despite Java being slower.
Back then we also talked about implementing a raytracer as backend for D3D/OpenGL, but I said it would be a very poor idea performance-wise because of the API limitations.
 
I'm afraid that's really not representative of the characteristics of modern games. Texture sampling is still an expensive operation to perform on a CPU, but it's no longer dominating performance.

I'm not so sure. Some modern techniques like parallax mapping, fur rendering, subsurface scattering and volumetric effects like realistic fire, smoke, clouds, etc. are very taxing on texture mapping capabilities, so much so that they can overwhelm even current GPUs.


This is also true for future GPUs as they become more and more like generic compute devices. It's still inevitable that at some point they'll merge entirely with the CPU.

Whether they will merge, meaning become one and the same thing, I doubt. They will certainly fuse, meaning both present on the same die. For the high end, keeping them discrete will last for a long time yet, if only for the sake of cooling and memory bandwidth.


The only point of debate seems to be what the texture samplers will look like. In my humble opinion, scatter/gather units can perfectly fulfill that role and speed up many other parts of the classic 3D rendering pipeline while also being fully generic for any other uses...

Agreed, scatter/gather is very desirable. It's hard to believe that a souped-up SSE or AVX will not have it. While it may help software texture mapping, hardware can do it much more efficiently at little die cost.
 
I'm not so sure. Some modern techniques like parallax mapping, fur rendering, subsurface scattering and volumetric effects like realistic fire, smoke, clouds, etc. are very taxing on texture mapping capabilities, so much so that they can overwhelm even current GPUs.
I've done lots of experiments on actual games where I disabled certain features like texture sampling (replacing it with a single color), to determine the actual time spent on it. None of them spent more than 25% on texture sampling. Also, switching from bilinear filtering to point filtering typically only gives me a 10% speed increase.

The effects you mention don't really dominate the entire scene. And in fact most of them are quite ROP heavy, which is the reason why even modern GPUs can have some trouble with them.
Whether they will merge, meaning become one and the same thing, I doubt. They will certainly fuse, meaning both present on the same die. For the high end, keeping them discrete will last for a long time yet, if only for the sake of cooling and memory bandwidth.
Sure, the high-end will remain discrete. But for the low-end it's the logical next step. Once the CPU and GPU are on the same die it becomes immediately apparent that many components are duplicated. Today's CPUs reach 100 GFLOPS, and with AVX that will double. That's plenty for the arithmetic processing needs of low-end graphics.
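As a rough back-of-the-envelope for that figure (assuming a quad-core at about 3.2 GHz): 4 cores × 8 single-precision FLOPs per cycle (a 4-wide SSE multiply plus a 4-wide SSE add) × 3.2 GHz ≈ 102 GFLOPS. AVX doubling the SIMD width to 8 floats brings that to roughly 205 GFLOPS at the same clock.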
Agreed, scatter/gather is very desirable. It's hard to believe that a souped-up SSE or AVX will not have it. While it may help software texture mapping, hardware can do it much more efficiently at little die cost.
The remaining bottlenecks for software rendering are not just texture sampling, but also rasterization and transcendental functions, all of which can be greatly accelerated with scatter/gather instructions.
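To sketch what a gather buys a software texture sampler: eight texel fetches plus their address arithmetic can collapse into a couple of vector instructions. The snippet below is purely illustrative, using a 256-bit gather intrinsic (_mm256_i32gather_epi32) as an example of what such a unit could expose to software; the texture layout and function name are assumptions, not SwiftShader code.

```cpp
#include <immintrin.h>
#include <cstdint>

// Fetch eight ARGB8 texels at the given integer coordinates in one gather.
// texels: row-major texture data; pitch: texels per row; x, y: 8 coordinates.
static inline __m256i GatherTexels(const uint32_t* texels, int pitch,
                                   __m256i x, __m256i y)
{
    // index = y * pitch + x, computed for all eight pixels at once
    __m256i index = _mm256_add_epi32(
        _mm256_mullo_epi32(y, _mm256_set1_epi32(pitch)), x);

    // A single gather replaces eight scalar loads
    return _mm256_i32gather_epi32(reinterpret_cast<const int*>(texels),
                                  index, 4);
}
```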

Let's not forget that a unified CPU + IGP would be for people who don't consider themselves gamers. So it's not that big a deal if graphics don't run at the highest possible efficiency. But merely by adding scatter/gather units there would be no need to waste dedicated area on graphics (and note that scatter/gather helps multimedia and such as well).

When I won my Core i7 system, I got into a conversation with the store manager, and he told me the best-selling system was the cheapest one, costing 260 €, featuring a Core 2 Duo processor. Adding the cheapest discrete card brought it to 300 €, and nobody bought that. This shows that in a market where margins run into the single digits, every way to reduce cost further counts. If anyone could create an IGP that costs less but still offers adequate 3D support, they'd do it. But pretty soon, CPUs will be very capable of that...
 
I've done lots of experiments on actual games where I disabled certain features like texture sampling (replacing it with a single color), to determine the actual time spent on it. None of them spent more than 25% on texture sampling. Also, switching from bilinear filtering to point filtering typically only gives me a 10% speed increase.

Well, the Doom 3 tech supposedly uses 20 texture passes, and they are all over the screen, so this must have some impact; even Half-Life 2 had quite a few layers.
 
Let's not forget that a unified CPU + IGP would be for people who don't consider themselves gamers. So it's not that big a deal if graphics don't run at the highest possible efficiency. But merely by adding scatter/gather units there would be no need to waste dedicated area on graphics (and note that scatter/gather helps multimedia and such as well).

I think the biggest problem is not the hardware, but rather the revolution it requires in terms of driver and software support.
It's hard enough to get something like CUDA/OpenCL/etc. adopted.
So you could add nice 'graphics' features to your CPU, and even re-use them for the IGP functions of the same chip... but if no software will be using such CPU extensions anyway, why would you even bother trying to integrate the two? In fact, it seems to be a bit of a catch-22... if this functionality is on the GPU anyway, and you can use something like OpenCL from your application... why would you even want it in your CPU? When using OpenCL, it really doesn't matter where the actual code is being run... whether it's on a CPU, IGP or discrete GPU... who cares?

We're starting to see this already... For years we bought fast CPUs for 3D rendering, video encoding, physics processing etc... now that GPUs are starting to take over these tasks, who is going to care how well a CPU does them? If the GPU is faster anyway, you might as well just leave the functionality out of the CPU to make it simpler, cheaper and faster at the things it DOES do better than a GPU.
I think it will be a long time before CPUs with integrated graphics are more than just an IGP circuit copy-pasted onto the CPU die.

I also think it will be a LONG time until a CPU core and a GPU core are going to be virtually identical.
I mean, if you look at Larrabee... while it is built on x86 technology, it doesn't resemble a Core 2 or Core i7 in any way. With it being so different from regular x86 CPUs, I don't see Intel being able to merge the functionality of their GPUs with their CPUs anytime soon.
If it was just a case of taking a regular x86 CPU and adding some of the SIMD extensions that Larrabee receives, then that's what Intel would have done, but obviously that was not the way to go (you can say that 200 GFLOPS is enough for IGP-like graphics... but that only holds when you're talking about an actual IGP, which has more parallelism and various fixed-function units, making it far more efficient than a CPU with a few scatter/gather extensions).
And it's not just a high-end thing either, because Intel announced that their IGPs will be based on Larrabee... so it sounds like they'll just scale down Larrabee and copy-paste that onto their CPUs.

In fact, I wonder if that gap can ever be bridged... it's a case of a few complex execution cores aimed at maximum serial performance vs a case of many simple execution cores aimed at maximum parallel performance. Sounds like Amdahl to me, you just need both types of cores, since software is built up from both types of code.
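(For reference, the formula I'm alluding to: if a fraction p of the work parallelizes over N units, the overall speedup is bounded by 1 / ((1 - p) + p/N), so the serial part eventually dominates no matter how many simple cores you add.)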
 
The effects you mention don't really dominate the entire scene. And in fact most of them are quite ROP heavy, which is the reason why even modern GPUs can have some trouble with them.

Here's another semi one-liner:
The smarter algorithms use multi-texturing or raycasting, so virtually no ROP is needed. Clouds can be everywhere, as can parallax-mapped walls and floors.
 
In fact, I wonder if that gap can ever be bridged... it's a case of a few complex execution cores aimed at maximum serial performance vs a case of many simple execution cores aimed at maximum parallel performance. Sounds like Amdahl to me, you just need both types of cores, since software is built up from both types of code.

I very much think the same. For fast few-threaded CPU execution you need costly branch prediction, out-of-order execution and, in the case of x86, overly complex instruction decoders, resulting in fat cores. For massively parallel GPU execution you need loads of simple processors, and these are no good for single- or few-threaded loads.
 
I very much think the same. For fast few-threaded CPU execution you need costly branch prediction, out-of-order execution and, in the case of x86, overly complex instruction decoders, resulting in fat cores. For massively parallel GPU execution you need loads of simple processors, and these are no good for single- or few-threaded loads.

Well, I'd like to nuance that a bit... For massive parallelism the cores don't necessarily need to be simple and small from a theoretical point of view...
For parallel problems there are two ways to make them faster: make each processing unit faster, or add more processing units.

It's just a practical limitation. With the limited transistor budget, for a GPU it just happens to be more efficient to maximize the number of processors on the chip rather than making each processor run as quickly as possible.

It's just that with non-parallel problems, there's only one way to make them faster, and that's what CPU design has been focusing on for the past 30+ years...
 
It's just a practical limitation. With the limited transistor budget, for a GPU it just happens to be more efficient to maximize the number of processors on the chip rather than making each processor run as quickly as possible.

OK, one could add that these small, simple processors don't necessarily need to be slow, at least at data-stream processing with few branches. :smile:
 
Well, the Doom 3 tech supposedly uses 20 texture passes...
The exact approach is described here: John Carmack on Doom3 Rendering. It's pretty old technology by today's standards, but it already included some arithmetic work. Also, the depth and stencil passes are purely ROP limited. So when you run this game on a software renderer, texturing won't be the sole bottleneck any more.
...even Half-Life 2 had quite a few layers.
Try running it with SwiftShader, once with bilinear filtering and once with point filtering. My results are 23 and 21 FPS, standing in a fixed position, at 800x600 default settings.

So no matter how many textures it's sampling, it's spending far more time doing other things. Trust me, the first time I did these kinds of measurements I was surprised too. I had spent months optimizing texture sampling, and I expected it to still be the major bottleneck. But it simply isn't. The arithmetic workload has increased massively compared to texturing.

We've seen the same evolution reflected in hardware architectures. In the fixed-function era a chip with multiple texture units per pipeline was no exception. Nowadays, a chip like the RV790 has 800 stream processors, 40 texture units and 16 ROPs.
 
The remaining bottlenecks for software rendering are not just texture sampling, but also rasterization and transcendental functions, all of which can be greatly accelerated with scatter/gather instructions.
Texture decompression also comes to mind; the way it's done in current graphics hardware would fare very poorly in software. Different approaches would work well without specific decompression hardware and provide the same compression ratios as well as equal or better fidelity. Vector quantization in particular requires only indirect accesses for 'decompression' and no processing of the fetched texel data. It can also be made fairly cache-friendly.
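To make that concrete, here is a minimal, hypothetical sketch of what VQ 'decompression' amounts to; the 4x4 block size, 256-entry codebook and ARGB8 format are assumptions picked for the example, not a description of any shipping format.

```cpp
#include <cstdint>

struct VQTexture {
    const uint8_t*  indices;   // one codebook index per 4x4 block
    const uint32_t* codebook;  // 256 entries x 16 ARGB8 texels
    int widthInBlocks;         // texture width / 4
};

// "Decompression" is nothing more than an extra indirection.
inline uint32_t FetchTexelVQ(const VQTexture& t, int x, int y)
{
    int block = (y >> 2) * t.widthInBlocks + (x >> 2);      // which 4x4 block
    int entry = t.indices[block];                           // codebook index
    return t.codebook[entry * 16 + (y & 3) * 4 + (x & 3)];  // texel in block
}
```

The whole 'decoder' is two dependent loads and a bit of shifting and masking, and with a small codebook the hot data stays resident in the cache.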
 