Fast software renderer

Discussion in 'Rendering Technology and APIs' started by Voxilla, Jul 19, 2009.

  1. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    Here is a rather fast software renderer engine demo called FQuake.
    It renders a level from the original Quake game.

    http://users.skynet.be/fquake/


    This demo engine is highly optimized, making use of multiple threads and SSE code.
    Perspective-correct, bilinear texture mapping runs at 650 Mpix/s on a quad-core Core 2 at 3.2 GHz.

    The engine makes use of an algorithm that ensures zero overdraw.
    At 2560x1600 resolution the engine runs at between 120 and 160 frames per second, only slightly depending on scene complexity, corresponding to between 500 and 650 Mpix/s texture and pixel fill rate.
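As a quick check of the figures above: with zero overdraw every pixel is written once per frame, so the fill rate is just width × height × frames per second. A minimal sketch (the function name is only for illustration):

```python
def fill_rate_mpix(width: int, height: int, fps: float) -> float:
    """Millions of pixels written per second, assuming each pixel is drawn once per frame."""
    return width * height * fps / 1e6


low = fill_rate_mpix(2560, 1600, 120)
high = fill_rate_mpix(2560, 1600, 160)
print(f"{low:.1f} to {high:.1f} Mpix/s")
```

This gives roughly 490 to 655 Mpix/s, which matches the quoted 500 to 650 Mpix/s range.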

    The mapped texture consists of two layers, a material map and a light map. Though normally you would do this with multitexturing, the engine does it with single texturing. To make this possible, an LRU texture cache is maintained with material/light texture maps composited on the fly as needed.
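The caching idea can be sketched roughly as follows. This is not FQuake's code: the class name, the `modulate` helper and the dictionary-based textures are hypothetical, and the light map is assumed to have the same resolution as the material, which a real engine would not require.

```python
from collections import OrderedDict


def modulate(material, light):
    """Multiply each RGB texel by a 0..255 light intensity (same resolution assumed)."""
    return [tuple(c * l // 255 for c in texel) for texel, l in zip(material, light)]


class CompositeTextureCache:
    """LRU cache of pre-lit textures keyed by (material, light map)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, material_id, lightmap_id, compose):
        key = (material_id, lightmap_id)
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            return self._entries[key]
        texture = compose(material_id, lightmap_id)  # composite only on a miss
        self._entries[key] = texture
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the least recently used entry
        return texture


# Hypothetical demo data: two texels of material, two light samples.
materials = {"brick": [(200, 100, 50), (255, 255, 255)]}
lightmaps = {"hall": [255, 128]}
compose_calls = []


def compose(material_id, lightmap_id):
    compose_calls.append((material_id, lightmap_id))
    return modulate(materials[material_id], lightmaps[lightmap_id])


cache = CompositeTextureCache(capacity=2)
first = cache.get("brick", "hall", compose)
again = cache.get("brick", "hall", compose)   # served from the cache
```

The inner loop then only ever samples one texture per pixel; the cost of combining the two layers is paid once per cached surface rather than once per pixel.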

    To make optimal use of all CPU cores, the screen is split up according to the number of cores. The split positions are continuously moved to adapt to the scene complexity so that all cores stay maximally loaded.
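One way such adaptive splitting could work is to time each band during a frame and move the boundaries so that the next frame's bands have equal estimated cost. The sketch below is an assumption about the approach, not the engine's actual scheme; `rebalance` is a made-up name, the cost is assumed to be spread uniformly within each old band, and all measured times are assumed to be positive.

```python
def rebalance(bounds, times):
    """Move band boundaries so each band gets an equal share of last frame's cost.

    bounds: N+1 row indices, from the top row (0) to the screen height.
    times:  N positive per-band render times measured on the previous frame.
    """
    n = len(times)
    target = sum(times) / n
    new_bounds = [bounds[0]]
    band = 0
    consumed = 0.0  # total cost of the old bands that lie entirely above `band`
    for k in range(1, n):
        want = k * target
        # Skip old bands that end before the desired cumulative cost is reached.
        while consumed + times[band] < want:
            consumed += times[band]
            band += 1
        # Interpolate inside the old band, assuming uniform cost per row.
        frac = (want - consumed) / times[band]
        row = bounds[band] + frac * (bounds[band + 1] - bounds[band])
        new_bounds.append(round(row))
    new_bounds.append(bounds[-1])
    return new_bounds


# Four cores at 1600 rows: the two middle bands took three times as long.
print(rebalance([0, 400, 800, 1200, 1600], [1.0, 3.0, 3.0, 1.0]))
```

Here the expensive middle bands shrink and the cheap outer bands grow. A real implementation would probably damp the movement between frames to avoid oscillation.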

    You can fly through the scene by clicking the left and right mouse buttons for forward and backward movement. Holding the middle button together with left/right gives quad speed.
    You can also switch between bilinear and point sampling, by pressing the space bar.
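For readers unfamiliar with the two sampling modes, here is a minimal sketch on a single-channel texture. It assumes wrap addressing and texel values located at integer coordinates; the real engine works on 32-bit RGBA with SSE, which this does not attempt to reproduce.

```python
import math


def point_sample(tex, u, v):
    """Nearest-texel lookup with wrap addressing: one fetch per pixel."""
    h, w = len(tex), len(tex[0])
    return tex[math.floor(v) % h][math.floor(u) % w]


def bilinear_sample(tex, u, v):
    """Blend the four surrounding texels: four fetches and three lerps per pixel."""
    h, w = len(tex), len(tex[0])
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0                 # fractional position inside the texel cell
    x1, y1 = (x0 + 1) % w, (y0 + 1) % h
    x0, y0 = x0 % w, y0 % h
    top = tex[y0][x0] + (tex[y0][x1] - tex[y0][x0]) * fx
    bottom = tex[y1][x0] + (tex[y1][x1] - tex[y1][x0]) * fx
    return top + (bottom - top) * fy


tex = [[0, 100],
       [200, 50]]
print(point_sample(tex, 0.5, 0.5), bilinear_sample(tex, 0.5, 0.5))
```

The extra fetches and blends are the per-pixel work that separates the point-sampled and bilinear frame rates reported in this thread.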
     
  2. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    There was an issue preventing statistics from showing up in the title bar; this is fixed now.

    I may also make some DirectX versions to see how much faster a GPU would be or to compare with DirectX software renderers such as SwiftShader or WARP.
     
  3. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    113
    Location:
    New Zealand
    Is that about all you could do with that level of performance and complexity on an X86 GPU? It seems rather primitive, though I haven't thought about software renderers since, well, the Unreal 1 days really.
     
  4. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    Actually I first made this renderer over ten years ago, now this version runs 200 times faster. So it is ancient technology revamped with modern techniques.
     
  5. Captain Chickenpants

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    446
    Likes Received:
    14
    Location:
    Kings Langley
    That was a needlessly harsh comment. Can you do any better?
    Try at least to be constructive, when giving negative feedback.

    CC
     
  6. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    I guess I'm missing something, but I thought Quake was originally software rendered already? Cool to see this kind of thing though. It's really interesting to see how far CPUs have come over the years in relation to rendering graphics.
     
  7. Anteru

    Newcomer

    Joined:
    Jul 4, 2004
    Messages:
    114
    Likes Received:
    3
    Have you checked how fast the Quake 1 rasterizer works on the same test machine? This would be an interesting comparison ...
     
  8. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Yes, but it didn't do bilinear filtering. So Jan's renderer looks more like what Quake looked like with hardware rendering in its days, but at far higher resolution and framerate. :)
    It only lets me select 1280x1024 resolution and point filtering, but it caps at 60 FPS (vsync). Anyway the original Quake spends only a few clock cycles on per-pixel work so it's constantly bandwidth limited. You can do bilinear filtering and such 'for free' before being compute limited, and by using SSE and multi-core you get even more possibilities.
     
  9. Anteru

    Newcomer

    Joined:
    Jul 4, 2004
    Messages:
    114
    Likes Received:
    3

    Hmm, on my dual-dual-xeon with 3.0 GHz, I get like 186 fps @ 1280x1024 with FQuake, with point filtering -- and 120 with bilinear. Which is by no means bad, but I thought that a rasterizer on a CPU can be much faster than that. Especially after I played around a bit with DX10 WARP, I'm actually pretty impressed how fast a SW rasterizer can be.

    Is this due to being limited by bandwidth? Thread-synchronization? Other issues? If so, then maybe a different test scene would be interesting, as it's hard to judge how fast this really is.

    @Nick: Do you have any peak numbers for SwiftShader2 -- that is, how fast you can render like two textured triangles spanning the whole screen? Just to get an idea where the bottlenecks are.
     
  10. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    I just checked. Nick, by typing timerefresh it can run at full speed. On my system, the original Quake runs at about 200 frames per second at 1280x1024, at some place in the middle of the first hall.
    At the same place with my renderer and point sampling I'm getting 250 frames per second at 2560x1600. So I'm about 4 times faster. In fact I seem to be limited by the 4 GB/s PCIe bus. Another difference is that the original does everything with 8 bits per pixel/texel, where I'm doing 32-bit RGBA.
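The "about 4 times faster" and "4 GB/s" figures can be checked with a little arithmetic. The sketch assumes the original Quake run was at 1280x1024 and that each finished frame crosses the bus uncompressed at 4 bytes per pixel; the variable names are only illustrative.

```python
def pixels_per_second(width, height, fps):
    return width * height * fps


quake = pixels_per_second(1280, 1024, 200)    # original Quake, 8-bit pixels
fquake = pixels_per_second(2560, 1600, 250)   # FQuake, point sampling, 32-bit RGBA

speedup = fquake / quake
upload_gb_per_s = fquake * 4 / 1e9            # 4 bytes per RGBA pixel

print(f"speedup: {speedup:.2f}x, frame upload: {upload_gb_per_s:.3f} GB/s")
```

That works out to about 3.9 times the pixel throughput, and roughly 4.1 GB/s of frame data, which is consistent with the bus being the limit at this resolution and frame rate.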
     
  11. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    Are these Xeons P4-based or Core-based? The Core ones are much faster. With two screens it is slow for the moment; you have to disable one of them. If you have Vista and no Aero, it can also be much slower. To get optimal speed you also need PCIe gen 2.

    Give me some time and I'll make a DX10 version, so we can see how WARP would do.
     
  12. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    SwiftShader's bottlenecks are a bit different. To start with it renders pixel quads to support mipmapping of indirect texturing, so a fair bit of time is spent on transposing data. It's mostly compute limited.

    Also, texture sampling/filtering is not the biggest bottleneck any more with modern games. Switching to point filtering typically only wins you 5-10%. So I actually use game benchmarks much more often than synthetic benchmarks, to find the real bottlenecks...
     
  13. Anteru

    Newcomer

    Joined:
    Jul 4, 2004
    Messages:
    114
    Likes Received:
    3
    These are Core2 based Xeons, and I had my second screen disabled (otherwise, it wouldn't even start). The graphics card is a GTX 280, so it's surely PCIe gen 2? Running on XP, will try on Vista.

    Regarding 4 times faster: Well, if your rasterizer is multithreaded, and you have 4 cores, I would expect at least 4 times the performance of the Quake rasterizer. However, I thought that even on a single core, we can do better these days using SSE and other optimizations -- surely the Quake 1 code does not take the best advantage of current-gen CPUs.
     
  14. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Yea, if you want to render Quake I as quickly as possible, you take an entirely different approach than when you want to render a modern game as quickly as possible.

    So my question is: what is the goal of this engine? Do you just want to have a super-fast Quake engine, or do you want to have more modern shading as well?

    Also, can you tell a bit more about the internals of the engine? You say it has zero overdraw, is this done with a spanbuffer approach (another choice which depends a lot on the type of scenes you want to render)?

    And the screen split-up, how is this done? With just horizontal 'bands' for each core, or with interleaved scanlines, checkerboard patterns... etc?
    Does this only apply to the rasterizing stage, or do you also do something smart with multiple cores during transform and lighting? Etc... :)
     
  15. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    How funny that point filtering and bilinear look so similar. Crisp and a bit noisy both ways, that must be quite a low texture LOD.
    I like it! It may be better than the "3dfx blur" I was expecting.

    On a single core (sempron @ ~2.3GHz) I get roughly 30 to 45 megapixels/s with bilinear and 60 to 100 megapixels point filtered. (1024x768 res.)

    I would love your engine to run a Counter-Strike map (did the quick stupid job of copying and renaming cs_italy.bsp but it doesn't work). I still long for an open source clone of the original Half-Life / Counter-Strike 1.5.
    I can sense a niche for your engine: the Chinese Loongson 3 processor ;).
     
    #15 Blazkowicz, Jul 22, 2009
    Last edited by a moderator: Jul 22, 2009
  16. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    With a GPU the approach would be the same. As the rasterizer is fixed function there is not much choice anyway. With CPU rendering there are plenty of possibilities.


    The goal was just updating an age old engine I wrote 12 years ago and see how much CPUs can be pushed nowadays.
    I currently have no intention to turn it into something more. Applying it to more recent versions of Quake would be interesting though.


    It uses a (parallelized) scanline algorithm, which has numerous advantages for CPU rendering.
    For huge amounts of small triangles it has relatively higher overhead, but there might be ways to stretch it more than one would think.
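The post does not say how the scanline algorithm avoids overdraw. One classic option, the span buffer Scali asks about, is sketched here purely as an illustration: surfaces are submitted front to back, and each scanline keeps a sorted list of covered intervals, so a new span only draws the parts nobody has claimed yet. The function name and data layout are hypothetical, not FQuake's.

```python
def insert_span(covered, x0, x1):
    """Clip the half-open span [x0, x1) against already covered intervals.

    covered: sorted list of disjoint (start, end) intervals on this scanline.
    Returns (visible_pieces, new_covered); only visible_pieces need shading.
    """
    if x0 >= x1:
        return [], covered
    visible = []
    merged = []
    cur = x0                      # left edge of the part not yet classified
    new_start, new_end = x0, x1
    for s, e in covered:
        if e < x0 or s > x1:      # neither overlapping nor touching the new span
            merged.append((s, e))
            continue
        if s > cur:               # a gap before this interval is still uncovered
            visible.append((cur, min(s, x1)))
        cur = max(cur, e)
        new_start = min(new_start, s)
        new_end = max(new_end, e)
    if cur < x1:
        visible.append((cur, x1))
    merged.append((new_start, new_end))
    merged.sort()
    return visible, merged


covered = []
drawn = []
for span in [(10, 50), (30, 80), (0, 100)]:   # nearest surface first
    pieces, covered = insert_span(covered, *span)
    drawn.append(pieces)
print(drawn, covered)
```

In this example the three spans together shade exactly 100 pixels on a 100-pixel scanline, so nothing is drawn twice. The per-span list maintenance is the kind of overhead that grows with many small triangles.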


    The split up is with horizontal bands. Other ways like tiles are possible too with little overhead. First I tried fixed size interleaved small bands. I thought this would work best, but it didn’t. Having large adaptive sized bands turned out to be faster. My take is that this works better for the CPU caches. This way textures used by one core rarely end up in the cache of another core that doesn’t need it.


    As the amount of geometry is fairly light, transform and lighting is done single threaded.
     
  17. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Well, GPUs have used various different approaches over the years, from actual fixed-function single-textured rendering in the days of Quake 1 to today's unified shader architectures (which obviously can still run Quake 1).
    For GPUs the tradeoff is just different.

    Quake II has a very similar engine, so that shouldn't be too much of a problem, I guess. I'm not sure what Quake III does exactly, and Quake IV is totally different of course, you'd get into Nick's territory there, with SwiftShader :)

    I recall back at university when Ewald and I were playing around with Quake I rendering as well. One thing Ewald had done was dynamic RGB lighting, which was pretty cool actually. It would dynamically update the lightmaps by projecting the light source into them, and still use the Quake-style pre-cached light*texture rendering.

    Yea, ever seen the Amiga demo "When We Ride On Our Enemies" by Skarla?
    It uses an engine very similar to the Quake 1 engine, but on Amiga. The Amiga had VERY low memory bandwidth, so the spanbuffer + BSP approach was an excellent solution there.

    But these days, with far higher polycounts, a simple z-buffer may be better, because it has no extra per-triangle overhead. I'm not sure where the exact transition point would lie though.

    Yea, it probably wouldn't make much of a difference for a Quake level.
    I've been thinking about how to best multithread a software renderer, but I suppose the best efficiency would come from alternate frame rendering, like SLI/CrossFire.

    Trying to divide up the workload on a level lower than a complete frame will require extra overhead, no matter how you slice it. And you'll also run into more complex load balancing issues.
     
  18. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    By rasterizer I mean the thing that turns geometry into pixels. So the thing that tells you which pixels to shade.
    For the last decade I've been doing GPU rendering, so I know how they have evolved.
     
  19. Scali

    Regular

    Joined:
    Nov 19, 2003
    Messages:
    2,127
    Likes Received:
    0
    Well, I'm used to the term 'rasterizing' being used either as the rasterizing process itself, or the rasterizing and the actual outputting of the pixels.
    In the context of the quote you responded to, I naturally assumed you meant the latter, since I clearly was talking about rendering/shading/multitexturing.

    Can you rephrase what you meant to say? Because I'm not quite sure what to make of it. Clearly hardware rasterizers are not the same as software rasterizers... and even then, rasterizers aren't something that a programmer has control over... so what do you mean?
     
  20. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    You can do it the 3dfx way (as long as the rendering is simple enough): give every other scanline to each thread, or divide by bands of pixels as done on the VSA-100 (the number of pixels was configurable, it was maybe 8 by default).
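The two ways of dividing scanlines discussed in this thread can be written as simple mappings from a row to the thread that owns it. This is only a sketch with invented function names; the default of one row per interleaved band and the equal-height contiguous bands are assumptions, not a description of the VSA-100 or of FQuake.

```python
def interleaved_owner(y, threads, band_rows=1):
    """Fixed-size interleaved bands: band k of `band_rows` rows goes to thread k mod threads."""
    return (y // band_rows) % threads


def contiguous_owner(y, height, threads):
    """One large horizontal band per thread, here with equal heights."""
    return y * threads // height


# Two threads with alternating scanlines, then 8-row bands over four threads.
print([interleaved_owner(y, 2) for y in range(4)])
print([interleaved_owner(y, 4, band_rows=8) for y in (0, 8, 17, 32)])
print([contiguous_owner(y, 1600, 4) for y in (0, 400, 1599)])
```

Interleaving balances load automatically, since every thread sees a slice of every part of the screen. Contiguous bands keep each thread's texture accesses more local, which is the cache argument made earlier in the thread, but they need explicit load balancing.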

    The scaling was great, and a Voodoo5 could take on UT2003 with a good CPU. (It was maybe limited by triangle setup when there's much geometry.)

    I wonder which techniques prevent that method from working with more modern rendering. (shaders that want to be aware of the whole scene or of a pixel in another band?)
     