SwiftShader 2.0: A DX9 Software Rasterizer that runs Crysis

The SwiftShader 2.01 update is now available for everyone from the website. It improves performance on AMD processors, and includes several fixes for rendering issues.
 
So you've updated your software renderer in a fraction of the time it takes ATI/NVIDIA to update their drivers.
You're setting a bad example :D
 
Trials 2 works fine with the new version, and is very playable with post-process effects set to minimum. Any chance you'd write an optimization guide for the renderer any time soon? I just did some more bilinear -> nearest and float -> half optimizations for our shaders, but I'm pretty sure the half floats aren't all that optimal for software shading, and the DEC3 -> float normal vector conversion isn't that fast either.
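
For context, here is a minimal sketch of the vertex-format trade-off being described, assuming the standard D3D9 declaration API (the layouts and offsets are illustrative, not the actual Trials 2 code):

```cpp
// With a software renderer, every DEC3N normal has to be unpacked to floats
// per vertex, whereas FLOAT3 can be read directly.
#include <d3d9.h>

// Packed layout: normal as DEC3N (signed 10:10:10, normalized).
// Saves 8 bytes per vertex, but costs a per-vertex unpack on the CPU.
const D3DVERTEXELEMENT9 packedDecl[] =
{
    {0,  0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
    {0, 12, D3DDECLTYPE_DEC3N,  D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,   0},
    D3DDECL_END()
};

// Plain layout: full floats everywhere; larger, but 'native' to the CPU.
const D3DVERTEXELEMENT9 plainDecl[] =
{
    {0,  0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0},
    {0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL,   0},
    D3DDECL_END()
};
```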

Is DXT5 faster than ARGB8 in SwiftShader? The DXT5 footprint is 4 times smaller, but it requires so many extra operations to decode. We used a DXT compression variant in our old N-Gage software renderer, and it was pretty quick (though still slower than the N-Gage's native 4444 format).
 
New results with V2.01:

cubemap.exe @ 1280x960 (TLB patch enabled)

1 thread: 11.2 FPS
2 threads: 17 FPS
3 threads: 17 FPS
4 threads: 21 FPS

TLB patch disabled

1 thread: 13.7 FPS
2 threads: 21 FPS
3 threads: 21 FPS
4 threads: 25 FPS
 
The shadows+spheres demo still doesn't work properly. The lighting seems okay now, but the shadow volumes are still rendered as random garbage.

The popping polygons in the skinned shadow volume demo aren't fixed either.
The back-facing polygons are no longer lit in the CSG demo though, and the Prosaic demo no longer crashes.
 
Any chance you'd write an optimization guide for the renderer any time soon?
That might become a good idea in the near future. Currently we have a whitepaper explaining the overall architecture, but we exchange optimization details on a per-customer/application basis. Anyway, here are a few basic guidelines (off the top of my head):
- Nothing comes for free: Any feature you use translates to extra code. For instance, to render a 2D menu, disable bilinear filtering and mipmapping. It won't speed up hardware rendering, but it will affect software rendering.
- Use data formats 'native' to the CPU: 8-bit, 16-bit, and 32-bit data elements are typically faster. Everything else needs to be converted anyway, and it's often not worth the bandwidth savings.
- Reduce overdraw: One of the most effective approaches is to first render the scene to the z-buffer only, then perform a second pass with color calculations enabled (see the sketch after this list).
- You're CPU limited anyway: It can be well worth doing some extra work on the application side to reduce the workload of the software renderer.
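
Two of these guidelines lend themselves to a short sketch. Below is a minimal D3D9 version of the depth-only pre-pass from the overdraw point, plus the sampler states for the 2D menu tip. DrawScene() is a placeholder for the application's own draw code, and none of this is SwiftShader-specific:

```cpp
#include <d3d9.h>

void DrawScene(IDirect3DDevice9* device);  // placeholder for the app's draw code

void RenderWithDepthPrepass(IDirect3DDevice9* device)
{
    // Pass 1: z-buffer only. With color writes off, the renderer can skip
    // all pixel shading and texture sampling.
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    device->SetRenderState(D3DRS_ZENABLE, D3DZB_TRUE);
    device->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_LESSEQUAL);
    DrawScene(device);

    // Pass 2: full shading, but D3DCMP_EQUAL rejects every pixel that lost
    // the depth test in pass 1, so each visible pixel is shaded exactly once.
    device->SetRenderState(D3DRS_COLORWRITEENABLE,
        D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
        D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
    device->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);  // depth buffer is final
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_EQUAL);
    DrawScene(device);
}

// For the 2D menu tip: point sampling with mipmapping off.
void SetMenuSamplerStates(IDirect3DDevice9* device)
{
    device->SetSamplerState(0, D3DSAMP_MAGFILTER, D3DTEXF_POINT);
    device->SetSamplerState(0, D3DSAMP_MINFILTER, D3DTEXF_POINT);
    device->SetSamplerState(0, D3DSAMP_MIPFILTER, D3DTEXF_NONE);
}
```
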
Is DXT5 faster than ARGB8 in SwiftShader? The DXT5 footprint is 4 times smaller, but it requires so many extra operations to decode.
DXTn is converted to ARGB8 before the actual rendering, so aside from the one-time decoding the performance is the same. Sampling ARGB8 is highly optimized; the arithmetic cost of decoding DXTn during sampling would be higher than anything you could save by increasing cache hit rates and reducing bandwidth. Use A8 if you only need an alpha component, though.
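
For the curious, that one-time decode roughly amounts to the following per 16-byte block. This follows the standard DXT5 block layout; it is not SwiftShader's actual code:

```cpp
#include <stdint.h>

// Decode one 16-byte DXT5 block into a 4x4 tile of A8R8G8B8 pixels.
void DecodeDXT5Block(const uint8_t block[16], uint32_t out[16])
{
    // Bytes 0-1: alpha endpoints; build the 8-entry alpha palette.
    uint8_t a0 = block[0], a1 = block[1];
    uint8_t alpha[8] = {a0, a1};
    if (a0 > a1)
        for (int i = 1; i < 7; i++)
            alpha[i + 1] = (uint8_t)(((7 - i) * a0 + i * a1) / 7);
    else
    {
        for (int i = 1; i < 5; i++)
            alpha[i + 1] = (uint8_t)(((5 - i) * a0 + i * a1) / 5);
        alpha[6] = 0;
        alpha[7] = 255;
    }

    // Bytes 2-7: sixteen 3-bit alpha indices.
    uint64_t aBits = 0;
    for (int i = 0; i < 6; i++)
        aBits |= (uint64_t)block[2 + i] << (8 * i);

    // Bytes 8-11: two RGB565 endpoints; DXT5 always uses 4-color mode.
    uint16_t c0 = (uint16_t)(block[8] | (block[9] << 8));
    uint16_t c1 = (uint16_t)(block[10] | (block[11] << 8));
    uint8_t r[4], g[4], b[4];
    r[0] = (uint8_t)(((c0 >> 11) & 31) * 255 / 31);
    g[0] = (uint8_t)(((c0 >> 5) & 63) * 255 / 63);
    b[0] = (uint8_t)((c0 & 31) * 255 / 31);
    r[1] = (uint8_t)(((c1 >> 11) & 31) * 255 / 31);
    g[1] = (uint8_t)(((c1 >> 5) & 63) * 255 / 63);
    b[1] = (uint8_t)((c1 & 31) * 255 / 31);
    r[2] = (uint8_t)((2 * r[0] + r[1]) / 3);  // 2/3 c0 + 1/3 c1
    g[2] = (uint8_t)((2 * g[0] + g[1]) / 3);
    b[2] = (uint8_t)((2 * b[0] + b[1]) / 3);
    r[3] = (uint8_t)((r[0] + 2 * r[1]) / 3);  // 1/3 c0 + 2/3 c1
    g[3] = (uint8_t)((g[0] + 2 * g[1]) / 3);
    b[3] = (uint8_t)((b[0] + 2 * b[1]) / 3);

    // Bytes 12-15: sixteen 2-bit color indices, four per byte, one row each.
    for (int i = 0; i < 16; i++)
    {
        int ci = (block[12 + i / 4] >> (2 * (i % 4))) & 3;
        int ai = (int)((aBits >> (3 * i)) & 7);
        out[i] = ((uint32_t)alpha[ai] << 24) | ((uint32_t)r[ci] << 16)
               | ((uint32_t)g[ci] << 8) | b[ci];
    }
}
```
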
We used a DXT compression variant in our old N-Gage software renderer, and it was pretty quick (though still slower than the N-Gage's native 4444 format).
Interesting. Was it more like a palettized format? Maybe in the future cache hit ratios and bandwidth will become more important for SwiftShader too.
 
Interesting. Was it more like a palettized format? Maybe in the future cache hit ratios and bandwidth will become more important for SwiftShader too.

It had all four colors stored in the 4x4 block, and no interpolation. We only used it for 2D sprite rendering (and backgrounds). A format like that was fast to use because we could blit it in 4x4 blocks. For polygon rendering we used the 4444 format, because it was the native format of the N-Gage screen buffer.
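
A hypothetical reconstruction of a format like the one described (the actual N-Gage code isn't shown in this thread) could look like this: four literal 4444 colors per block plus sixteen 2-bit indices, blitted with a plain table lookup and no interpolation:

```cpp
#include <stdint.h>

struct Block4x4
{
    uint16_t palette[4];  // four A4R4G4B4 colors, stored verbatim
    uint32_t indices;     // sixteen 2-bit indices, pixel 0 in the low bits
};

// Blit one block to a 4444 destination surface at (x, y).
// 12 bytes per 16 pixels (6 bpp) versus 32 bytes for raw 4444.
void BlitBlock(const Block4x4& blk, uint16_t* dst, int pitchInPixels, int x, int y)
{
    for (int row = 0; row < 4; row++)
    {
        uint16_t* p = dst + (y + row) * pitchInPixels + x;
        for (int col = 0; col < 4; col++)
        {
            int i = row * 4 + col;
            p[col] = blk.palette[(blk.indices >> (2 * i)) & 3];
        }
    }
}
```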
 
I'm starting to wonder if there's something wrong with this Phenom system.

I got 15 FPS for 1 and 2 cores, 6 FPS for 3 cores, and 4 FPS for 4 cores. Phenom 9500, TLB patch disabled, Windows XP, Cubemap (400x300).

Is there some setting that needs to be changed for Phenoms? It seems to run fine on my Q6600 (in Vista, no less).
 
Just curious, Chrono: are you using SwiftShader 2.01, which has the AMD perf fixes?

I'll give this a whack later tonight on my Q9450 rig and play some TrackMania with it.
 
SS 2.01 results from my 3.6 GHz Q9450 (the framerate is the lowest integer value observed in the meter; threading was verified via Task Manager):

Cube Map: 400x300
1 thread: 93 FPS
2 threads: 128 FPS
3 threads: 124 FPS
4 threads: 119 FPS

Cube Map: 1600x1200 fullscreen
1 thread: 24 FPS
2 threads: 39 FPS
3 threads: 40 FPS
4 threads: 52 FPS

And yes, I double-checked that last set of scores: the framerate really didn't change between 2 and 3 cores, but the 4th core suddenly made it a whole lot better. I'll go give it a whack in TrackMania Nations, but given the current results, I'm relatively impressed.

Is there any thought to optimizing further for SSE4.x? There's supposedly a lot of extra floating-point calculation power hiding in these Penryn cores if you go in that direction...

Edit: TrackMania results
With four threads, the most I can stand is 848x480 with minimum settings (pressed "reset defaults", then selected minimum). I didn't have a framerate counter, but I'd wager it was somewhere in the high teens. That's not to say I'm disappointed; it's more just for reference. I'll try a few other games just to see, such as GTA3:SA, here shortly...
 
Is there any thought to optimizing further for SSE4.x? There's supposedly a lot of extra floating-point calculation power hiding in these Penryn cores if you go in that direction...
Certainly. SSE4 has a number of interesting instructions that could make a difference. Just like x86-64, support for it will be added in due time.
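
As one illustration of the kind of instruction that could matter (a hypothetical example; which instructions SwiftShader would actually use isn't stated here): SSE4.1's DPPS collapses a four-wide dot product into a single instruction, where SSE2 needs a multiply followed by a shuffle/add chain:

```cpp
#include <smmintrin.h>  // SSE4.1 intrinsics (compile with -msse4.1)

// Dot product of two 4-float vectors, result broadcast to all lanes.
static inline __m128 Dot4_SSE41(__m128 a, __m128 b)
{
    return _mm_dp_ps(a, b, 0xFF);  // high nibble: use all inputs;
                                   // low nibble: write all output lanes
}

// The SSE2 equivalent for comparison: multiply, then add horizontally.
static inline __m128 Dot4_SSE2(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);
    __m128 t = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
}
```
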
I'll try a few other games just to see, such as GTA3:SA here shortly...
I installed and tested it, and unfortunately it was very slow. I quickly located the issue though, and it now runs at 10-20 FPS on my Q6600. That's playable, but still a bit disappointing given the relative graphical simplicity of the game. Maybe the engine doesn't do much to prevent overdraw, or it doesn't render things in large batches. It was first released on PlayStation 2, so it's probably an unoptimized port...
 
Certainly. SSE4 has a number of interesting instructions that could make a difference. Just like x86-64, support for it will be added in due time.

Would it be possible to use an x86-64 DLL in 32-bit applications? Since the draw calls generally take a lot of time, it might be interesting to 'thunk' the call over to a 64-bit DLL, which should have some performance advantages given the larger number of registers and other small optimizations in 64-bit mode.
I know that some Windows APIs are implemented this way, but I'm not sure exactly how it's done, and whether it's possible for 'regular' DLLs as well, or only for certain system libraries.
 
Nick, was the setting in GTA3 something that I'm able to change as a demo user? If so, do you mind sharing what it was?
 
Just curious, Chrono: are you using SwiftShader 2.01, which has the AMD perf fixes?

I'll give this a whack later tonight on my Q9450 rig and play some TrackMania with it.

Yes, that's what surprised me: I had even poorer results than when I tried 2.0.

I'll probably give it another shot when I have some time just in case I messed something up royally.
 
Yes, that's what surprised me: I had even poorer results than when I tried 2.0.

I'll probably give it another shot when I have some time just in case I messed something up royally.

:D Sorry if it sounded like I was insinuating you were dumb or anything; I too am surprised at your results.
 
Nick, was the setting in GTA3 something that I'm able to change as a demo user? If so, do you mind sharing what it was?
It's not something that can be fixed in the 2.01 release by changing a setting, unfortunately. The game frequently calls the IDirect3DDevice9::StretchRect function, and it was (unnecessarily) taking a very slow generic fallback path. It was a one-line fix to make it much faster, but it requires a new build. I don't think there will be another update for a while now, though.
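
For reference, this is the call in question, with the standard D3D9 signature. The surfaces here are placeholders; this just shows the shape of the API, not the game's actual code:

```cpp
#include <d3d9.h>

// IDirect3DDevice9::StretchRect copies (and optionally scales and filters)
// one surface to another; the game calls this frequently.
HRESULT CopySurface(IDirect3DDevice9* device,
                    IDirect3DSurface9* src, IDirect3DSurface9* dst)
{
    // NULL rectangles mean "the whole surface" on both sides.
    return device->StretchRect(src, NULL, dst, NULL, D3DTEXF_NONE);
}
```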
 
Ah, well, no worries. Not like I don't have other rendering options ;) But yes, it was ungodly slow when I tried it, and I made the mistake of not changing the video settings first, so by the time I made it into the game it was faster to CTRL-ALT-DEL and kill the task than to wait for the GUI.
 
Nick & Co., very nice work!
I completely agree with you that there is definitely a market for a product like this. It's incredible how many crappy embedded graphics chips are out there, especially in laptops... it's really depressing.
Plus, if the business doesn't take off, maybe you can check with Intel and see if they need any help with their Larrabee driver :)

@Scali: Cut Nick some slack, will you? I don't understand why you need to be so critical. It's not like you have to do the work!
 
Nick & Co., very nice work!
I completely agree with you that there is definitely a market for a product like this. It's incredible how many crappy embedded graphics chips are out there, especially in laptops... it's really depressing.
Plus, if the business doesn't take off, maybe you can check with Intel and see if they need any help with their Larrabee driver :)

@Scali: Cut Nick some slack, will you? I don't understand why you need to be so critical. It's not like you have to do the work!

So when do you start work for Intel, Nick? :cool:
 