SwiftShader 2.0: A DX9 Software Rasterizer that runs Crysis

Here's something interesting I noticed on my C2Q Q6600 with the cubemap.exe demo.

Running on 4 cores is slower than running on 3, which is slower than running on 2 cores. 2 cores are always faster than 1 core, though.
Yes, that sample in particular doesn't scale well beyond dual-core. Preliminary analysis seemed to indicate that it's a problem with very low polygon counts per draw call, and inter-core bandwidth and latency. If you increase the resolution you should see some speedup for 4 threads versus 2. Actual games also seem to behave much better with quad-core.
However, even that has an interesting effect. If I select cores 0 and 1 or 2 and 3, I get 110FPS. If I select another combination, I only get 90FPS (a single core gets about 70FPS).

It seems that the shared L2 cache of the C2D makes a huge difference here. Once I have to use the FSB, the performance starts to plummet.
Yes, that's definitely a major factor. It would be interesting to know how this application behaves on a Phenom processor, and of course Nehalem... :D
 
Tested SwiftShader with our newest game Trials 2 Second Edition on my Core 2 Quad + Vista 64-bit computer. The game runs well, all UI graphics are fine, and vertex shaders seem to work properly. Post-process filters also seem to work fine (motion blur, depth of field, and light blooms all look correct). The skinned driver model is also rendered 100% correctly (with proper lighting and self-shadowing). So the deferred-rendering g-buffers contain valid data and all the lighting shaders work. However, all mipmapped textures are completely broken. We are mainly using DXT5-compressed textures; we manually lock all mip levels and copy the compressed data there, copying the mip data from one big texture atlas into the mip levels. Is this a problem for SwiftShader, or do you think there is some other problem?
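For reference, manually filling the mip chain of a DXT5 texture under D3D9 looks roughly like this (a minimal sketch, not the game's actual code; the atlas source pointer per level is a hypothetical placeholder):
Code:
#include <d3d9.h>
#include <algorithm>
#include <cstring>

HRESULT UploadDxt5Mips(IDirect3DDevice9* device, UINT width, UINT height, UINT levels,
                       const BYTE* const* atlasMipData,   // hypothetical: pre-compressed data per level
                       IDirect3DTexture9** outTexture)
{
    HRESULT hr = device->CreateTexture(width, height, levels, 0, D3DFMT_DXT5,
                                       D3DPOOL_MANAGED, outTexture, NULL);
    if (FAILED(hr)) return hr;

    for (UINT level = 0; level < levels; ++level)
    {
        D3DLOCKED_RECT rect;
        hr = (*outTexture)->LockRect(level, &rect, NULL, 0);
        if (FAILED(hr)) return hr;

        // DXT5 stores 4x4 texel blocks at 16 bytes each; copy one row of blocks at a time
        // so the destination pitch reported by LockRect is honoured.
        UINT levelW     = std::max(1u, width  >> level);
        UINT levelH     = std::max(1u, height >> level);
        UINT blocksWide = (levelW + 3) / 4;
        UINT blocksHigh = (levelH + 3) / 4;
        UINT srcPitch   = blocksWide * 16;

        const BYTE* src = atlasMipData[level];
        BYTE*       dst = static_cast<BYTE*>(rect.pBits);
        for (UINT row = 0; row < blocksHigh; ++row)
            std::memcpy(dst + row * rect.Pitch, src + row * srcPitch, srcPitch);

        (*outTexture)->UnlockRect(level);
    }
    return D3D_OK;
}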
No, SwiftShader should support all mipmapping features. Lots of other games manually set the mipmap levels without any issues. Also, to my knowledge all WHQL tests related to mipmapping pass. It must be some corner case...
If you want to test the game, you can download the game demo from www.redlynxtrials.com.
Thanks!
 
More bugs: http://bohemiq.scali.eu.org/forum/viewtopic.php?t=30

It lights both the inside and the outside of the object. The inside faces are backfaces and therefore should not be lit; verify against hardware or refrast.
Huh? Could you point me to the SDK documentation that describes this? Should be an easy fix if I knew what it's supposed to do... Looks like some legacy T&L thing. Anyway, thanks for pointing it out.
Looks fine to me aside from the shadow. What technique are you using?
 
You're right, but does it really matter?

Well in a way it does. These cards were notorious because they couldn't actually run SM2.0 games at acceptable framerates. I don't think it's a good way to promote your product by comparing it to another product that is equally incapable of running SM2.0 software.

Well, sure, but then the game needs a Shader Model 1.x path, and you miss all the Shader Model 2.0 effects.

Back when these cards were on the market, games were primarily SM1.x anyway, so that was not a problem. SM2.0 was mainly used to improve overall image quality a bit, and reduce the number of renderpasses in some cases. With a game like Half-Life 2, the difference between the PS1.4-path and the PS2.0-path was only minor and quite subtle (just like with Crysis today, with DX9 vs DX10... the difference is there, but it's not like you notice it instantly, you pretty much have to take screenshots and do a side-by-side comparison). Yet the performance difference on the FX series was dramatic.

So the saving grace of the FX series was that it still ran SM1.x stuff properly.

Interesting, but not really a big surprise. Obviously you can optimize the processing for exactly the things you need, while with SwiftShader a detour is taken with DirectX shaders.

However, it might be wise to have the shader level user-configurable (I'm not sure if I can put in version 0 to force fixed-function T&L only?). That way you can pick the path that makes the software run fastest. Get the detour that suits you best, so to speak.

Again, impressive for a Java renderer, but totally incomparable. First of all, it's not the same application. And because you wrote both the application and the renderer, it can be a lot faster than a software renderer that can render Crysis, Unreal Tournament 3, Call of Duty 4, etc., which are aimed at hardware rendering.

I think it's good to know just how big that gap is though. Not just in terms of performance, but the Java version also doesn't have any popping artifacts or anything (they use the exact same 3dsmax model and pretty much identical skinning code as far as is possible between C++ and Java).
 
Huh? Could you point me to the SDK documentation that describes this? Should be an easy fix if I knew what it's supposed to do... Looks like some legacy T&L thing. Anyway, thanks for pointing it out.

I'm not sure if it's actually specified explicitly. Isn't it a result of the fact that vertex normals aren't flipped automatically, and lighting is clamped to the 0..1 range?
It's been years since I looked into that, but it's one of the rendering rules of Direct3D, since refrast does it that way, as does all the hardware I tested.
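If I understand the rule correctly, here is a minimal sketch of why that makes interior faces come out unlit (assuming standard one-sided per-vertex diffuse lighting, which is an assumption on my part):
Code:
#include <algorithm>

struct Vec3 { float x, y, z; };

static float Dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// N: vertex normal (used as-is, never auto-flipped), L: unit vector from vertex to light.
// For a back-facing (interior) surface the normal points away from the light, the dot
// product goes negative, and the clamp forces the diffuse term to 0 -- so the inside
// of the object should simply stay unlit.
float DiffuseTerm(const Vec3& N, const Vec3& L)
{
    return std::min(1.0f, std::max(0.0f, Dot(N, L)));
}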

Looks fine to me aside from the shadow. What technique are you using?

They're just some stencil shadows. I believe this one also rendered correctly in refrast. It renders correctly on all ATi, nVidia and Intel GPUs I've tested over the years at least, ranging from the GeForce2/Radeon 7000 series to today. So it's unlikely that there's some major bug in my code.
I have no idea what it is trying to render; it just looks like random polys for the stencil shadows. It also seems to render the light source itself wrongly. You see a bright highlight on all the walls (and also in the reflections on the spheres). With hardware you only see a highlight when the light source gets close to a wall.
 
It seems that the shared L2 cache of the C2D makes a huge difference here. Once I have to use the FSB, the performance starts to plummet.
Yep, the two dies on a shared external host (NB) communicate much more slowly.
Here is a simple illustration of the effect, using cache snooping:
Code:
Q6600_3825_cache2cache_latency
 
CPU0<->CPU1:       24.6 ns 
CPU0<->CPU2:      106.8 ns 
CPU0<->CPU3:      106.4 ns 
CPU1<->CPU2:      107.0 ns 
CPU1<->CPU3:      105.9 ns 
CPU2<->CPU3:       24.6 ns
From the numbers you can tell which CPUs (i.e. cores) reside on the same die (shared L2) and which are on the other one. And looking at all the possible snooping combinations listed above, it's clear that the chance of a thread/core having to hit the slower external FSB to reach a shared data set is higher than the chance of keeping all of its workload in its own cache.

Dunnington with its 16MB L3 would shine on that matter, I guess. ;)
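For anyone wondering how cache2cache-style numbers like these are typically obtained: pin two threads to the cores of interest and let them bounce a single cache line back and forth while timing the round trips. A generic sketch (not the actual tool; the core indices and iteration count are arbitrary):
Code:
#include <windows.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static std::atomic<int> token(0);          // the contended cache line

static void Bouncer(int coreIndex, int myValue, int iterations)
{
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << coreIndex);
    for (int i = 0; i < iterations; ++i)
    {
        while (token.load(std::memory_order_acquire) != myValue) { /* spin until the line arrives */ }
        token.store(1 - myValue, std::memory_order_release);  // hand the line to the other core
    }
}

int main()
{
    const int iterations = 1000000;
    auto start = std::chrono::steady_clock::now();

    std::thread a(Bouncer, 0, 0, iterations);   // e.g. core 0
    std::thread b(Bouncer, 2, 1, iterations);   // e.g. core 2 (the other die on a Q6600, per the table above)
    a.join();
    b.join();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    // Each iteration involves two transfers of the cache line, so divide accordingly.
    std::printf("~%.1f ns per cache-line transfer\n", double(ns) / (2.0 * iterations));
    return 0;
}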
 
Here's something else you can ponder about:
http://bohemiq.scali.eu.org/forum/viewtopic.php?t=35
This runs about 50% faster with the CPU vertex path than with the shader path on SwiftShader. There also seem to be some nasty 'popping' polygons with SwiftShader, where it seems to render properly on hardware.
I tried modifying the HLSL_SkinnedShadowVolume.vsh file to attempt to isolate the issue, but when I remove the transformation at the end the application crashes (inside Shaders.exe, using either software or hardware rendering).

Do you by any chance still have the source code for that demo? It would be very useful to freeze the animation at a frame where the "nasty" popping polygons occur. The (unmodified) demo runs correctly on REF so there's still a good chance there's also a SwiftShader bug. Thanks.
 
Those were Barcelonas with the TLB bug, right? Fixed versions of the chip may make a difference.

This was before a TLB fix was issued, so the performance should be identical to a fixed version; only the stability is worse, and that doesn't affect the test results (at least, it doesn't seem to in the other benchmarks they've run, if you compare them to newer benchmarks with fixed CPUs).
If anyone here has a fixed Barcelona or Phenom, perhaps they can support this with their own cache2cache results.
 
I tried modifying the HLSL_SkinnedShadowVolume.vsh file to attempt to isolate the issue, but when I remove the transformation at the end the application crashes (inside Shaders.exe, using either software or hardware rendering).

Well, you have to make sure that the shader still contains valid HLSL code and outputs in the proper format, else the validation of the compiler fails, which I probably didn't bother to handle.
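In other words, checking the compile result instead of assuming success would avoid that crash. Roughly like this (only the .vsh file name comes from this thread; the entry point and profile are assumptions):
Code:
#include <d3d9.h>
#include <d3dx9.h>
#include <cstdio>

IDirect3DVertexShader9* CompileVsh(IDirect3DDevice9* device)
{
    ID3DXBuffer* code   = NULL;
    ID3DXBuffer* errors = NULL;

    HRESULT hr = D3DXCompileShaderFromFileA(
        "HLSL_SkinnedShadowVolume.vsh", NULL, NULL,
        "main",      // hypothetical entry point
        "vs_2_0",    // assumed profile
        0, &code, &errors, NULL);

    if (FAILED(hr))
    {
        // A modified shader that no longer outputs in a valid format ends up here;
        // going on to use the NULL 'code' pointer is what produces the crash.
        if (errors)
        {
            std::printf("Shader compile failed:\n%s\n",
                        static_cast<const char*>(errors->GetBufferPointer()));
            errors->Release();
        }
        return NULL;
    }

    IDirect3DVertexShader9* shader = NULL;
    device->CreateVertexShader(static_cast<const DWORD*>(code->GetBufferPointer()), &shader);
    code->Release();
    if (errors) errors->Release();
    return shader;
}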

Do you by any chance still have the source code for that demo? It would be very useful to freeze the animation at a frame where the "nasty" popping polygons occur. The (unmodified) demo runs correctly on REF so there's still a good chance there's also a SwiftShader bug. Thanks.

I'd have to see... I may have kept the routines around, but I'd have to piece the actual demo back together (I abandoned this codebase a while ago and moved to DX10/VS2008).
I've also seen that, especially at low resolutions, 3DMark03's shadow volumes contain some 'cracks' with SwiftShader as well (which hardware doesn't seem to suffer from). This demo uses a pretty much identical technique. Perhaps your vertex pipeline runs at slightly lower precision than what D3D requires? Shadow volumes tend to have weird-shaped polygons anyway, putting a lot of stress on the precision of the whole transform-clip-rasterize process. Even projecting to w=0 has to work. Refrast does all this perfectly, but of course at the cost of performance.
It's a shame that the Doom3 engine is not D3D; it'd be a great test for shadow volume issues.
F.E.A.R. also uses shadow volumes quite extensively though, so you might try that. And Far Cry uses shadow volumes in some areas as well.
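For context, the w=0 case mentioned above comes from the common way shadow volumes are extruded 'to infinity' away from a point light. A generic sketch of that extrusion (not this demo's actual shader code):
Code:
// Vertices flagged as the shadow volume's back cap are pushed to infinity away from
// a point light (lightPos.w == 1) by producing a homogeneous coordinate with w == 0,
// i.e. a direction rather than a position. The whole transform/clip/rasterize pipeline
// therefore has to handle w = 0 exactly.
struct Vec4 { float x, y, z, w; };

Vec4 ExtrudeToInfinity(const Vec4& vertex, const Vec4& lightPos)
{
    Vec4 r;
    r.x = vertex.x * lightPos.w - lightPos.x * vertex.w;
    r.y = vertex.y * lightPos.w - lightPos.y * vertex.w;
    r.z = vertex.z * lightPos.w - lightPos.z * vertex.w;
    r.w = 0.0f;
    return r;
}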
 
Did you ever see cache2cache results of the Barcelona/Phenom architecture?
Check these numbers: http://www.anandtech.com/showdoc.aspx?i=3091&p=4

Basically going through Barcelona's L3-cache is about as slow as a Kentsfield going through the FSB. So much for 'native quadcore'.

OMG, that's atrociously bad, and somewhat hilarious in light of all the AMD 'true quad core' talk! AMD really needs to move away from exclusive caches to either inclusive or pseudo-inclusive caches. Exclusive caches rarely provide a benefit and come with all sorts of complications, wasted power and bandwidth, not to mention increased snoop latency. Kinda hilarious how much they hyped it.

Aaron Spink
speaking for myself inc.
 
Did you ever see cache2cache results of the Barcelona/Phenom architecture?
Check these numbers: http://www.anandtech.com/showdoc.aspx?i=3091&p=4

Basically going through Barcelona's L3-cache is about as slow as a Kentsfield going through the FSB. So much for 'native quadcore'.
The nanosecond measurement here is an absolute scale. If you take into consideration the clock rates of the two CPUs (2000 MHz for Barcelona and 3825 MHz for Kentsfield) and turn the nanoseconds into clock ticks, it turns out that the async NB & L3 stack in Barcelona takes 25% fewer cycles to pass a snoop request than the external NB on Kentsfield (but still far more than just reading the local shared L2). ;)
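To make that concrete with the Kentsfield figures quoted in this thread (the Barcelona latencies are in the AnandTech article and not repeated here, so only the Intel side is worked out; cycles = nanoseconds × clock rate in GHz):
Code:
#include <cstdio>

int main()
{
    // cycles = latency in ns * clock rate in GHz
    const double ghz = 3.825;                               // Kentsfield clock used for the table above
    std::printf("same die : %.0f cycles\n",  24.6 * ghz);   // ~94 cycles via the shared L2
    std::printf("cross die: %.0f cycles\n", 106.8 * ghz);   // ~409 cycles via the FSB
    return 0;
}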
 
Well in a way it does. These cards were notorious because they couldn't actually run SM2.0 games at acceptable framerates. I don't think it's a good way to promote your product by comparing it to another product that is equally incapable of running SM2.0 software.
I get your point, but still I think it's quite impressive that it matches that performance, even if it requires 'flawed' hardware to compare against. The GeForce FX series was designed by a team of professionals, while SwiftShader 2.0 was made with very limited resources (here cometh the excuses). Maybe it's flawed too and the next version will be significantly faster. :p

And for the record, TransGaming never made any explicit comparison against these or any other graphics cards. It's quite pointless anyway, because performance depends so much on the CPU and the application. So we leave the benchmarking to the public and potential clients and let them draw their own conclusions. I'm personally pleasantly surprised that it scores in the same ballpark for 3DMark03, but your mileage may vary.
Back when these cards were on the market, games were primarily SM1.x anyway, so that was not a problem. SM2.0 was mainly used to improve overall image quality a bit, and reduce the number of renderpasses in some cases. With a game like Half-Life 2, the difference between the PS1.4-path and the PS2.0-path was only minor and quite subtle (just like with Crysis today, with DX9 vs DX10... the difference is there, but it's not like you notice it instantly, you pretty much have to take screenshots and do a side-by-side comparison). Yet the performance difference on the FX series was dramatic.

So the saving grace of the FX series was that it still ran SM1.x stuff properly.
Absolutely, but now it's 2008 and casual games are starting to use Shader Model 2.0 (or would like to) while there's still a significant number of people out there with hardware that doesn't support it (properly). I believe SwiftShader is very fast at what it does, but actually supporting Shader Model 2.0 and nearly every other Direct3D 9 feature, entirely identically on every system, is what gives it most of its value in my opinion.

The whole comparison with hardware is very interesting but the only thing that really matters is whether or not the performance is adequate for the casual game in question. A large number of them render at a fixed resolution and have no settings to increase the quality. So as soon as the framerate is playable there's no reason to point out that hardware can do it faster. SwiftShader costs the end user nothing and especially casual gamers don't intend to upgrade just to play a simple game.
However, it might be wise to have the shader level user-configurable. That way you can pick the path that makes the software run fastest. Get the detour that suits you best, so to speak.
It is user configurable in the SwiftShader.ini file. You can also modify it by browsing to http://localhost:8080/swiftconfig while the game is running, but you'll need to restart the application to have it take effect (most other settings are interactive though).
 
Fixed it!

Nice work, and so quickly! Is the evaluation version on the website updated now, or could you give us an estimated time for the next release? We are currently evaluating SwiftShader for the game. Many of our customers have pretty fast business laptops (good CPU but bad GPU), and the game needs SM2.0 support with MRT rendering support (basically fully featured DX9 hardware).
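As an aside, the capability check a game like that would do at startup is straightforward under D3D9; a minimal sketch (the required MRT count of 4 is just an assumption for illustration, and SwiftShader's D3D9 DLL would be queried the same way as hardware):
Code:
#include <d3d9.h>

bool SupportsSm20WithMrt(IDirect3D9* d3d, UINT adapter, D3DDEVTYPE deviceType)
{
    D3DCAPS9 caps;
    if (FAILED(d3d->GetDeviceCaps(adapter, deviceType, &caps)))
        return false;

    // Require full SM2.0 shaders plus enough simultaneous render targets for the g-buffer.
    bool sm20 = caps.VertexShaderVersion >= D3DVS_VERSION(2, 0) &&
                caps.PixelShaderVersion  >= D3DPS_VERSION(2, 0);
    bool mrt  = caps.NumSimultaneousRTs >= 4;   // hypothetical g-buffer layout
    return sm20 && mrt;
}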
 
Absolutely, but now it's 2008 and casual games are starting to use Shader Model 2.0 (or would like to) while there's still a significant number of people out there with hardware that doesn't support it (properly). I believe SwiftShader is very fast at what it does, but actually supporting Shader Model 2.0 and nearly every other Direct3D 9 feature, entirely identically on every system, is what gives it most of its value in my opinion.

Well, I've always wondered about that. I think it's pretty safe to assume that you'll at least need a Core 2 Duo system to get any kind of reasonable SM2.0 performance out of SwiftShader, no matter how simple the graphics (even that DolphinVS demo doesn't run all that fast, and it doesn't get much simpler than that). The worst you could have in such a system is an Intel IGP, I suppose.
I happen to have one of those myself, an X3100, in my laptop with a 1.5 GHz Core 2 Duo processor. Not only is the X3100 significantly faster in most applications than SwiftShader is, it also seems to have much better DX8/DX9 driver support. Pretty much all DX8/DX9 demos that I threw at SwiftShader bugged out at one point or another, from rendering wrongly to just crashing altogether. Everything works flawlessly on the Intel. Now surely, the Intel drivers are far from perfect themselves, but my first impression is that they're still leaps and bounds ahead of SwiftShader.
Note also that the X3100 is technically an SM4.0 part. There are no DX10 drivers yet, but they were scheduled for Q1'08, so they should arrive any day now. That will make the installed base of DX10-capable computers explode, as Intel is one of the largest players in the GPU world. It may also have a positive effect on performance and compatibility, as DX10 has a much cleaner and simpler driver model, which leaves less room for bugs and suboptimal implementations. I can't wait to install those drivers and try Crysis... and my own DX10 stuff of course.
So am I correct that with my Intel-powered laptop and 'fast' dual-core I should be in your target market? In which case, why do I not feel like SwiftShader offers me anything over my poor cheap X3100? Well, it does generate more heat, and I get to swap the batteries more often, so in winter it's nice, and the forced pauses reduce RSI, I suppose.
 
The application attempts to create zero-area textures. SwiftShader allows this, but REF returns an error code (not described in the SDK documentation). After matching that behavior, the application still crashes when trying to use the null pointer returned.

Anyway, at least the corner case of creating zero-area textures has been fixed. Thanks for that. If you get the chance, could you please see why the application is attempting to create zero-area textures in the first place?
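On the application side, the defensive version of that texture creation would check the result before using it; a generic sketch:
Code:
#include <d3d9.h>

IDirect3DTexture9* CreateCheckedTexture(IDirect3DDevice9* device,
                                        UINT width, UINT height, UINT levels,
                                        D3DFORMAT format)
{
    if (width == 0 || height == 0)
        return NULL;   // zero-area textures are invalid; REF fails this call

    IDirect3DTexture9* texture = NULL;
    HRESULT hr = device->CreateTexture(width, height, levels, 0, format,
                                       D3DPOOL_MANAGED, &texture, NULL);
    if (FAILED(hr))
        return NULL;   // don't touch 'texture' on failure; that's the crash described above

    return texture;
}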
 