Why does Ati's stencil performance depend upon the CPU?

K.I.L.E.R

I've done some tests on games which use stencil shadows and shadow buffers, mainly Tenebrae and some of those other demos.

Needless to say, I find it quite strange that my performance more than doubles in those supposedly fillrate-limited situations.

Shouldn't they be entirely done on my video card?
I've got a Radeon 9700 Pro.

Maybe Ati's stencil performance depends upon the CPU because Ati were trying to save transistors?
 
3DMark03's GT2 & GT3 are heavily reliant on stencil shadowing and fairly invariant to CPU performance.
 
Tenebrae seems a peculiar case - as it's the only benchmark (with heavy use of stencil shadows) that halves in performance when going from No AA to 2x AA on my 9800 Pro.

After comparing the performance of my 9600 and my 9800 Pro (both set to no AA), the difference is in line with the clockrate and pipeline differences.

But I do wonder how much the CPU is required for Stencil Shadows - I haven't tested this aspect at all.
 
There were some stencilling peculiarities with 9700 and HierZ that were sorted out on later generations (and wouldn't affect 9600 as there is no HierZ).
 
Well, stencil shadows can create an awful lot of geometry (and that can consume a lot of CPU). It all has to be created dynamically and fed to the card.
 
PeterAce said:
But I do wonder how much the CPU is required for Stencil Shadows - I haven't tested this aspect at all.

It depends how you generate the shadow volumes.
If on the CPU - then the answer is very.

I'd guess Tenebrae and Doom3 are generating them on the CPU.
3DMark03 GT2&3 are generating them using a vertex shader on the GPU.

Games use the CPU method because it's faster; 3DMark uses the GPU method because it allows the benchmark to be CPU-independent.
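
For readers wondering what "generating the shadow volumes on the CPU" involves, here is a minimal sketch of the usual first step, finding the silhouette edges of a mesh as seen from a light. This is the generic textbook approach, not code from Tenebrae, Doom3 or 3DMark03; the function name, the tiny tetrahedron mesh and the light position are all made up for illustration, and it assumes a closed mesh with consistently wound triangles.

```python
from collections import defaultdict


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def silhouette_edges(vertices, triangles, light_pos):
    """Return edges shared by one light-facing and one back-facing triangle."""
    edge_facing = defaultdict(list)
    for i0, i1, i2 in triangles:
        v0, v1, v2 = vertices[i0], vertices[i1], vertices[i2]
        normal = cross(sub(v1, v0), sub(v2, v0))
        # A triangle faces the light if the light lies on the side its normal points to.
        faces_light = dot(normal, sub(light_pos, v0)) > 0.0
        for a, b in ((i0, i1), (i1, i2), (i2, i0)):
            edge_facing[(min(a, b), max(a, b))].append(faces_light)
    # An edge is on the silhouette when its two triangles disagree about facing the light.
    return sorted(edge for edge, facing in edge_facing.items()
                  if len(facing) == 2 and facing[0] != facing[1])


# Hypothetical test mesh: a tetrahedron with outward-facing triangles.
VERTS = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
TRIS = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
LIGHT = (0.2, 0.2, -5.0)  # below the z = 0 face

print(silhouette_edges(VERTS, TRIS, LIGHT))
```

A real engine would then extrude each silhouette edge away from the light into a quad and submit those quads to the card. The loop above touches every triangle once per light for anything that moves, which is where the CPU cost comes from.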
 
Hm, I'd be somewhat hesitant to consider Tenebrae in any way useful for benchmarking needs, not more so than say DX:IW or T3:DS (ie. it's inefficient and poorly coded as hell). Also there have been some accusations of it being "just some nVidia techdemos thrown together", I don't know if it's just clueless BS though... :?:
 
anaqer said:
Hm, I'd be somewhat hesitant to consider Tenebrae in any way useful for benchmarking needs, not more so than say DX:IW or T3:DS (ie. it's inefficient and poorly coded as hell). Also there have been some accusations of it being "just some nVidia techdemos thrown together", I don't know if it's just clueless BS though... :?:

As part of the Tenebrae team I'm not entirely impartial in this, but as the text above was pretty harsh, I decided to put my spoon in the soup...

The reason Tenebrae 1 is pretty cpu heavy is mostly due to the legacy of GLQuake's style of rendering practically everything in immediate mode, which is very inefficient for modern GPUs. That makes benchmarking with Tenebrae, even with the same code paths (which would be recommended anyway, unless you want to compare the effects of different paths), in large part benchmarking the driver's handling of immediate mode. Also the different paths on the rendering side were mostly patched on later, which is why things like transforming lights and eye vectors to tangent space are done on CPU before sending stuff to GPU. Still, I wouldn't call it NVidia tech demos thrown together as all the code is made for this project and it usually performs better on ATI hardware ;)

In Tenebrae 2 the rendering side is much improved and we use vertex buffer objects and indexed primitives and what not to reduce the driver bottleneck. We still aren't quite done with the batching thing, but it's getting pretty good. Also the GPU is doing more in vertex shaders so the CPU is relieved of some of the grunt work. Shadow volumes are still generated purely on the CPU though, as there are optimizations you can't quite do on the GPU yet.

For perspective, in a recent AMD CodeAnalyst (nice tool, pretty much like VTune and free to boot) run I did for both projects on an Athlon 64 3200+, 9800 Pro and Catalyst 4.7 beta, the CPU time division was like this:

Tenebrae 1:
43.8% atioglxx.dll
38.3% tenebrae.exe
8.0% ati2cqag.dll
2.3% opengl32.dll
1.8% hal.dll
5.8% others

Tenebrae 2:
44.0% atioglxx.dll
25.0% ati2cqag.dll
13.2% tenebrae2.exe
7.7% hal.dll
2.7% opengl32.dll
7.4% others
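
To put the two profiles side by side, here is a small sketch that groups the modules into driver code and game code. It assumes the two ati*.dll modules are ATI driver components (the post does not say so explicitly); the dictionary and function names are made up for illustration.

```python
# CPU time share per module in percent, copied from the two profiles above.
TENEBRAE_1 = {
    "atioglxx.dll": 43.8, "tenebrae.exe": 38.3, "ati2cqag.dll": 8.0,
    "opengl32.dll": 2.3, "hal.dll": 1.8, "others": 5.8,
}
TENEBRAE_2 = {
    "atioglxx.dll": 44.0, "ati2cqag.dll": 25.0, "tenebrae2.exe": 13.2,
    "hal.dll": 7.7, "opengl32.dll": 2.7, "others": 7.4,
}


def driver_share(profile):
    # Assumption: every module whose name starts with "ati" belongs to the driver.
    return round(sum(pct for name, pct in profile.items()
                     if name.startswith("ati")), 1)


for label, profile, exe in (("Tenebrae 1", TENEBRAE_1, "tenebrae.exe"),
                            ("Tenebrae 2", TENEBRAE_2, "tenebrae2.exe")):
    print(f"{label}: driver {driver_share(profile)}%, game code {profile[exe]}%")
```

Under that assumption the driver share goes from 51.8% to 69.0%, while the game executable's share drops from 38.3% to 13.2%.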

Something like NVPerfHud (if it was available for OpenGL) would be an interesting tool to use as well (hint hint ;))
 
PeterAce said:
Tenebrae seems a peculiar case - as it's the only benchmark (with heavy use of stencil shadows) that halves in performance when going from No AA to 2x AA on my 9800 Pro.
NWN also has poor FSAA performance on the R3xx (much more of a performance drop with FSAA than with my GeForce4). The reason is probably that the Radeon's Hyper-Z is disabled if the stencil buffer is cleared but the depth buffer is not. I believe it's pretty much necessary to clear the stencil buffer multiple times to render multiple lights when doing DOOM3-style stencil shadowing, so this seems to be a common case.

Anyway, yes, there's a big CPU hit for stencil shadows when the shadow volumes are calculated on the CPU.
 
Chalnoth said:
PeterAce said:
Tenebrae seems a peculiar case - as it's the only benchmark (with heavy use of stencil shadows) that halves in performance when going from No AA to 2x AA on my 9800 Pro.
NWN also has poor FSAA performance on the R3xx (much more of a performance drop with FSAA than with my GeForce4). The reason is probably that the Radeon's Hyper-Z is disabled if the stencil buffer is cleared but the depth buffer is not.

Both NWN and Tenebrae suffered from the same flaw in using the stencil and Z-buffer clears, though at least we have fixed it since ;) Clearing Z and stencil should be done together whenever possible, which we weren't doing in all cases.

I did some small benchmarking as this got me interested, with my current build (you can get practically the same version from my page at http://www.modeemi.fi/~jpaana/tenebrae/, but without SSE2 usage)
and Athlon 64 3200+, 9800Pro with XT bios (R360 core), timedemo demo1:

Code:
Res:         0x AA   2x AA   4x AA
640x400      124.9   113.9   82.1
1600x1200    28.0    24.1    16.5


Chalnoth said:
I believe it's pretty much necessary to clear the stencil buffer multiple times to render multiple lights when doing DOOM3-style stencil shadowing, so this seems to be a common case.
Yes, in the worst case you have to clear the stencil buffer after every light.

Chalnoth said:
Anyway, yes, there's a big CPU hit for stencil shadows when the shadow volumes are calculated on the CPU.

Not really, see my post above: in Tenebrae 2 only about 13% of CPU time goes to the WHOLE game code (all rendering etc.), and that includes the shadow volume calculations, so I wouldn't call it a big hit. Plus you probably win overall with the saved fillrate.
 
jpaana said:
Not really, see my post above: in Tenebrae 2 only about 13% of CPU time goes to the WHOLE game code (all rendering etc.), and that includes the shadow volume calculations, so I wouldn't call it a big hit. Plus you probably win overall with the saved fillrate.
Well, Quake1 has pretty simple geometry, too.
 
Chalnoth said:
jpaana said:
Not really, see my post above: in Tenebrae 2 only about 13% of CPU time goes to the WHOLE game code (all rendering etc.), and that includes the shadow volume calculations, so I wouldn't call it a big hit. Plus you probably win overall with the saved fillrate.
Well, Quake1 has pretty simple geometry, too.

That's true, but on the other hand increasing geometry complexity also makes the optimizations possible on the CPU more effective, so that kind of balances things out. In the end I'd say it's pretty much a case-by-case thing depending on factors like the ratio of static to dynamic volumes, average number of lights, LOD techniques used etc., so a broad statement like that would most probably be inaccurate at best.
 
Well your updated version is a lot faster (esp when using AA).

It looks like you've fixed the stencil shadow problem.

Thanks, you just made my day ;)

Here are some benchmarks, system used :

AMD Athlon XP 2800+ (Barton)
Epox EP-8KRA2+ Motherboard (KT600)
1024 Megabytes RAM (333 DDR)
Sapphire ATI Radeon 9800 Pro 128MB (AGP 8x)
120 GB 7200 RPM Maxtor Hard Disk Drive 8MB Cache (Serial ATA-150)
Realtek ALC655 Sound card

ATI Catalyst 4.5

Map : DEMO1

Code:
Tenebrae 1.04

No AA : 27.7
2X AA : 17.3
4X AA : 9.8

Tenebrae Updated

No AA : 36.4
2X AA : 32.0
4X AA : 22.4

Performance increase between old and new Tenebrae :

No AA : 31.4%
2X AA : 84.97%
4X AA : 128.57%

EDIT to correct calculations : Thanks Chalnoth, I have a cold and my maths is a little off today ;)
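
For anyone who wants to check the percentages, a quick sketch of the arithmetic: the increase is (new - old) / old, using the fps figures from the table above. The function and variable names are just illustrative.

```python
def percent_increase(old_fps, new_fps):
    """Relative gain of new_fps over old_fps, in percent."""
    return (new_fps - old_fps) / old_fps * 100.0


# Timedemo results from the post above (demo1).
old = {"No AA": 27.7, "2X AA": 17.3, "4X AA": 9.8}   # Tenebrae 1.04
new = {"No AA": 36.4, "2X AA": 32.0, "4X AA": 22.4}  # updated build

gains = {mode: round(percent_increase(old[mode], new[mode]), 2) for mode in old}

for mode, gain in gains.items():
    print(f"{mode} : {gain}%")
```

A gain above 100% means the frame rate more than doubled, which is the case at 4X AA (9.8 to 22.4).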
 
Um, you calculated your performance increases incorrectly. At 4x, for example, the increase was more than 100% (double).
 
PeterAce said:
Well your updated version is a lot faster (esp when using AA).

It looks like you've fixed the stencil shadow problem.

Thanks, you just made my day ;)

Yup, we should probably make a new release. I didn't realize it was so much faster now, as I haven't run the 1.04 release for ages ;)
 