Dynamic Branching Benchmark

JeGX

Hi all!

Here is a small OpenGL benchmark I coded for my graphics card tests:
http://www.ozone3d.net/demos_projects/soft_shadows_benchmark.php

This benchmark is focused on the pixel processing unit: soft shadows rendering
with a 7x7 filter. The bench lasts 1 minute. You can enable or disable the dynamic
branching in the pixel shader in order to see the impact of branching.

Here are some results:
X1950XTX / C6.9 / Branching OFF : 1805 o3Marks
X1950XTX / C6.9 / Branching ON : 3634 o3Marks

7950GX2 / FW91.47 / Branching OFF : 2306 o3Marks
7950GX2 / FW91.47 / Branching ON : 2125 o3Marks

7600GS / FW97.28 / Branching OFF : 373 o3Marks
7600GS / FW97.28 / Branching ON : 352 o3Marks

6600GT / FW97.28 / Branching OFF : 391 o3Marks
6600GT / FW97.28 / Branching ON : 603 o3Marks

6800GT / FW97.28 / Branching OFF : 583 o3Marks
6800GT / FW97.28 / Branching ON : 821 o3Marks

We can see the benefit of dynamic branching: on the Radeon R5xx, performance is doubled (ratio = 2.0), but on G7x it stays almost the same (ratio = 0.9).

The weird thing is that on the NV4x architecture (6600GT/6800GT), branching in the pixel shader seems to be
more efficient (ratio = 1.4) than on G7x. Does it come from NV4x's threads containing fewer pixels?
If anyone has an explanation...

Thanks,
JeGX
 
I'm not sure that I'd jump to any conclusions yet... it's possible that the 8800 is simply bottlenecked elsewhere when DB is enabled.

Specifically what is the demo doing? What is it branching on? Is the "7x7" convolution filter PCF, or some post-processing screen-space shader?
 
Specifically what is the demo doing? What is it branching on? Is the "7x7" convolution filter PCF, or some post-processing screen-space shader?

The source code of the pixel shader (ps_7x7_bluring_kernel_v3b_tex.glsl) is in the data/ directory.
Here is the branching code:

Code:
// Filter only fragments in the penumbra region: the product is zero
// for fully lit (shadowColor == 1.0), fully shadowed (shadowColor == 0.0)
// and back-facing (lambertTerm == 0.0) fragments.
if( (shadowColor - 1.0) * shadowColor * lambertTerm != 0.0 )
{
	float kernel_sum = 0.0;
	shadowColor = 0.0;

	// KERNEL_SIZE (7x7 = 49) shadow map lookups, weighted by the kernel.
	for( int i = 0; i < KERNEL_SIZE; i++ )
	{
		kernel_sum += kernel[i];
		shadowColor += shadow2D(shadowMap, shadowUV.xyz + offset[i]).x * kernel[i];
	}

	shadowColor /= kernel_sum;
}

In short, if the current fragment lies on the edge of the shadow, the pixel shader
performs the shadow map filtering (KERNEL_SIZE texture lookups) with a mean filter.
The basic idea comes from an NVIDIA demo.
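To see why that predicate selects only the penumbra, here is a minimal sketch of the same test in Python (the function name is mine, not from the demo): the product is non-zero only when all three factors are non-zero.

```python
def needs_filtering(shadow_color: float, lambert_term: float) -> bool:
    """Mimics the shader branch predicate.

    (s - 1) * s * lambert is non-zero only when the fragment is
    neither fully lit (s == 1), fully shadowed (s == 0), nor
    back-facing (lambert == 0) -- i.e. only on shadow edges.
    """
    return (shadow_color - 1.0) * shadow_color * lambert_term != 0.0

# Fully lit, fully shadowed and back-facing fragments skip the 49-tap filter.
assert not needs_filtering(1.0, 0.8)   # fully lit
assert not needs_filtering(0.0, 0.8)   # fully shadowed
assert not needs_filtering(0.5, 0.0)   # back-facing
assert needs_filtering(0.5, 0.8)       # penumbra: do the full filter
```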
 
Ahh...

No UI pops up; it just launches straight into the benchmark, so I can't get any data.
 
Results from my system. I took these with 16x CSAA and 16x AF enabled (except where noted).

EVGA 680I SLI
E6300 @ 2.8 Ghz
Geforce 8800GTX SLI @ Stock Clocks.


No SLI. 16xCSAA/16xAF

Code:
No Branching   : 3170
With Branching : 3965

With SLI 16xCSAA/16xAF

Code:
No Branching : 6098
Branching    : 7568

With SLI, AA/AF disabled.

Code:
No Branching : 7946
Branching    : 10328
 
I get the same. The readme says it's version 1.5 though, and the exe is dated 7 Jan.

I had this problem as well - I just forced Windows to run the program in 640x480, which does bring up the frontend popup. The benchmarking portion will still run at 1280x1024, so you can get the results.
 
Cool, that fixed it :)
DB off 1108
DB on 2254

E6600 @stock
X1900GT also @stock, cat 6.12

Some funny artifacting where two shadows overlap with DB on, though.
The artifacting is not present in the DB off case.

Edit: screenshot; it looks particularly bad where the rotating doughnuts are approaching and leaving the area where the two sides' shadows meet, and in the middle under the balls.
 
I had this problem as well - I just forced Windows to run the program in 640x480, [...]
How did you do that?

edit: Win2k compatibility mode also seems to work to get the GUI to show again...
 
The source code of the pixel shader (ps_7x7_bluring_kernel_v3b_tex.glsl) is in the data/ directory.
Here is the branching code:
<SNIP>
In short, if the current fragment lies on the edge of the shadow, the pixel shader
performs the shadow map filtering (KERNEL_SIZE texture lookups) with a mean filter.
The basic idea comes from an NVIDIA demo.

A cool thing to try is vary the kernel radius with the distance to the nearest occluder :) Or you can take into account the distance to the light as well and estimate the penumbra size. As arrrse has pointed out it's a tricky technique to make 100% robust!
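That penumbra estimate can be sketched with the usual similar-triangles formula (PCSS-style). The function and parameter names below are illustrative, not from the demo: the kernel radius grows with the receiver-to-blocker separation and the light size.

```python
def penumbra_width(receiver_depth: float, blocker_depth: float,
                   light_size: float) -> float:
    """Similar-triangles penumbra estimate:
    w = (d_receiver - d_blocker) / d_blocker * light_size.
    A receiver far behind its blocker gets a wider, softer edge."""
    return (receiver_depth - blocker_depth) / blocker_depth * light_size

# The filter kernel radius would then be scaled by this width.
w_near = penumbra_width(1.1, 1.0, 0.5)  # receiver just behind blocker
w_far  = penumbra_width(2.0, 1.0, 0.5)  # receiver far behind blocker
assert w_far > w_near                   # farther receiver -> softer shadow
```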
 
GF7800 AGP FW 93.71
DBranching on: 591
DBranching off: 635

:(
 
Here's an image showing roughly the amount of work done at each pixel (alternatively, the amount avoided by branching); brighter means more shadow map reads: Branching.

A few notes:

1) ATI is potentially doing less work here as it does not support hardware bilinear PCF. Note the blocky shadows. In any case for custom PCF kernels like this it's best to just do the bilinear weighting of the border pixels yourself and avoid the redundant reads of doing a bilinear lookup at every pixel in the kernel. You can even get fancy and manipulate the bilinear fraction based on your kernel weightings, but that may or may not be worth it.

2) I wouldn't necessarily conclude that ATI's branching is more efficient than the 8800 series, as 49 texture reads might not necessarily be enough to totally bottleneck an 8800 (in my experience) :) In particular it looks like you're doing multipass lighting which is a big vertex transform (perhaps not in this scene) and fill rate killer.

3) The method used to "detect whether you're in a shadow edge" region is a bit error-prone. I remember the original NVIDIA paper and it was error-prone then as well ;) It's certainly possible to get a more continuous and conservative estimate (VSM will give you it for example) and that might play even nicer with the branching hardware in both cards.

That said, interesting demo and results. I suspect comparing ATI/NVIDIA though we're more testing texture read/bandwidth than dynamic branching efficiency. The latter is probably best tested by something synthetic like GPUBench due to the large number of potentially confounded factors.
 
A cool thing to try is vary the kernel radius with the distance to the nearest occluder :) Or you can take into account the distance to the light as well and estimate the penumbra size. As arrrse has pointed out it's a tricky technique to make 100% robust!

Thanks for the tip, pocketmoon66. I will look at it when I add soft shadows to my main software.
 
1) ATI is potentially doing less work here as it does not support hardware bilinear PCF.

You're right AndyTX, that's why I set the shadow map filtering to GL_NEAREST in order to disable nVidia PCF.

Note the blocky shadows. In any case for custom PCF kernels like this it's best to just do the bilinear weighting of the border pixels yourself and avoid the redundant reads of doing a bilinear lookup at every pixel in the kernel. You can even get fancy and manipulate the bilinear fraction based on your kernel weightings, but that may or may not be worth it.

Yes, with my simple mean filter the shadow is a bit blocky:
shadow_map_filtering_nearest_20070109.jpg


But with the help of the hardware (NVIDIA only, GL_LINEAR) and with the same 7x7 filter, the shadow is better:
shadow_map_filtering_linear_20070109.jpg
 
You're right AndyTX, that's why I set the shadow map filtering to GL_NEAREST in order to disable nVidia PCF.
Ok, cool. I didn't have an NVIDIA machine here handy to test with :)

But with the help of the hardware (nvidia only - GL_LINEAR ) and with the same 7x7 filter, the shadow is better:
Sure. What I'm saying is that you can get the same linear filtering by sampling an 8x8 rectangle and applying the proper bilinear weights to the border pixels (all of the interior pixel bilinear weights will sum to one). This will work on all cards and for sufficiently large filter regions will be as fast or faster than hardware bilinear (i.e. GL_LINEAR).
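The equivalence is easy to check numerically in 1D: averaging N bilinear (GL_LINEAR) taps at a fixed fractional offset equals reading N+1 texels with GL_NEAREST, weighting the interior texels by 1 and the two border texels by the bilinear fractions. A quick sketch (helper names are mine):

```python
import random

def mean_of_bilinear_taps(texels, n_taps, frac):
    """Average of n_taps GL_LINEAR-style lookups at fractional offset frac."""
    total = 0.0
    for i in range(n_taps):
        total += (1.0 - frac) * texels[i] + frac * texels[i + 1]
    return total / n_taps

def border_weighted_nearest(texels, n_taps, frac):
    """Same result from n_taps + 1 GL_NEAREST reads: interior texels get
    weight 1 (their bilinear weights sum to one), the two border texels
    get (1 - frac) and frac respectively."""
    weights = [1.0 - frac] + [1.0] * (n_taps - 1) + [frac]
    return sum(w * t for w, t in zip(weights, texels)) / n_taps

random.seed(0)
row = [random.random() for _ in range(8)]   # one 8-texel shadow map row
a = mean_of_bilinear_taps(row, 7, 0.3)      # 7 bilinear taps
b = border_weighted_nearest(row, 7, 0.3)    # 8 nearest reads, border-weighted
assert abs(a - b) < 1e-12
```

The same factoring applies per row/column of a 7x7 kernel, which is why the 8x8 nearest-read version works on hardware without bilinear PCF.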
 
Couldn't Fetch4 help you to some extent in this particular situation?
 