Dynamic Branching Benchmark

JeGX

Hi all!

Here is a small OpenGL benchmark I coded for my graphics card tests:
http://www.ozone3d.net/demos_projects/soft_shadows_benchmark.php

This benchmark is focused on the pixel processing unit: soft shadows rendering
with a 7x7 filter. The bench lasts 1 minute. You can enable or disable the dynamic
branching in the pixel shader in order to see the impact of branching.

Here are some results:
X1950XTX / C6.9 / Branching OFF : 1805 o3Marks
X1950XTX / C6.9 / Branching ON : 3634 o3Marks

7950GX2 / FW91.47 / Branching OFF : 2306 o3Marks
7950GX2 / FW91.47 / Branching ON : 2125 o3Marks

7600GS / FW97.28 / Branching OFF : 373 o3Marks
7600GS / FW97.28 / Branching ON : 352 o3Marks

6600GT / FW97.28 / Branching OFF : 391 o3Marks
6600GT / FW97.28 / Branching ON : 603 o3Marks

6800GT / FW97.28 / Branching OFF : 583 o3Marks
6800GT / FW97.28 / Branching ON : 821 o3Marks

We can see the benefit of dynamic branching: on the Radeon R5xx, performance is doubled (ratio = 2.0), but on G7x it stays almost the same (ratio = 0.9).

The weird thing is that on the NV4x architecture (6600GT/6800GT), branching in the pixel shader seems to be
more efficient (ratio = 1.4) than on G7x. Does it come from NV4x's threads containing fewer pixels?
If anyone has an explanation...

Thanks,
JeGX
 
I'm not sure that I'd jump to any conclusions yet... it's possible that the 8800 is simply bottlenecked elsewhere when DB is enabled.

Specifically what is the demo doing? What is it branching on? Is the "7x7" convolution filter PCF, or some post-processing screen-space shader?
 
Specifically what is the demo doing? What is it branching on? Is the "7x7" convolution filter PCF, or some post-processing screen-space shader?

The source code of the pixel shader (ps_7x7_bluring_kernel_v3b_tex.glsl) is in the data/ directory.
Here is the branching code:

Code:
// Filter only fragments in the penumbra region: the product is zero
// for fully lit (shadowColor == 1.0), fully shadowed (shadowColor == 0.0)
// and back-facing (lambertTerm == 0.0) fragments.
if( (shadowColor - 1.0) * shadowColor * lambertTerm != 0.0 )
{
	float kernel_sum = 0.0;
	shadowColor = 0.0;

	// KERNEL_SIZE (7x7 = 49) shadow map lookups, weighted by the kernel.
	for( int i = 0; i < KERNEL_SIZE; i++ )
	{
		kernel_sum += kernel[i];
		shadowColor += shadow2D(shadowMap, shadowUV.xyz + offset[i]).x * kernel[i];
	}

	shadowColor /= kernel_sum;
}

In short, if the current fragment lies on the edge of the shadow, the pixel shader
performs the shadow map filtering (KERNEL_SIZE texture lookups) with a mean filter.
The basic idea comes from an NVIDIA demo.
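To see why that predicate selects only the penumbra, here is a minimal sketch of the same test in Python (the function name is mine, not from the demo): the product is non-zero only when all three factors are non-zero.

```python
def needs_filtering(shadow_color: float, lambert_term: float) -> bool:
    """Mimics the shader branch predicate.

    (s - 1) * s * lambert is non-zero only when the fragment is
    neither fully lit (s == 1), fully shadowed (s == 0), nor
    back-facing (lambert == 0) -- i.e. only on shadow edges.
    """
    return (shadow_color - 1.0) * shadow_color * lambert_term != 0.0

# Fully lit, fully shadowed and back-facing fragments skip the 49-tap filter.
assert not needs_filtering(1.0, 0.8)   # fully lit
assert not needs_filtering(0.0, 0.8)   # fully shadowed
assert not needs_filtering(0.5, 0.0)   # back-facing
assert needs_filtering(0.5, 0.8)       # penumbra: do the full filter
```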
 
Ahh...

No UI pops up; it just launches straight into the benchmark, so I can't get any data.
 
Results from my system. I took these with 16x CSAA and 16x AF enabled (except where noted).

EVGA 680I SLI
E6300 @ 2.8 Ghz
Geforce 8800GTX SLI @ Stock Clocks.


No SLI. 16xCSAA/16xAF

Code:
No Branching   : 3170
With Branching : 3965

With SLI 16xCSAA/16xAF

Code:
No Branching : 6098
Branching    : 7568

With SLI, AA/AF disabled.

Code:
No Branching : 7946
Branching    : 10328
 
I get the same. The readme says it's version 1.5 though, and the exe is dated 7 Jan.

I had this problem as well - I just forced Windows to run the program in 640x480, which does bring up the frontend popup. The benchmarking portion will still run at 1280x1024, so you can get the results.
 
Cool, that fixed it :)
DB off 1108
DB on 2254

E6600 @stock
X1900GT also @stock, cat 6.12

Some funny artifacting where two shadows overlap with DB on, though.
The artifacting is not present in the DB off case.

Edit: screenshot; it looks particularly bad where the rotating doughnuts are approaching and leaving the area where the two sides' shadows meet, and in the middle under the balls.
 
I had this problem as well - I just forced Windows to run the program in 640x480, [...]
How did you do that?

edit: Win2k compatibility mode also seems to work to get the GUI to show again...
 
The source code of the pixel shader (ps_7x7_bluring_kernel_v3b_tex.glsl) is in the data/ directory.
Here is the branching code:
<SNIP>
In short, if the current fragment lies on the edge of the shadow, the pixel shader
performs the shadow map filtering (KERNEL_SIZE texture lookups) with a mean filter.
The basic idea comes from an NVIDIA demo.

A cool thing to try is vary the kernel radius with the distance to the nearest occluder :) Or you can take into account the distance to the light as well and estimate the penumbra size. As arrrse has pointed out it's a tricky technique to make 100% robust!
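That penumbra estimate can be sketched with the usual similar-triangles formula (PCSS-style). The function and parameter names below are illustrative, not from the demo: the kernel radius grows with the receiver-to-blocker separation and the light size.

```python
def penumbra_width(receiver_depth: float, blocker_depth: float,
                   light_size: float) -> float:
    """Similar-triangles penumbra estimate:
    w = (d_receiver - d_blocker) / d_blocker * light_size.
    A receiver far behind its blocker gets a wider, softer edge."""
    return (receiver_depth - blocker_depth) / blocker_depth * light_size

# The filter kernel radius would then be scaled by this width.
w_near = penumbra_width(1.1, 1.0, 0.5)  # receiver just behind blocker
w_far  = penumbra_width(2.0, 1.0, 0.5)  # receiver far behind blocker
assert w_far > w_near                   # farther receiver -> softer shadow
```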
 
GF7800 AGP FW 93.71
DBranching on: 591
DBranching off: 635

:(
 
Here's an image showing roughly the amount of work done at each pixel (alternatively, the amount avoided by branching); brighter means more shadow map reads: Branching.

A few notes:

1) ATI is potentially doing less work here as it does not support hardware bilinear PCF. Note the blocky shadows. In any case for custom PCF kernels like this it's best to just do the bilinear weighting of the border pixels yourself and avoid the redundant reads of doing a bilinear lookup at every pixel in the kernel. You can even get fancy and manipulate the bilinear fraction based on your kernel weightings, but that may or may not be worth it.

2) I wouldn't necessarily conclude that ATI's branching is more efficient than the 8800 series, as 49 texture reads might not necessarily be enough to totally bottleneck an 8800 (in my experience) :) In particular it looks like you're doing multipass lighting which is a big vertex transform (perhaps not in this scene) and fill rate killer.

3) The method used to "detect whether you're in a shadow edge" region is a bit error-prone. I remember the original NVIDIA paper and it was error-prone then as well ;) It's certainly possible to get a more continuous and conservative estimate (VSM will give you it for example) and that might play even nicer with the branching hardware in both cards.

That said, interesting demo and results. I suspect comparing ATI/NVIDIA though we're more testing texture read/bandwidth than dynamic branching efficiency. The latter is probably best tested by something synthetic like GPUBench due to the large number of potentially confounded factors.
 
A cool thing to try is vary the kernel radius with the distance to the nearest occluder :) Or you can take into account the distance to the light as well and estimate the penumbra size. As arrrse has pointed out it's a tricky technique to make 100% robust!

Thanks for the tip, pocketmoon66. I will look at it when I add soft shadows to my main software.
 
1) ATI is potentially doing less work here as it does not support hardware bilinear PCF.

You're right AndyTX, that's why I set the shadow map filtering to GL_NEAREST in order to disable nVidia PCF.

Note the blocky shadows. In any case for custom PCF kernels like this it's best to just do the bilinear weighting of the border pixels yourself and avoid the redundant reads of doing a bilinear lookup at every pixel in the kernel. You can even get fancy and manipulate the bilinear fraction based on your kernel weightings, but that may or may not be worth it.

Yes, with my simple mean filter the shadow is a bit blocky:
shadow_map_filtering_nearest_20070109.jpg


But with the help of the hardware (NVIDIA only, GL_LINEAR) and with the same 7x7 filter, the shadow is better:
shadow_map_filtering_linear_20070109.jpg
 
You're right AndyTX, that's why I set the shadow map filtering to GL_NEAREST in order to disable nVidia PCF.
Ok, cool. I didn't have an NVIDIA machine here handy to test with :)

But with the help of the hardware (nvidia only - GL_LINEAR ) and with the same 7x7 filter, the shadow is better:
Sure. What I'm saying is that you can get the same linear filtering by sampling an 8x8 rectangle and applying the proper bilinear weights to the border pixels (all of the interior pixel bilinear weights will sum to one). This will work on all cards and for sufficiently large filter regions will be as fast or faster than hardware bilinear (i.e. GL_LINEAR).
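The equivalence is easy to check numerically in 1D: averaging N bilinear (GL_LINEAR) taps at a fixed fractional offset equals reading N+1 texels with GL_NEAREST, weighting the interior texels by 1 and the two border texels by the bilinear fractions. A quick sketch (helper names are mine):

```python
import random

def mean_of_bilinear_taps(texels, n_taps, frac):
    """Average of n_taps GL_LINEAR-style lookups at fractional offset frac."""
    total = 0.0
    for i in range(n_taps):
        total += (1.0 - frac) * texels[i] + frac * texels[i + 1]
    return total / n_taps

def border_weighted_nearest(texels, n_taps, frac):
    """Same result from n_taps + 1 GL_NEAREST reads: interior texels get
    weight 1 (their bilinear weights sum to one), the two border texels
    get (1 - frac) and frac respectively."""
    weights = [1.0 - frac] + [1.0] * (n_taps - 1) + [frac]
    return sum(w * t for w, t in zip(weights, texels)) / n_taps

random.seed(0)
row = [random.random() for _ in range(8)]   # one 8-texel shadow map row
a = mean_of_bilinear_taps(row, 7, 0.3)      # 7 bilinear taps
b = border_weighted_nearest(row, 7, 0.3)    # 8 nearest reads, border-weighted
assert abs(a - b) < 1e-12
```

The same factoring applies per row/column of a 7x7 kernel, which is why the 8x8 nearest-read version works on hardware without bilinear PCF.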
 
Couldn't Fetch4 help you to some extent in this particular situation?
 