MSAA + HDR benchmarks ?

Jawed · Oct 25, 2005

Hmm, is there an article on B3D that describes explicitly what this test is doing?

I'll be honest and say this particular part of B3D's reviews has never meant a great deal to me - i.e. I've never understood it, whether as a tool for comprehending AA scaling on an architecture, or for comparing competing architectures.

In other words, I just look to the game benchmarks to get a feel for AA performance...

Jawed

KimB · Oct 26, 2005

Xmas said:
An implementation could take advantage of early accept for the Z-test, so it could write more Z-samples than it can test per clock.

I was thinking more along the lines of Equal tests, where no value need ever be written to the z-buffer. It seems to me that that situation could allow for a very nice speedup for hardware that allows it. This situation would be the norm for rendering where you do an initial z-pass, which would be highly useful anyway for optimizing performance for long shaders.
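
As a rough illustration of the z-prepass idea in the post above, here is a toy software model, not a description of any real hardware. The first pass fills the z-buffer using a LESS test, and the second pass shades with an EQUAL test and depth writes disabled. The function names and the fragment list are hypothetical, chosen only for this sketch.

```python
# Toy model: one depth value per pixel, fragments are (pixel_index, depth).
# All names are hypothetical; this only illustrates the two-pass idea.

def z_prepass(fragments, num_pixels):
    """Pass 1: depth only, LESS test, depth writes on, no colour output."""
    zbuf = [float("inf")] * num_pixels
    for pixel, depth in fragments:
        if depth < zbuf[pixel]:
            zbuf[pixel] = depth
    return zbuf

def shading_pass(fragments, zbuf):
    """Pass 2: EQUAL test, depth writes off. Returns the shaded fragments."""
    shaded = []
    for pixel, depth in fragments:
        # The compare still happens for every fragment, but zbuf is only read.
        if depth == zbuf[pixel]:
            shaded.append((pixel, depth))
    return shaded

fragments = [(0, 0.9), (0, 0.5), (0, 0.2), (1, 0.7), (1, 0.3)]
zbuf = z_prepass(fragments, num_pixels=2)
zbuf_before = list(zbuf)
shaded = shading_pass(fragments, zbuf)
```

In this sketch only the front-most fragment per pixel is shaded, and the second pass never modifies the z-buffer, although it still performs one compare per fragment.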

Xmas · Oct 26, 2005

Chalnoth said:
I was thinking more along the lines of Equal tests, where no value need ever be written to the z-buffer. It seems to me that that situation could allow for a very nice speedup for hardware that allows it. This situation would be the norm for rendering where you do an initial z-pass, which would be highly useful anyway for optimizing performance for long shaders.

But OpenGL guy wrote that the z-compare is limiting, not z-write. The Equal test has to be done per sample, you can do early tile reject but not early tile accept. Besides, there's no point increasing the sample output rate for long shaders.

KimB · Oct 26, 2005

Xmas said:
But OpenGL guy wrote that the z-compare is limiting, not z-write. The Equal test has to be done per sample, you can do early tile reject but not early tile accept.

Right, but the z-buffer accesses should be decoupled from the color buffer accesses if you want to make optimal use of the pixel pipelines. But I suppose it's all pretty much a moot point anyway: most of the time when you're not running any sort of pixel shader (at least, not one more than a couple of instructions long), you're going to be completely limited by z-accesses anyway, as the color buffer might not be written to at all (here I'm talking about shadow volume and shadow map rendering).

One place that accelerated color writes with FSAA could help would be with full-screen writes, such as in applying bloom effects, but those will probably have shaders too long for it to make a difference (all it would take is a shader that takes 3 clocks to execute).

Xmas · Oct 26, 2005

You don't want to use a multisampled render target for full screen writes.

KimB · Oct 26, 2005

Xmas said:
You don't want to use a multisampled render target for full screen writes.

You might if you're applying a bloom effect to the framebuffer (i.e. not using a rendered texture).

Xmas · Oct 26, 2005

Chalnoth said:
You might if you're applying a bloom effect to the framebuffer (i.e. not using a rendered texture).

And how do you add a bloom effect without using the rendered scene as a texture?

Mintmaster · Oct 26, 2005

Chalnoth said:
Except if it was samples per clock, it'd be increasing with higher levels of FSAA, as otherwise it'd be no better than supersampling.

You're absolutely right. Not sure what I was thinking.

Chalnoth said:
So it appears that the X1k ROPs are in fact able to output more than just two color samples per clock, but are limited by how quickly the z-compares are done. So it's not worlds better than a normal 2-sample per pipe per clock design, but still better depending upon the efficiency of early-z optimizations.

This will be particularly true for games that make an initial z-only pass.

The thing is that any game which benefits from a z-only pass likely has shaders too complex to finish in a single cycle anyway (like Doom 3, Quake 4), so I don't see the advantage in outputting 4 colour samples per pipe per clock.

At first I was thinking along the lines of what Xmas said:

Xmas said:
An implementation could take advantage of early accept for the Z-test, so it could write more Z-samples than it can test per clock.

This seemed to fit with the idea that ATI implemented a min-max Hi-Z scheme. However, from Dave's numbers it appears that the writing rate of Z-samples is no higher than 2 samples/clock. The only time you'll write colour without writing Z is with alpha blending, but then you've got bandwidth limitations.

This isn't making sense, unless ATI made their colour compressor really fast just for kicks, or if the data is wrong.

Mintmaster · Oct 26, 2005

I did some digging, and it turns out that ATI always did have this ability:

[url=http://www.beyond3d.com/forum//showthread.php?t=6254]Dave Baumann[/url] said:

Contrast that with R350:

Code:

                                 1X       2X       4X       6X
FFP - Pure fillrate         2742.52  1844.94  1652.42  1379.90
FFP - Z pixel rate          2536.93  2401.23  1449.13   742.06
FFP - Single texture        2605.49  1631.09  1434.26  1231.84
FFP - Dual texture          1365.43  1158.88  1148.74  1115.08
FFP - Triple texture         734.46   688.75   685.56   690.39
FFP - Quad texture           598.83   580.48   558.93   563.07
PS 1.1 - Simple             1490.28  1468.88  1452.69  1368.10
PS 1.4 - Simple             1490.28  1468.89  1452.98  1367.33
PS 2.0 - Simple             1490.29  1468.89  1452.63  1367.37
PS 2.0 PP - Simple          1490.30  1468.89  1452.76  1366.10
PS 2.0 - Longer              749.96   744.26   740.97   738.57
PS 2.0 PP - Longer           749.96   744.26   740.97   738.56
PS 2.0 - Longer 4 Reg        749.96   744.25   740.95   738.57
PS 2.0 PP - Longer 4 Reg     749.96   744.25   741.01   738.56
PS 2.0 - Per Pixel Light     111.58   111.28   111.08   110.92
PS 2.0 PP - Per Pixel Light  111.58   111.28   111.08   110.92

ATI has had a single-cycle colour compressor since R300. I suppose colour compression is pretty easy, so they're not wasting many transistors here.
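
One way to read the R350 table is to convert the two fixed-function rows into sample rates by multiplying each pixel rate by its AA sample count. The sketch below does only that. It assumes the table's figures are in Mpixels/s, which the table itself does not state.

```python
# Figures copied from the R350 table above.
# Assumption: units are Mpixels/s (not stated in the table).
AA_LEVELS = [1, 2, 4, 6]
PIXEL_RATES = {
    "pure fillrate": [2742.52, 1844.94, 1652.42, 1379.90],
    "z pixel rate": [2536.93, 2401.23, 1449.13, 742.06],
}

def sample_rates(pixel_rates, aa_levels=AA_LEVELS):
    """Samples per second = pixels per second * AA samples per pixel."""
    return [rate * n for rate, n in zip(pixel_rates, aa_levels)]

colour = sample_rates(PIXEL_RATES["pure fillrate"])
z_only = sample_rates(PIXEL_RATES["z pixel rate"])

for n, c, z in zip(AA_LEVELS, colour, z_only):
    print(f"{n}X  colour {c:8.2f}  z {z:8.2f}")
```

On these numbers the colour sample rate keeps rising through 6X, while the Z sample rate peaks at 4X and then falls, which is consistent with colour output not being the limiting factor.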

KimB · Oct 26, 2005

Yeah, I think nVidia still has the limitation because, from what I understand, their color compression is applied at the memory controller interface, and doesn't have any connection to the ROP units.

But it is starting to sound like the situations where ATI's high no-z fillrate will be of use will be precious few indeed.

Jawed · Oct 26, 2005

Mintmaster said:
The thing is that any game which benefits from a z-only pass likely has shaders too complex to finish in a single cycle anyway (like Doom 3, Quake 4), so I don't see the advantage in outputting 4 colour samples per pipe per clock.

But doesn't a game using a Z-only pass (or multiple passes, one per light) for stencil shadow volumes run 0-length pixel shaders (i.e. there is no pixel shading to be done), whilst performing shadow determination?

:???:

Jawed

KimB · Oct 26, 2005

Jawed said:
But doesn't a game using a Z-only pass (or multiple passes, one per light) for stencil shadow volumes run 0-length pixel shaders (i.e. there is no pixel shading to be done), whilst performing shadow determination?

Yeah, but it's not outputting pixels.

Mintmaster · Oct 26, 2005

Yup.

Jawed, R300 onwards can output many samples per pipe per clock (maybe even 6 if bandwidth isn't an issue) if just writing colour. The Z write rate is slower - 2 samples per pipe per clock. I remember reading 15 Gigasamples/s for R300, and that must be why.

Chalnoth was initially suggesting that if you do a Z-only pass, you don't need to write Z anymore in subsequent passes, so it could be advantageous there. But unless you actually output that many pixels per clock - and you don't in D3/Q4/most games - it doesn't help you.

Still, it's kind of cool that R520 can output 60 gigasamples per second.

Translation: 1024x768 w/4xAA at 20,000 fps
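
The arithmetic behind that translation can be checked with a short sketch. The 60 gigasamples per second figure, the resolution, and the AA level come from the post. The 16 pipes and 625 MHz core clock are assumptions based on the commonly quoted X1800 XT configuration and do not appear in this thread.

```python
def frames_per_second(samples_per_second, width, height, aa_samples):
    """Frames per second if every sample of every pixel is written once."""
    return samples_per_second / (width * height * aa_samples)

def samples_per_pipe_per_clock(samples_per_second, pipes, clock_hz):
    """Peak colour samples each pipe would have to emit per clock."""
    return samples_per_second / (pipes * clock_hz)

fps = frames_per_second(60e9, 1024, 768, 4)

# Assumed, not from the thread: 16 pipes at 625 MHz (X1800 XT).
per_pipe = samples_per_pipe_per_clock(60e9, pipes=16, clock_hz=625e6)

print(round(fps), per_pipe)
```

The result is a little over 19,000 fps, so "20,000 fps" is a rounded figure. Under the assumed pipe count and clock, the rate works out to 6 samples per pipe per clock.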

Jawed · Oct 26, 2005

I'm a little confused when you say samples per pipe per clock if just writing colour.

A sample is a Z/stencil value associated with an AA-sample position, isn't it?

When you say "just writing colour" do you mean that z-tests are off? Why write AA samples, then?

Still confused.

But I get what you mean that once Z is populated colour writes don't need to write Zs - but they still need to be compared, is that right?

Hmm, right now I'm thinking this is more fiddly than it's worth!

Jawed

KimB · Oct 26, 2005

Jawed said:
When you say "just writing colour" do you mean that z-tests are off? Why write AA samples, then?

Well, I think we've come to the conclusion that this implementation doesn't really help ATI that much. But it would still help, for example, when applying a full-screen effect to the normal framebuffer. But this situation is going to pretty much disappear as true HDR becomes common.

RejZoR · Oct 30, 2005

I don't know what's the fuss about HDR+AA. I was able to run Lost Coast with HDR and AA on R9600 XTR at quite playable framerate (everything to high except models and textures to medium at 1024x768).

KimB · Oct 30, 2005

RejZoR said:
I don't know what's the fuss about HDR+AA. I was able to run Lost Coast with HDR and AA on R9600 XTR at quite playable framerate (everything to high except models and textures to medium at 1024x768).

Because that doesn't use an FP16 framebuffer, and thus Lost Coast's HDR implementation is very limited.

Moloch · Oct 30, 2005

Limited in dynamic range, or what?

KimB · Oct 30, 2005

To tell you the truth, I don't know exactly. But yes, limited dynamic range is the most likely. There may also be limitations on precision.
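
To make the general range and precision difference concrete, here is a small sketch comparing an 8-bit fixed-point channel with an FP16 channel. It says nothing about how Lost Coast actually stores its HDR data. The helper names are hypothetical, and the FP16 round-trip uses the half-precision format code of Python's standard `struct` module.

```python
import struct

def to_unorm8(x):
    """Quantise to an 8-bit fixed-point channel, which clamps to [0, 1]."""
    clamped = min(max(x, 0.0), 1.0)
    return round(clamped * 255) / 255

def to_fp16(x):
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for value in (0.1, 1.0, 4.0, 1000.0):
    print(value, to_unorm8(value), to_fp16(value))
```

Values above 1.0 are clamped by the 8-bit channel but survive in FP16, which has a largest finite value of 65504. FP16 still rounds values such as 0.1, so it carries its own precision limits.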

Moloch · Oct 30, 2005

Chalnoth said:
To tell you the truth, I don't know exactly. But yes, limited dynamic range is the most likely. There may also be limitations on precision.

What are the results of the limits on precision?
