MSAA + HDR benchmarks?

Hmm, is there an article on B3D that describes explicitly what this test is doing?

I'll be honest and say this particular part of B3D's reviews has never meant a great deal to me - i.e. I've never understood it, whether as a tool for comprehending AA scaling on an architecture, or for comparing competing architectures.

In other words, I just look to the game benchmarks to get a feel for AA performance...

Jawed
 
Xmas said:
An implementation could take advantage of early accept for the Z-test, so it could write more Z-samples than it can test per clock.
I was thinking more along the lines of Equal tests, where no value need ever be written to the z-buffer. It seems to me that that situation could allow for a very nice speedup for hardware that allows it. This situation would be the norm for rendering where you do an initial z-pass, which would be highly useful anyway for optimizing performance for long shaders.
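A toy sketch of the idea, for illustration only: a 1D "framebuffer", a depth-only first pass, then a shading pass that uses an Equal test with depth writes off. The function names and the fragment tuple layout are invented here and say nothing about how any real hardware or API does it.

```python
def z_prepass(fragments, width, far=1.0):
    """Depth-only pass: keep the nearest depth per pixel, no colour, no shading."""
    depth = [far] * width
    for x, z, _material in fragments:
        if z < depth[x]:          # LESS test, with z-write enabled
            depth[x] = z
    return depth


def shade_pass(fragments, depth, shade):
    """Colour pass: EQUAL test against the pre-filled depth, z-write disabled."""
    color = [None] * len(depth)
    shaded = 0
    for x, z, material in fragments:
        if z == depth[x]:         # only the visible fragment passes
            color[x] = shade(material)
            shaded += 1           # the (expensive) shader runs once per visible pixel
    return color, shaded


# Hypothetical scene: two fragments overlap at pixel 0, one covers pixel 1.
frags = [(0, 0.5, "red"), (0, 0.2, "blue"), (1, 0.7, "green")]
depth = z_prepass(frags, width=2)
color, shaded = shade_pass(frags, depth, shade=str.upper)
```

The depth buffer is read but never modified in the second pass, which is the case being discussed: compares still happen per sample, writes do not.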
 
Chalnoth said:
I was thinking more along the lines of Equal tests, where no value need ever be written to the z-buffer. It seems to me that that situation could allow for a very nice speedup for hardware that allows it. This situation would be the norm for rendering where you do an initial z-pass, which would be highly useful anyway for optimizing performance for long shaders.
But OpenGL guy wrote that the z-compare is limiting, not z-write. The Equal test has to be done per sample, you can do early tile reject but not early tile accept. Besides, there's no point increasing the sample output rate for long shaders.
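To make the reject/accept asymmetry concrete, here is a minimal sketch of a coarse per-tile test. It assumes each tile stores a min and a max depth, which is an assumption for illustration and not a documented detail of any particular chip; the names are made up.

```python
from enum import Enum


class TileResult(Enum):
    REJECT = "reject"          # every sample in the tile fails
    ACCEPT = "accept"          # every sample in the tile passes
    PER_SAMPLE = "per-sample"  # each sample must be compared individually


def tile_test(func, frag_min, frag_max, tile_min, tile_max):
    """Coarse test of an incoming depth range against a tile's stored min/max."""
    if func == "less":
        if frag_max < tile_min:
            return TileResult.ACCEPT   # in front of everything stored
        if frag_min >= tile_max:
            return TileResult.REJECT   # behind (or equal to) everything stored
        return TileResult.PER_SAMPLE
    if func == "equal":
        if frag_max < tile_min or frag_min > tile_max:
            return TileResult.REJECT   # ranges are disjoint, no sample can match
        # Overlapping ranges prove nothing about equality, so there is no
        # general early accept; the degenerate all-equal case is ignored here.
        return TileResult.PER_SAMPLE
    raise ValueError(f"unsupported depth function: {func}")
```

With "less" the tile can resolve either way; with "equal" the best the coarse test can do is reject.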
 
Xmas said:
But OpenGL guy wrote that the z-compare is limiting, not z-write. The Equal test has to be done per sample, you can do early tile reject but not early tile accept.
Right, but the z-buffer accesses should be decoupled from the color buffer accesses if you want to make optimal use of the pixel pipelines. But I suppose it's all pretty much a moot point anyway: most of the time when you're not running any sort of pixel shader (at least, not one more than a couple of instructions long), you're going to be completely limited by z-accesses anyway, as the color buffer might not be written to at all (here I'm talking about shadow volume and shadow map rendering).

One place where accelerated color writes with FSAA could help is full-screen writes, such as applying bloom effects, but those will probably have shaders too long for it to make a difference (all it would take is a shader that takes 3 clocks to execute).
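A back-of-the-envelope model of that last point, assuming the shader and the ROP are perfectly pipelined so the per-pixel cost is simply whichever stage is slower. That is a simplification, and the function name and parameters are hypothetical.

```python
import math


def clocks_per_pixel(shader_clocks, aa_samples, rop_samples_per_clock):
    """Per-pixel cost when shading and sample output overlap perfectly."""
    rop_clocks = math.ceil(aa_samples / rop_samples_per_clock)
    return max(shader_clocks, rop_clocks)


# 6x FSAA, trivial 1-clock shader: a 2-sample/clock ROP is the bottleneck.
slow_rop = clocks_per_pixel(1, aa_samples=6, rop_samples_per_clock=2)
fast_rop = clocks_per_pixel(1, aa_samples=6, rop_samples_per_clock=6)

# Same comparison with a 3-clock shader: the faster ROP no longer helps.
slow_rop_long = clocks_per_pixel(3, aa_samples=6, rop_samples_per_clock=2)
fast_rop_long = clocks_per_pixel(3, aa_samples=6, rop_samples_per_clock=6)
```

Under this model, once the shader takes as many clocks as the slower ROP needs to drain its samples, the extra colour output rate is hidden entirely.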
 
Xmas said:
You don't want to use a multisampled render target for full screen writes.
You might if you're applying a bloom effect to the framebuffer (i.e. not using a rendered texture).
 
Chalnoth said:
You might if you're applying a bloom effect to the framebuffer (i.e. not using a rendered texture).
And how do you add a bloom effect without using the rendered scene as a texture?
 
Chalnoth said:
Except if it was samples per clock, it'd be increasing with higher levels of FSAA, as otherwise it'd be no better than supersampling.
You're absolutely right. Not sure what I was thinking.
Chalnoth said:
So it appears that the X1k ROPs are in fact able to output more than just two color samples per clock, but are limited by how quickly the z-compares are done. So it's not worlds better than a normal 2-sample per pipe per clock design, but still better depending upon the efficiency of early-z optimizations.

This will be particularly true for games that make an initial z-only pass.
The thing is that any game which benefits from a z-only pass likely has shaders too complex to finish in a single cycle anyway (like Doom 3, Quake 4), so I don't see the advantage in outputting 4 colour samples per pipe per clock.

At first I was thinking along the lines of what Xmas said:
Xmas said:
An implementation could take advantage of early accept for the Z-test, so it could write more Z-samples than it can test per clock.
This seemed to fit with the idea that ATI implemented a min-max Hi-Z scheme. However, from Dave's numbers it appears that the writing rate of Z-samples is no higher than 2 samples/clock. The only time you'll write colour without writing Z is with alpha blending, but then you've got bandwidth limitations.

This isn't making sense, unless ATI made their colour compressor really fast just for kicks, or if the data is wrong.
 
I did some digging, and it turns out that ATI always did have this ability:
[url=http://www.beyond3d.com/forum//showthread.php?t=6254]Dave Baumann[/url] said:
Contrast that with R350:

Code:
                                 1X       2X       4X       6X
FFP - Pure fillrate         2742.52  1844.94  1652.42  1379.90
FFP - Z pixel rate          2536.93  2401.23  1449.13   742.06
FFP - Single texture        2605.49  1631.09  1434.26  1231.84
FFP - Dual texture          1365.43  1158.88  1148.74  1115.08
FFP - Triple texture         734.46   688.75   685.56   690.39
FFP - Quad texture           598.83   580.48   558.93   563.07
PS 1.1 - Simple             1490.28  1468.88  1452.69  1368.10
PS 1.4 - Simple             1490.28  1468.89  1452.98  1367.33
PS 2.0 - Simple             1490.29  1468.89  1452.63  1367.37
PS 2.0 PP - Simple          1490.30  1468.89  1452.76  1366.10
PS 2.0 - Longer              749.96   744.26   740.97   738.57
PS 2.0 PP - Longer           749.96   744.26   740.97   738.56
PS 2.0 - Longer 4 Reg        749.96   744.25   740.95   738.57
PS 2.0 PP - Longer 4 Reg     749.96   744.25   741.01   738.56
PS 2.0 - Per Pixel Light     111.58   111.28   111.08   110.92
PS 2.0 PP - Per Pixel Light  111.58   111.28   111.08   110.92
ATI had a single cycle colour compressor since R300. I suppose colour compression is pretty easy, so they're not wasting many transistors here.
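One way to read the table is to convert pixel rates into sample rates by multiplying by the AA level. A quick sketch using the two top rows, assuming the figures are millions of pixels per second (the table itself doesn't state units):

```python
AA_LEVELS = (1, 2, 4, 6)

# Rows copied from the R350 table above.
PURE_FILL = (2742.52, 1844.94, 1652.42, 1379.90)
Z_PIXEL = (2536.93, 2401.23, 1449.13, 742.06)


def sample_rates(pixel_rates, aa_levels=AA_LEVELS):
    """Pixel rate times samples per pixel gives the sample rate per AA level."""
    return [rate * aa for rate, aa in zip(pixel_rates, aa_levels)]


for name, row in (("Pure fillrate", PURE_FILL), ("Z pixel rate", Z_PIXEL)):
    print(name, [round(r) for r in sample_rates(row)])
```

The colour sample rate keeps climbing up to 6X, while the Z sample rate drops back at 6X, which fits the idea that colour output isn't the limiting part.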
 
Yeah, I think nVidia still has the limitation because, from what I understand, their color compression is applied at the memory controller interface, and doesn't have any connection to the ROP units.

But it is starting to sound like the situations where ATI's high no-z fillrate will be of use will be precious few indeed.
 
Mintmaster said:
The thing is that any game which benefits from a z-only pass likely has shaders too complex to finish in a single cycle anyway (like Doom 3, Quake 4), so I don't see the advantage in outputting 4 colour samples per pipe per clock.
But doesn't a game using a Z-only pass (or multiple passes, one per light) for stencil shadow volumes run 0-length pixel shaders (i.e. there is no pixel shading to be done), whilst performing shadow determination?

:???:

Jawed
 
Jawed said:
But doesn't a game using a Z-only pass (or multiple passes, one per light) for stencil shadow volumes run 0-length pixel shaders (i.e. there is no pixel shading to be done), whilst performing shadow determination?
Yeah, but it's not outputting pixels :)
 
Yup.

Jawed, R300 onwards can output many samples per pipe per clock (maybe even 6 if bandwidth isn't an issue) if just writing colour. The Z write rate is slower - 2 samples per pipe per clock. I remember reading 15 Gigasamples/s for R300, and that must be why.

Chalnoth was initially suggesting that if you do a Z-only pass, you don't need to write Z anymore in subsequent passes, so it could be advantageous there. But unless you actually output that many pixels per clock - and you don't in D3/Q4/most games - it doesn't help you.

Still, it's kind of cool that R520 can output 60 gigasamples per second. :D
Translation: 1024x768 w/4xAA at 20,000 fps :LOL: :LOL:
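For anyone checking the arithmetic, a small sketch. The pipe counts and clocks are my assumptions (16 pipes at 625 MHz for R520, 8 pipes at 325 MHz for R300), as is the 6 samples per pipe per clock, which is only the "maybe even 6" figure from above.

```python
def samples_per_second(pipes, clock_hz, samples_per_pipe_per_clock):
    """Peak colour sample rate, ignoring bandwidth and everything else."""
    return pipes * clock_hz * samples_per_pipe_per_clock


def frames_per_second(sample_rate, width, height, aa_samples):
    """Frames per second if each frame writes every sample exactly once."""
    return sample_rate / (width * height * aa_samples)


# Assumed configurations, not taken from the thread.
r520 = samples_per_second(pipes=16, clock_hz=625e6, samples_per_pipe_per_clock=6)
r300 = samples_per_second(pipes=8, clock_hz=325e6, samples_per_pipe_per_clock=6)

fps = frames_per_second(r520, width=1024, height=768, aa_samples=4)
print(f"R520: {r520 / 1e9:.1f} Gsamples/s, R300: {r300 / 1e9:.1f} Gsamples/s")
print(f"1024x768 4xAA: about {fps:,.0f} fps")
```

With those assumptions the fps figure comes out a little under the rounded 20,000.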
 
I'm a little confused when you say samples per pipe per clock if just writing colour.

A sample is a Z/stencil value associated with an AA-sample position, isn't it?

When you say "just writing colour" do you mean that z-tests are off? Why write AA samples, then?

Still confused.

But I get what you mean that once Z is populated colour writes don't need to write Zs - but they still need to be compared, is that right?

Hmm, right now I'm thinking this is more fiddly than it's worth!

Jawed
 
Jawed said:
When you say "just writing colour" do you mean that z-tests are off? Why write AA samples, then?
Well, I think we've come to the conclusion that this implementation doesn't really help ATI that much. But it would still help, for example, when applying a full-screen effect to the normal framebuffer. But this situation is going to pretty much disappear as true HDR becomes common.
 
I don't know what the fuss is about HDR+AA. I was able to run Lost Coast with HDR and AA on an R9600 XTR at a quite playable framerate (everything set to high except models and textures at medium, at 1024x768).
 
RejZoR said:
I don't know what the fuss is about HDR+AA. I was able to run Lost Coast with HDR and AA on an R9600 XTR at a quite playable framerate (everything set to high except models and textures at medium, at 1024x768).
Because that doesn't use an FP16 framebuffer, and thus Lost Coast's HDR implementation is very limited.
 
To tell you the truth, I don't know exactly. But yes, limited in dynamic range is the most likely. There may also be limitations on precision.
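As a rough illustration of the two kinds of limit, here is a sketch comparing an FP16 channel with an 8-bit fixed-point channel. The 8-bit format is purely a hypothetical stand-in for "not FP16"; I'm not claiming that is what Lost Coast actually renders to.

```python
import struct


def to_fp16(x):
    """Round-trip a value through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_unorm8(x):
    """Clamp to [0, 1] and quantize to 8 bits, like a plain integer channel."""
    clamped = min(max(x, 0.0), 1.0)
    return round(clamped * 255) / 255


# Dynamic range: a light four times brighter than "white".
bright_fp16 = to_fp16(4.0)      # survives
bright_u8 = to_unorm8(4.0)      # clamps to 1.0

# Precision: FP16 has limits too; large values lose their low bits.
big_fp16 = to_fp16(2049.0)
```

The first pair shows the range problem (values above 1.0 are simply lost in the fixed-point channel); the last line shows that FP16 has its own precision limits at large magnitudes.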
 
Chalnoth said:
To tell you the truth, I don't know exactly. But yes, limited in dynamic range is the most likely. There may also be limitations on precision.
:)
What are the results of the limits on precision?
 