Futuremark: 3DMark06

Dave Baumann said:
Yeah and no. With PCF, the 4 taps, the depth compare and the averaged value are all a single operation and roughly the same cost as a single sample - so using multiples of those is likely to result in a better quality output. With Fetch4, 4 taps is a sample; the cost of fetching the 4 taps is the same as a single sample, but the compare and average have to be done in the shader, which will probably end up being negligible overall. The point being, given that 4 taps per sample is more or less the same cost as just 1 tap per sample, why not do it with sparse sampling?
The reason you wouldn't do this is that four contiguous taps in a sparse filter are meaningless.

I'm talking about shadow map filtering beyond PCF - PCF is a technique predicated on taking contiguous samples from the shadow map, and one that's hard to tweak for quality. Which is why 3DMark06 doesn't use PCF in graphics tests 3 and 4.

The technique presented by ATI in the Siggraph presentation (as well as other places) uses a sparse-sampled kernel in preference to a large-density PCF filtering technique.

From page 18 of the presentation linked above:



• Grid-based PCF kernel needs to be fairly large to eliminate aliasing
– Particularly in cases with small detail popping in and out of the underlying hard shadow.

• Irregular sampling allows us to get away with fewer samples
– Error is still present, only the error is "unstructured" and thus less noticeable
– Per-pixel spatially varying rotation of kernel is used to provide even more variation.

Multiple-contiguous sample taps don't make sense in a sparse-sampling kernel. At least, not as far as I can see.
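
Purely as an illustration (not Futuremark's or ATI's actual kernel), a minimal CPU-side C++ sketch of this kind of sparse, per-pixel-rotated filter might look like the following; the tap count, the offset table and the hash are made-up assumptions:

Code:
#include <algorithm>
#include <cmath>
#include <cstdint>

static const int   kTaps = 12;                 // sparse kernel, far fewer taps than a dense PCF grid
static const float kOffsets[kTaps][2] = {      // hand-picked irregular (Poisson-like) offsets
    { 0.91f,  0.21f}, {-0.53f,  0.77f}, { 0.10f, -0.95f}, {-0.88f, -0.31f},
    { 0.45f,  0.60f}, {-0.20f, -0.40f}, { 0.70f, -0.55f}, {-0.65f,  0.15f},
    { 0.25f,  0.95f}, {-0.95f,  0.50f}, { 0.55f, -0.10f}, { 0.05f,  0.35f}
};

// Per-pixel pseudo-random rotation angle (hypothetical hash).
static float RotationAngle(int x, int y)
{
    uint32_t h = static_cast<uint32_t>(x) * 73856093u ^ static_cast<uint32_t>(y) * 19349663u;
    return (h % 1024u) * (6.2831853f / 1024.0f);
}

// depthMap: light-space depth buffer of size w*h; (u, v) in texels;
// receiverDepth in the same units; (px, py) is the screen pixel.
float SparseShadow(const float* depthMap, int w, int h,
                   float u, float v, float receiverDepth,
                   int px, int py, float radius)
{
    const float a = RotationAngle(px, py);
    const float c = std::cos(a), s = std::sin(a);
    float lit = 0.0f;
    for (int i = 0; i < kTaps; ++i) {
        // Rotate the irregular offset per pixel, then point-sample and compare.
        float ox = (c * kOffsets[i][0] - s * kOffsets[i][1]) * radius;
        float oy = (s * kOffsets[i][0] + c * kOffsets[i][1]) * radius;
        int tx = std::min(std::max(int(u + ox), 0), w - 1);
        int ty = std::min(std::max(int(v + oy), 0), h - 1);
        lit += (receiverDepth <= depthMap[ty * w + tx]) ? 1.0f : 0.0f;
    }
    return lit / kTaps;   // average of independent, non-contiguous taps
}

The point is that every tap is an independent point sample at an irregular position; there is no 2x2 footprint anywhere for a contiguous 4-tap fetch to exploit.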

Jawed


 
Xmas said:
Soft shadowing might be a banner case for DB, but hardly for per-pixel DB. With shadows you usually have large contiguous areas that are completely in or out. In fact it is one of those rare cases where NVidia's DB can be a huge performance gain despite its large granularity.
It's why the X1k material showed a tree (with real geometrical branches) being shadowed when discussing DB performance.

Jawed
 
Dave Baumann said:
Yeah and no. With PCF, the 4 taps, the depth compare and the averaged value are all a single operation and roughly the same cost as a single sample - so using multiples of those is likely to result in a better quality output. With Fetch4, 4 taps is a sample; the cost of fetching the 4 taps is the same as a single sample, but the compare and average have to be done in the shader, which will probably end up being negligible overall. The point being, given that 4 taps per sample is more or less the same cost as just 1 tap per sample, why not do it with sparse sampling?
Saying the compare and average are negligible overall is almost like saying shadow mapping itself is negligible overall, which we know it is not.
Shadow mapping only consists of three operations: sample, compare, average. And then the result is multiplied with the light color/intensity and passed on to the light interaction part of the shader.
PCF combines those three into a single operation (though I'm not convinced it's single-cycle [like point sampling is], which would be bandwidth limited anyway). Fetch4 accelerates sampling but does nothing for the compare and average steps.

Comparing four samples is a vec4 sub followed by a vec4 cmp, and averaging multiple samples is an add4/dp4 cascade (unweighted, which in this case is fine).
So if Futuremark wanted to use Fetch4 for the PS3.0 tests in 3DMark06, they would have had the controversial choice of taking x < 16 Fetch4 samples to somehow match the average quality of 16 point samples, doing less texture sampling and more arithmetic.

Taking 16 Fetch4 samples instead of 16 point samples would have increased quality, but also the workload by 12 sub4, 12 cmp4 and 12 add4. In any case, they would have done the same for PCF, which actually means that, relatively speaking, ATI is better off with Futuremark not using Fetch4/PCF in the PS3.0 test at all.
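
For concreteness, a rough C++ sketch of that extra in-shader work (the sub/cmp plus the unweighted average) is below; Fetch4Depths() is just a stand-in for the Fetch4 texture instruction, not a real API, and the types are illustrative.

Code:
#include <algorithm>
#include <array>

using float4 = std::array<float, 4>;

// Stand-in for a Fetch4 texture instruction: one fetch returns the four
// depths of the 2x2 footprint at (x, y) instead of a filtered value.
float4 Fetch4Depths(const float* depthMap, int w, int h, int x, int y)
{
    auto at = [&](int sx, int sy) {
        sx = std::min(std::max(sx, 0), w - 1);
        sy = std::min(std::max(sy, 0), h - 1);
        return depthMap[sy * w + sx];
    };
    return { at(x, y), at(x + 1, y), at(x, y + 1), at(x + 1, y + 1) };
}

// The work Fetch4 leaves to the shader: roughly a sub4, a cmp4 and a dp4.
float CompareAndAverage(const float4& depths, float receiverDepth)
{
    float4 lit;
    for (int i = 0; i < 4; ++i)
        lit[i] = (depths[i] - receiverDepth >= 0.0f) ? 1.0f : 0.0f;  // sub + cmp per tap
    return 0.25f * (lit[0] + lit[1] + lit[2] + lit[3]);             // unweighted average
}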
 
When I said the compare/average for Fetch4 was likely to be fairly negligible, I think we're looking at about 2 cycles on RV530-style hardware, some of which will be hidden by the instruction scheduling.
 
dizietsma said:
"This totally nullifies Ati's effort put in optimising bandwith usage"

and

"Because few people will bother doing separate game tests (mostly reviewers) most people will just run it, got their score and an idea about their systems capabilities."

For your second point

I can assure you that most people, presumably gamers, when assessing their system's capabilities, will run the standard test, see 25 fps, and then not decide to apply AA/AF on top of that just to decrease their fps further. This is a theoretical test, not a practical one.

On your first point

No, because the SM3 tests, where AA cannot be applied for NVidia, are heavily GPU biased and not bandwidth limited at all, I think. The SM2 tests might be, but then the capability of each card can be measured in turn.

Well, I'm starting to accept that 3DMark 2006 is just the way it is. :)
Looking on the bright side, X1800 users can test not just how their card will suck big time running future games, but also how much worse it will perform with AA enabled. :)
 
Do we have a consensus that the shadowing in graphics tests 3 and 4 is a level playing field for ATI and NVidia hardware?

Jawed
 
Dave Baumann said:
When I said the compare/average for Fetch4 was likely to be fairly negligible, I think we're looking at about 2 cycles on RV530-style hardware, some of which will be hidden by the instruction scheduling.
Don't get me wrong, I do agree that Fetch4 and PCF should be used when available. However, it wouldn't have been as easy as saying "let's enable Fetch4", because even if the three additional samples per fetch cost only half a cycle, that's going to be 8 cycles per pixel for a 16-sample sparse kernel. So Futuremark would have had to adjust the number of samples to somehow get comparable quality in all three paths. Which isn't always trivial.
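
For what it's worth, a back-of-the-envelope version of that cycle count, taking the half-cycle overhead above as a given (it's the assumption stated here, not a measurement):

Code:
#include <cstdio>

int main()
{
    const int   sparseTaps        = 16;    // sparse kernel size discussed above
    const float extraCyclesPerTap = 0.5f;  // assumed compare/average overhead per Fetch4 fetch
    std::printf("extra ALU cycles per pixel: %.1f\n", sparseTaps * extraCyclesPerTap);  // prints 8.0
    return 0;
}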

btw, I also think that the decision to use either F4/PCF or a 4-tap rotated kernel in the PS2.0 test is somewhat poor in terms of comparable quality. However I'm not sure there's a better one. 3-tap maybe?
And the point sampling can be hidden by a sufficiently arithmetically complex shader, too.
 
Chalnoth said:
Like what, specifically?

Brilinear, 3d murk, clip planes... And just generally poor IQ, not because the cards couldn't look better, but because looking better wouldn't have been competitive.
 
Cowboy X said:
Brilinear, 3d murk, clip planes...
Okay, I don't quite see how you'd call the first one cheating. I also have no idea what you're talking about with "3d murk," but the third was, from what I remember, only done in 3DMark. You were specifically being asked to call attention to cheating in non-DX9 games.
 
Yet more irony:

3DMark 2006 will probably show X1900 to greater advantage than any other available benchmark.
 
mrcorbo said:
Nice tease. :D

Would have been even better with the dramatic pauses: "There....is.....an...oth....errrr."


So what is different between the XT and XTX besides the few extra clocks?
 
I was wondering if the 3DMark rep could be so kind as to explain why rendering an entire 3D scene frame by frame using only the CPU is relevant to any kind of game. Thanks.
 