NVIDIA is NOT going to explain 2xAA on the FX...

Dave H said:
If that's what you see, you need new glasses. Every single one (of course there's only 5 of them) of the GFfx web reviews has concentrated almost exclusively on 4xMSAA when doing AA benchmarking. Yes, Nvidia has pushed 2xMSAA in their own comparisons and in that MaxPC preview of a beta card, but none of the independent reviews has fallen for that "trick". In fact, the only two full-blown reviews to do any 2xMSAA benchmarks at all were Anand and [H], both of which published screenshots of 2xMSAA which make it look worse than it actually is.

I agree that the major reviewers have been astute enough to see through the ploy and not fall for it, either. I'll bet they liked it no more than I do, though. The fact that some, or even most, of the reviewers see through it is to their credit--but it doesn't excuse nVidia for trying it in the first place--as far as I'm concerned.

And Anand was quite right to ignore it, as it is doubtful that what's being done at 2x FSAA is even FSAA. nVidia's not going to talk about it, according to [H], so we'll just have to figure it out on our own, won't we? (I'm quite sure I'm not the only one who wants to pin this down.)

To suggest that the meagre "promotion" of 2xMSAA that Nvidia has engaged in will influence anyone's buying decision, or make anyone more likely to play at 2xAA for that matter, is really silly. The only people who would even be exposed to those pre-release 2xAA benchmarks--i.e. the sort of people who follow the latest news on pre-release hardware--are exactly the sort of people who will have read all the reviews on release, and know all about the 2xAA issue. (Well, they might not be aware of the fact that the screenshots are lower quality than the actual output.) No one is in danger of being fooled here. The worst thing that can be said about it (IMO) is that it may have misled people into waiting for GFfx instead of buying a 9700 Pro a month ago.

You do realize, don't you, that entirely too many people judge 3D cards on as flimsy and artificial a basis as 3D Mark 2001 SE results? Believe it or not, they do. At any rate, none of what you said actually addresses the issue of nVidia's culpability and complicity in this 2x FSAA/Quincunx debacle. If nVidia's not doing FSAA there, but doing something else, post filter, for instance, and attempting to provide a "look-alike" mode to pass off as FSAA simply to garner positive comparative performance reports--that's cheating--and I think deserves at least some sort of attention. I certainly can find no rational reason whatever for letting them off the hook.

Look at the contrast between 3dfx's use of the post filter and nVidia's apparent use of it: 3dfx was very open about it all and explained all of the details up front (but 3dfx never used it to simulate FSAA, either), whereas nVidia isn't talking, citing "trade secret" nonsense, which is double talk for "We don't want it to get out that we aren't actually using FSAA in these modes which represent themselves as FSAA."


Nvidia claims the method being used is a sort of trade secret. Yes, this is incomprehensibly lame. But it certainly in no way precludes Nvidia from releasing a utility to allow 2xMSAA screenshots to be taken which will match what actually shows up on-screen--which, after all, is the important thing here. I would be very surprised if Nvidia does not do this, because it's obviously possible (as with V3), and ostensibly in Nvidia's interests. If it's confirmed that Nvidia refuses to help with proper screenshots, then I will take your paranoid view on the matter. As the much more likely outcome is that Nvidia releases such a utility in the near future, and certainly in time for retail card reviews at the end of the month, I think it's much smarter to wait until then so we can see what 2xMSAA on GFfx actually looks like before bashing it as some Communist plot. (Crazy idea, I know.)

Sorry, but your trick of trying to overemphasize what I've said in an effort to discredit the question itself won't work. I'm *absolutely positive* screen shot software will be forthcoming which will capture the post filter results--the exact same thing happened when 3dfx first used the post filter with the V3 (as I've said, oh, umpteen times.) That will not explain what they are doing, nor will it legitimize it as FSAA. That's the point. I don't know any communists, do you? *chuckle*


(a) No, it's not clear. Despite what you've implied [H]ocp says about on-screen 2xMSAA quality, what they actually say is that actual in-game IQ is "certainly not as lacking" as they claimed in the initial review. Albeit "not up to par" with R300 2xMSAA. Again, such a characterization is completely consistent with the only difference between the two being that R300's is gamma corrected and GFfx's is not. Or maybe it's more than that. Point is, absolutely nowhere does [H] imply that the IQ is "very poor" or anything close to it.

Quit apologizing for nVidia, will you? I'm getting queasy just reading it. The point--which you thickheadedly fail to grasp--is that the difference between on-screen image quality and screen shot image quality was *so slight* that *none* of the reviewers realized there was a *difference* between the two until *nVidia* pointed it out. Contrast that with 3dfx's use of the post filter (never used for FSAA by 3dfx), in which every reviewer who initially published screen shots *commented on the fact in the initial product reviews* that what they saw on the screen was much different than what the screen shots portrayed. If not for nVidia bringing it up, no one would ever have noticed, except to say that nVidia had no 2x or QC FSAA modes--which, as far as I'm concerned, is still accurate.

Now, if nVidia had done the intelligent thing and introduced its use of the post filter as a *new feature*--identifying the post filter in the process, calling "2x FSAA" something like "PFAA" or similar, presenting these as procedures which nVidia thinks look as good as 2x and QC FSAA, and admitting it has *dropped* 2x FSAA and QC in favor of this "new" procedure--then FINE, I have NO complaints. But nVidia hasn't done that, have they?

(b) Actually, there are some indications that it is at least somewhat similar to what they're doing at 4xMSAA. In particular, check out the following bench in the [H] review:

Actually, there are not, because 4x FSAA shows up in the same frame-grab software that *does not* capture 2x FSAA or QC, and nVidia has already *admitted* it is not using the same methods. Interesting that they've always been very forthcoming on their FSAA methods--until now. In fact, I've read many a bragging PR where they went into copious detail on their methods of FSAA, never fearing a "competitor" would grab the action *chuckle*.... (What a lame excuse that is--as if ATI *needs* to emulate nVidia's FSAA, which is already much worse than what ATI is doing--and btw, ATI has entire *web site sections* devoted to explaining the general principles behind its FSAA. The idea that nVidia would use this excuse to duck the question is absurd--but that's exactly what they've done.)

BTW - when looking for the above bench, I noticed that [H] has already updated their review with proper 2xMSAA screens. And the image quality is certainly not "very poor", although it is indeed noticeably worse than R300's 2xMSAA. To my non-expert eyes, it is abundantly clear that the GFfx is doing something very close to normal 2xAA, and it appears that the only difference is, indeed, the lack of gamma correction.

I'm delighted you think so and are happy. Be happy with your delusion if you want. Yes, it *is* much worse than ATI's 2x FSAA--probably because *it isn't FSAA*, but you have certainly convinced me that you don't care about that--at all.

Why is it that people always want to shoot the messenger? *chuckle*
 
I still do not understand you, WaltC.

Why is a post filter by definition "not AA"? I assume you haven't read my post yet?... I don't see any consideration of my points evident in your name-calling of Dave H above.

It doesn't AA the contents of memory on the card, but that has nothing to do with whether it does AA the output to the screen. If the math and the data used are equivalent, the output to the monitor is equivalent... i.e., the output IS the output of 2x MSAA. This isn't magic, despite the name of the card that did such a technique first.

This is all that really matters to the user of real time applications of it (they don't care about frame buffer contents, they care about monitor output).

We have indication that the data is the same...see my prior post.

We have no reason to assume the math isn't the same (AFAICS the Quincunx blur filter actually uses more math than simple 2x AA...so why assume less couldn't be done)...again, see my prior post.
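
To make the "math is what matters" point concrete, here's a toy sketch (Python, purely illustrative--the tiny buffers and the simple averaging blend are my own assumptions, not anything nVidia has confirmed):

import numpy as np

# Two subsample buffers for a tiny 2x2 "screen" (made-up data).
rng = np.random.default_rng(0)
sample_a = rng.integers(0, 256, size=(2, 2, 3))  # 1st subsample per pixel (RGB)
sample_b = rng.integers(0, 256, size=(2, 2, 3))  # 2nd subsample per pixel (RGB)

# Conventional resolve: blend into the framebuffer, RAMDAC reads the result.
resolved = (sample_a + sample_b) // 2

# Post filter resolve: leave both buffers alone, average them during scanout.
scanout = (sample_a + sample_b) // 2

# Same data, same math => the signal reaching the monitor is identical.
assert (resolved == scanout).all()

Where the blend happens changes nothing about what the monitor receives.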

So why are you calling him "thick headed" for addressing your assertion that poor 2x AA (compared to the R300) disqualifies it as AA? Perhaps the name-calling could wait a bit, and you could provide justification for your assumptions beyond repeating the ones I've addressed repeatedly now?

Your justification for this is that the 3dfx post filter resulted in so much improved quality compared to direct unprocessed frame buffer grabbing that the difference was clear to reviewers. At the same time you keep insisting that a post filter method of AA is some sort of cheat, simply because either the difference is not as apparent as in the case of the 3dfx post filter, or because Quincunx uses the post filter to reduce texture quality. You have absolutely no qualms about the contradiction that the only basis for this assumption is that it is using a post filter, and that it therefore can only be one or the other of what was done before...

You have evidence of a post filter increasing image quality, and evidence of a post filter reducing image quality, therefore based on this evidence, post filters can't do AA. As I've pointed out, I don't get where you are getting that conclusion from. Please enlighten me.

You keep using the terms "blur" and "cheat" in combination with "2x AA" as if Quincunx using the post filter for a blur is the only mathematical operation a post filter can do, when you have provided absolutely no basis for that belief at this time.

If you simply repeat that the difference between 2x AA and partial buffer data isn't as marked as the difference between the post filter 3dfx used and partial buffer data, I'm just going to end up assuming you don't want to put any thought into this. So, please respond in some other way, hopefully one that makes sense, and hopefully one that acknowledges that the math has more bearing on this discussion than where the math is done.
 
Thx Demalion! :)

In any case, now that the corrected pics are up I think it's pretty much undisputed that the NV30 does proper 2xMSAA, so I'll go ahead and consider the main thread of this discussion over.

To clean up a couple loose ends:

Why would 4xMSAA also suffer from the larger framebuffer footprint? Perhaps three of the four samples are blended in the framebuffer, and the fourth is blended in the post filter. This seems very reasonable to me--on the one hand, it's doubtful the post filter could easily blend all four samples; on the other, if you've already got the functionality, why not use it? The evidence against this theory is that Nvidia itself doesn't seem to be claiming that on-screen 4xMSAA quality is any different from what's in the screenshots. Then again, considering Nvidia seems to be channelling all their website relations through the world's most idiotic PR hacks, it's possible they've got it wrong.

As for the assertion that all the websites immediately recognized that Voodoo3's 22-bit post-filter was the real deal, Walt, your memory is as foggy as your eyesight. (You seem to be getting old! ;) ) The majority pronounced it an outright sham, PR trickery, something completely impossible. Read Kristof's series of articles investigating the 22-bit claim to see how the best and most technical site handled the issue; you'll also catch plenty of references to how poorly the rest of the web did.
 
For color compression to work (currently...), the same size frame buffer (for back buffers) is needed as if you were supersampling, or at least that is my understanding. The memory footprint represented for the GF FX may be no more than that, and I don't recall any sudden performance hits for the GF FX, not reflected on the 9700, that wouldn't be explained by the use of supersampling (and therefore the breakdown of the color compression) but would instead indicate the front buffer was a larger size. Maybe I have to go back and look at the benchmarks again...

I'm not sure how the color compression works in the details... but perhaps there is an explanation in the forums somewhere... hmm. I'd guess the simplest implementation would be a 1-bit mask for each pixel for "all values duplicate"/"all values distinct", cached for 64-bit burst writing, but it's probably something more effective.
Something is tickling my mind, like some ATI person either gave a hint about this or was asked and infuriatingly didn't give a hint... :-? (:arrow: off to do a search)
 
Demalion-

You're right; the apparent memory footprint difference in that benchmark is probably due to NV30's different method of color compression from R300. Or something else. In any case, thinking about it I'm not sure it matters where you blend the sub-pixel samples (i.e. in framebuffer or post-framebuffer) w.r.t. total memory footprint.
 
I only meant that the GF FX using the RAMDAC for AA processing required a larger front buffer (i.e., same size as the back buffer) for the RAMDAC to sample from.

I.e., for other cards (like the 9700), for 2xAA you'd have 2x the RAM taken up for backbuffers (what was rendered to). However, they'd only need 1x the RAM for the front buffer (as they perform their AA blend by copying it to the front buffer).

For the RAMDAC to perform the blend, the front buffer allocation size would still have to be 2x, trading an additional 1x RAM storage of the front buffer for circumventing the bandwidth/latency cost of an additional 1x RAM write + 1x RAM read:

blend AA = blend + ramdac
blend = 2x read (from back buffer) + 1x write (to front buffer) + latency effect of GPU performing blend (assumes GPU could be writing something else instead)
ramdac = 1x read (from front buffer)
storage = 1x RAM used (for front buffer)

postfilter AA = ramdac
ramdac = 2x read (from front buffer)
storage = 2x RAM used (for front buffer)

Difference between blend and postfilter AA:

blend AA - postfilter AA = 1x write + 1x read + latency effect of GPU performing blend - 1x RAM used


This (a memory cost of (n-1) * number of pixels * depth bits) seems the likely reason why the GF FX is limited to a 2x "post filter" blend... at least on the 128MB cards.
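
A quick footprint sketch (Python; the resolution, 32-bit depth, and sample counts are my own example numbers, not confirmed figures):

def extra_front_buffer_mb(width, height, bytes_per_pixel, n_samples):
    # A post filter blend keeps n_samples front buffers instead of 1,
    # so the extra RAM cost is (n - 1) buffers' worth.
    return (n_samples - 1) * width * height * bytes_per_pixel / 1e6

print(extra_front_buffer_mb(1600, 1200, 4, 2))  # ~7.7 MB extra for 2x
print(extra_front_buffer_mb(1600, 1200, 4, 4))  # ~23 MB extra for 4x

At 4x, that extra front buffer RAM starts to hurt on a 128MB card, which would fit with only doing it for 2x and Quincunx.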

EDIT: It also just entered my head that both 2x reads above could benefit from color compression.

All this is AFAICS, and might be prone to correction. :p
 
demalion-

Thanks for clearing that up. :)

Reading your explanation, it strikes me that doing the blend in a post filter actually seems like a much smarter method, particularly for a comparatively bandwidth-starved card like the GFfx. Having never thought it through before, I didn't realize that whereas without AA you can flip the back and front buffers simply by changing a pointer (I think), with the way most cards do AA you need to read the entire contents of the backbuffer (one copy of the framebuffer for each subsample) back to the GPU and then write the blended framebuffer out to the frontbuffer! (If I understand you correctly.)

To put some numbers on this, let's say on some game the 9700 Pro gets 60fps at 1600x1200 with 4xMSAA. Each frame you need to read back the 4 1600x1200 backbuffers (30.72 MB ignoring color compression) and write 1 1600x1200 frontbuffer (7.68 MB). At full bandwidth utilization, this costs you 1.94 ms every frame. If you didn't have to pay this blending cost, instead of 60fps, you'd be getting 68fps. I'm not sure how color compression comes into this, but even if it managed to perfectly compress the backbuffer 4:1, you're still losing 3.2fps.
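
(Checking my arithmetic with a quick Python script--the ~19.8 GB/s is my assumed peak bandwidth for a 9700 Pro, 310 MHz DDR on a 256-bit bus:)

bandwidth = 19.8e9                    # bytes/sec, assumed 9700 Pro peak
buffer    = 1600 * 1200 * 4           # one 32-bit framebuffer: 7.68 MB
blend     = 4 * buffer + buffer       # read 4 backbuffers, write 1 frontbuffer
blend_s   = blend / bandwidth         # ~0.00194 s = 1.94 ms per frame
frame_s   = 1.0 / 60                  # 16.67 ms per frame at 60 fps
print(1.0 / (frame_s - blend_s))      # ~68 fps with the blend cost removed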

Presumably avoiding this sort of thing (of course the above example doesn't fully apply, because GFfx does most if not all blending in the framebuffer at 4xMSAA) is the reason GFfx, despite a much smaller bandwidth/fillrate ratio, actually takes a smaller hit for MSAA than the 9700. Frankly I think it's amazing Nvidia isn't touting post filter AA as a bandwidth saving feature. Of course, perhaps they'd rather emphasize 2xMSAA performance but pretend their advantage is due to something more than a clever bandwidth-saving trick.

Anyways, the extra memory footprint of the post filter AA technique would not appear to be behind the unusual benchmark scores I posted, because R300 at 4xMSAA should still have a larger memory footprint (4x backbuffer + 1x frontbuffer) than NV30 at 2xMSAA (2x back + 2x front), and yet the 9700 with 4xAA does not show the dropoff in performance that the GFfx with 2x does. Unless this is something to do with GFfx's color compression being specifically tuned for 4xMSAA and thus not helping enough/any at 2x??

Anyways, interesting stuff.

EDIT: Doh! Forgot to take into account the extra bandwidth usage between frontbuffer and RAMDAC if you haven't already consolidated your subsample buffers. (Even though you explicitly included it in your post!) Seems to me that, at any level of MSAA, the bandwidth savings for doing all blending in a post filter is always exactly 2x the size of each subsample buffer. OTOH, if GFfx were to do 4x MSAA by blending 3 subsample buffers on chip and doing the last blend in the post filter they would save...precisely no bandwidth. So I guess they really do only use the post filter for 2x and Quincunx.
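
To spell that accounting out (a toy Python tally, under my own assumption that the render rate equals the refresh rate; units are subsample-buffer transfers per displayed frame):

def traffic(n_samples, blended_on_chip):
    if blended_on_chip == 0:                  # pure post filter AA
        return n_samples                      # RAMDAC reads all n buffers
    gpu  = blended_on_chip + 1                # read k buffers, write 1 resolved
    scan = (n_samples - blended_on_chip) + 1  # RAMDAC reads resolved + leftovers
    return gpu + scan

print([traffic(2, k) for k in range(3)])      # [2, 4, 4]
print([traffic(4, k) for k in range(5)])      # [4, 6, 6, 6, 6]

Pure post filter saves exactly 2 buffers' worth of traffic at any AA level, and any partial blend costs the same as a full one.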
 
So you're postulating that this strange method of AA is a way for Nvidia to get around the bandwidth limitations of the GFFX?
 
Dave H said:
I didn't realize that whereas without AA you can flip the back and front buffers simply by changing a pointer (I think), with the way most cards do AA you need to read the entire contents of the backbuffer (one copy of the framebuffer for each subsample) back to the GPU and then write the blended framebuffer out to the frontbuffer! (If I understand you correctly.)

I don't think that's necessary.

Think about the following scheme (2x AA):

buffer A - Front buffer
buffer B - Back buffer part one - contains the 1st subsample for AA
buffer C - Back buffer part two - contains the 2nd subsample for AA

All three buffers are 1x size.
The following steps are taken:
1.) Render into B&C using multisampling
2.) Blend B&C into B
3.) Flip A with B.

The first obvious advantage is that there will be no tearing, because the blending is not targeted to the front buffer, and the flip could be VSync'd. (And it can be extended to triplebuffering.)

But the non-obvious advantage?

Let's consider the following simple frame-buffer compression: hold 1 extra bit for every pixel; if it's 0 then the 2 sub-samples are the same (stored in buffer B), if it's 1 then the 2 sub-samples are different (stored in B&C).
The blending has to be performed for "split pixels" only (bit = 1)!
This can turn out to be a big saving in the blending operation.
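
A toy sketch of that split-pixel blend (Python; the buffer size and data are made up, just to show the mechanism):

import numpy as np

rng = np.random.default_rng(1)
B = rng.integers(0, 256, size=(4, 4))   # back buffer part one: 1st subsamples
C = B.copy()                            # back buffer part two: 2nd subsamples
C[1, 2] = B[1, 2] ^ 1                   # force exactly one "split" pixel

split = (B != C)                        # the 1-extra-bit-per-pixel mask
B[split] = (B[split] + C[split]) // 2   # step 2: blend only the split pixels
print(split.sum(), "of", split.size, "pixels needed blending")  # 1 of 16

With mostly unsplit pixels (polygon interiors), the blend touches only a small fraction of the buffer.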
 
People see what they want to see, so subjective ideas about image quality are kind of irrelevant. People always said Nvidia's 2D sucked; then Leadtek produced cards with very sharp images. I personally love my Leadtek 4200, but someone else can love Matrox's or ATI's 2D (I know it is not really 2D, but that is what it used to be called, so the term means something to most).
 
demalion said:
blend AA = blend + ramdac
blend = 2x read (from back buffer) + 1x write (to front buffer) + latency effect of GPU performing blend (assumes GPU could be writing something else instead)
ramdac = 1x read (from front buffer)
storage = 1x RAM used (for front buffer)

postfilter AA = ramdac
ramdac = 2x read (from front buffer)
storage = 2x RAM used (for front buffer)

Difference between blend and postfilter AA:

blend AA - postfilter AA = 1x write + 1x read + latency effect of GPU performing blend - 1x RAM used

Not quite:

Blend AA (ignoring overhead)
blend = rendering_rate * (2x read + 1x write)
ramdac = refresh_rate * (1x read)

Postfilter AA (it has overhead too but ignoring it again)
ramdac = refresh_rate * (2x read)

BlendAA - PostfilterAA = rendering_rate * (2x read + 1x write) - refresh_rate * (1x read)

So Blend AA uses less bandwidth than Postfilter AA if
rendering_rate < refresh_rate / 3
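
Plugging in example numbers (a quick Python sketch of the formulas above; the 60 Hz refresh and buffer size are arbitrary):

buf = 1600 * 1200 * 4                       # bytes per buffer

def blend_aa(render_hz, refresh_hz):        # resolve in the framebuffer
    return render_hz * 3 * buf + refresh_hz * 1 * buf

def postfilter_aa(render_hz, refresh_hz):   # resolve in the RAMDAC
    return refresh_hz * 2 * buf

for fps in (15, 20, 30):
    print(fps, blend_aa(fps, 60) < postfilter_aa(fps, 60))
# True only below 60/3 = 20 fps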
 
Ohhhhhh.
Where were you before I repeated the error all over the place? :(

:LOL:

Is this the same for DVI? I'd think a digital standard like that would present the perfect opportunity to save bandwidth usage (if RAM capacity the size of the maximum screen display wasn't a problem for the monitor manufacturer...).

But then, hardware cursors might confuse things if they are enabled during 3D operations (I don't think overlay is) unless DVI had some funky hardware cursor channel.

EDIT: Oh well, the specs don't seem to indicate any type of monitor buffer, and that would probably have monitor manufacturers crying murder unless RAM started growing on trees. Why would they add cost to help a video card bottleneck that would not reflect on them?
 
If you're using 2x FSAA and your rendering rate is less than a third of your refresh rate, you're probably being limited by something other than the bandwidth (unless you're using an ungodly high refresh rate).
 
Hyp-X said:
demalion said:
blend AA = blend + ramdac
blend = 2x read (from back buffer) + 1x write (to front buffer) + latency effect of GPU performing blend (assumes GPU could be writing something else instead)
ramdac = 1x read (from front buffer)
storage = 1x RAM used (for front buffer)

postfilter AA = ramdac
ramdac = 2x read (from front buffer)
storage = 2x RAM used (for front buffer)

Difference between blend and postfilter AA:

blend AA - postfilter AA = 1x write + 1x read + latency effect of GPU performing blend - 1x RAM used

Not quite:

Blend AA (ignoring overhead)
blend = rendering_rate * (2x read + 1x write)
ramdac = refresh_rate * (1x read)

Postfilter AA (it has overhead too but ignoring it again)
ramdac = refresh_rate * (2x read)

Is "overhead" in delays or RAM storage? Or something else?

If delays, do you mean there is a delay when calculating the blend? I'd be confused by that, as I'd expect dedicated, but simple, hardware in the design for that, which would sit idle at other times.

For Blend AA, what about the time the back buffer is blocked for writes during a blend? You removed that type of consideration from the equation AFAICS....do you presume triple buffering?

BlendAA - PostfilterAA = rendering_rate * (2x read + 1x write) - refresh_rate * (1x read)

So Blend AA uses less bandwith then Postfilter AA if
rendering_rate < refresh_rate / 3

Which would be true sometimes for 2x AA at low fps... so perhaps not so smart to use the post filter (at least for more than 1x read filtering) for NV31/NV34 sometimes? I could easily see having 85 Hz at 1280x1024, and a NV31 or NV34 running at 28 fps or lower at that resolution (my monitor is easily capable of that and it is fairly old now).

Perhaps I have to think about it a bit more.
 
demalion said:
Is "overhead" in delays or RAM storage? Or something else?

Well you said "latency effect of GPU performing blend (assumes GPU could be writing something else instead)".
I mean it just like that.
The RAMDAC reading the memory can cause delays in parallel with bandwidth-limited rendering.

If delays, do you mean there is a delay when calculating the blend? I'd be confused by that as I'd expect having dedicated, but simple, hardware in the design for that that would sit idle at other times.

For Blend AA, what about the time the back buffer is blocked for writes during a blend? You removed that type of consideration from the equation AFAICS....do you presume triple buffering?

I presumed that blending is performed with the fastest rate the memory bus can handle, so there's no point in trying to perform anything parallel to it.
 