R350/NV35 Z Fillrate with FSAA

I did some testing using zeckensack's incredible synthetic benchmark.

Here is a quote of the description of the tests I used (run on a Ti-4400 with the 44.03 drivers):

This is a series of tests measuring raw, untextured pixel fillrate. Each of the tests in this category is performed by repeatedly drawing full-screen-sized quads, with varying buffer masks and depth test/stencil test combinations. A complete test series will be done in 32bpp. Portions of the test (everything that doesn't require stenciling) will be repeated in 16bpp. This is done both to accommodate architectures that are severely bandwidth constrained in 32bpp, and for architectures that can't activate full screen anti-aliasing in 16bpp modes.
The first three subtests measure the rate at which color-only, z-only, and pixels composed of both color and z can be produced and written to the frame buffer. There is no depth testing, but there are depth writes for the second and third subtests. [...] No buffer clears are performed during measurements.
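
The method is simple enough to sketch. Below is a rough illustration (mine, not zeckensack's actual code) in Python with PyOpenGL/GLUT: draw full-screen quads in a loop under different color/depth write masks and convert the elapsed time into Mpixels/s. The window size, frame count, and the GL_ALWAYS depth-func trick are my assumptions.

Code:
import time
from OpenGL.GL import *
from OpenGL.GLUT import *

FRAMES = 200  # quads drawn per subtest (arbitrary)

def fullscreen_quad():
    # With identity matrices these are clip coordinates,
    # so the quad exactly covers the screen.
    glBegin(GL_QUADS)
    glVertex3f(-1.0, -1.0, 0.5)
    glVertex3f( 1.0, -1.0, 0.5)
    glVertex3f( 1.0,  1.0, 0.5)
    glVertex3f(-1.0,  1.0, 0.5)
    glEnd()

def subtest(name, write_color, write_depth):
    glColorMask(write_color, write_color, write_color, write_color)
    glDepthMask(write_depth)
    if write_depth:
        # "No depth testing, but depth writes": GL only writes depth while
        # the depth test is enabled, so pass every pixel unconditionally.
        glEnable(GL_DEPTH_TEST)
        glDepthFunc(GL_ALWAYS)
    else:
        glDisable(GL_DEPTH_TEST)
    glFinish()  # drain the pipeline before timing
    start = time.perf_counter()
    for _ in range(FRAMES):  # no buffer clears during measurement
        fullscreen_quad()
    glFinish()
    elapsed = time.perf_counter() - start
    pixels = FRAMES * glutGet(GLUT_WINDOW_WIDTH) * glutGet(GLUT_WINDOW_HEIGHT)
    print("%s: %.0f Mpixels/s" % (name, pixels / elapsed / 1e6))

def display():
    subtest("color only", GL_TRUE,  GL_FALSE)
    subtest("z only",     GL_FALSE, GL_TRUE)
    subtest("color + z",  GL_TRUE,  GL_TRUE)
    glutSwapBuffers()

glutInit()
glutInitDisplayMode(GLUT_RGBA | GLUT_DOUBLE | GLUT_DEPTH)
glutInitWindowSize(1024, 768)
glutCreateWindow(b"fillrate sketch")
glutDisplayFunc(display)
glutMainLoop()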

[attached image: ti4400.gif — fillrate results for the Ti-4400]


~900 Mzixels/s at 4x MSAA means ~3600 M written subzixels/s, which works out to about 13 written 16-bit subzixels per clock. Not that far away from the theoretical 16.
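
A quick back-of-the-envelope check of that figure (assuming the Ti-4400's standard 275 MHz core clock):

Code:
measured_mzixels = 900   # ~Mzixels/s at 4x MSAA, 16 bit (from the graph)
subsamples = 4           # 4x MSAA: 4 subzixels written per zixel
core_clock_mhz = 275     # GeForce4 Ti-4400 core clock

subzixel_rate = measured_mzixels * subsamples   # ~3600 M/s
per_clock = subzixel_rate / core_clock_mhz      # ~13.1
print("%.0f M subzixels/s, %.1f per clock (theoretical peak: 16)"
      % (subzixel_rate, per_clock))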
 
OpenGL guy said:
They also claimed that 4x AA was free on the NV3x, and this hasn't been shown to be the case. In fact, the results shown in this thread show that only 2x AA is "free" on the NV3x.
So 4x AA is "free" on NV25, at least with 16-bit rendering. I don't know if a 32-bit ROP can be split into two 16-bit ROPs, or if bandwidth is the only limiting factor with 32-bit rendering. NV30 can calculate 8 "zexels" per clock; the chip has 16 ROPs, so 2x AA is also "free".

NV25 can only calculate 4 z-only pixels per clock; with 2x or 4x there is no serious new drop, at least with 16-bit rendering. With textured pixels, NV30's 4x AA is "essentially" free as far as I know.
 
aths said:
OpenGL guy said:
They also claimed that 4x AA was free on the NV3x, and this hasn't been shown to be the case. In fact, the results shown in this thread show that only 2x AA is "free" on the NV3x.
So 4x AA is "free" on NV25, at least with 16-bit rendering. I don't know if a 32-bit ROP can be split into two 16-bit ROPs, or if bandwidth is the only limiting factor with 32-bit rendering. NV30 can calculate 8 "zexels" per clock; the chip has 16 ROPs, so 2x AA is also "free".

NV25 can only calculate 4 z-only pixels per clock; with 2x or 4x there is no serious new drop, at least with 16-bit rendering.
2x AA isn't even free at 16-bit color; in fact, it's slower than 32-bit.
aths said:
With textured pixels, NV30's 4x AA is "essentially" free as far as I know.
From Dave's post:
Code:
FFP - Pure fillrate         1772.39  1743.24  1686.74   878.64 
FFP - Z pixel rate          3373.76  3148.66  1652.25  1571.46 
FFP - Single texture        1666.31  1203.08  1097.91   734.83
Free in fillrate, maybe, but not free overall. Remember, the Z rejection rate is important with and without texturing.
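
One way to read Dave's numbers is as throughput retained relative to the leftmost column. I'm assuming the columns run from no AA up through the higher AA modes; the excerpt above doesn't label them, so treat that assignment as an assumption.

Code:
rows = {
    "Pure fillrate":  [1772.39, 1743.24, 1686.74, 878.64],
    "Z pixel rate":   [3373.76, 3148.66, 1652.25, 1571.46],
    "Single texture": [1666.31, 1203.08, 1097.91, 734.83],
}
for name, vals in rows.items():
    # percentage of the first column's throughput retained in each mode
    print("%-15s" % name,
          " ".join("%4.0f%%" % (100 * v / vals[0]) for v in vals))

Whatever the exact modes, the Z pixel rate halves by the third column while pure fillrate barely moves until the last one, which is the pattern behind "only 2x AA is free".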
 
OpenGL guy said:
Free in fillrate, maybe, but not free overall. Remember, the Z rejection rate is important with and without texturing.
With 45.23, nVidia disabled early Z occlusion for NV25, at least in D3D. Anyway, it was never enabled by default, AFAIK.
 
Ailuros said:
Are you familiar with the term redundancy at all?

How much memory is it? Even if redundant, if it's small it's small. Besides, we're looking at 64MB per chip, which even per chip would have stayed fantastic for some time.

As I said Tilers had loopback functions way before that and nowadays chips have advanced even beyond that. As for bandwidth I'd rather have a very effective bandwidth saving technique (if not a full TBDR str8 away) with 10GB/sec raw bandwidth, than 30GB/sec raw bandwidth with no advanced bandwidth saving techniques at all.

GeForce3's bandwidth savings weren't that insanely great. What, +20%-30% efficiency over GeForce2?

Fixed function what on NV3x's? LOL :D What on God's green earth do we have Vertex Shaders for nowadays?

Well excuuuuuuuuse me, I'm just referring to the fact that NV3x has shown off some hellishly fast scores in 3DMark fixed function TCL tests, especially multiple lights, despite being about on par with R3x0 in vertex shaders.

I love it when people try to elegantly flip out of it......

o...k...

With the only other difference that you needed two chips to achieve what a single-chip TBDR was able to. KYRO2 was capable of 4xOGSS up to 1024*768 and 2x Vertical up to 1280*1024; where the notion that the K2 flips back to 2x at 1024 comes from is beyond me. Shall I scratch another one on the scoreboard, milady ;)

My mistake, I feel stupid now, having previously OWNED a K2.

All the games I tried at 1024 + 4x weren't particularly playable though, even to my standards. And I couldn't even play Quake3, half the textures were replaced with placeholders, looked like a Star Trek TNG holodeck. o_O Works great on my Radeon though. Mrr. Anyway, 4x at 1024 wasn't playable on my K2. Period.

Recursive texturing and an intelligent cache make Rampage do AF with a much lower bandwidth hit than usual, plus the TMU's are capable of trilinear per tick, unlike NV2x.

With the only other difference that it hardly deserves the term AF in the first place. Did you know that Xabre has antialiasing for free? 1x AA over the whole scene, which in that sense is effectively what I could call AA for free too.

I was referring to REAL AF that time, i.e. taking more real samples of the textures. Almost identical to GeForce3's implementation.

There's more aggregate texture bandwidth available, each TMU can fetch a trilinear sample every clock cycle, intelligent texture cache with recursive texturing... 8)

In wild imaginations anything is possible.

Wild imaginations = good for engineering.

I pulled that from memory; oh well, my point was that I know they aren't truly 256-bit either.

Gee my oh my.... :rolleyes:

Buh? It was said earlier that only P10 and Parhelia have true 256-bit memory.
 
[quote="Tagrineth]Buh? It was said earlier that only P10 and Parhelia have true 256-bit memory.[/quote]
And no one agrees with you.
 
GeForce3's bandwidth savings weren't that insanely great. What, +20%-30% efficiency over GeForce2?

That doesn't mean that Spectre wouldn't have been in an even better position with an advanced bandwidth saving technique on. Who knows, there might not even have been a need for a multichip config after all, hm? Would you still have seen the need for multichip configs in Fear, and if yes, why not?

All the games I tried at 1024 + 4x weren't particularly playable though, even to my standards. And I couldn't even play Quake3, half the textures were replaced with placeholders, looked like a Star Trek TNG holodeck. o_O Works great on my Radeon though. Mrr. Anyway, 4x at 1024 wasn't playable on my K2. Period.

While 4xAA on the V5, especially in q3a, was playable or what? Playable or not, it doesn't change the fact that in 32bpp the K2 managed to yield roughly similar performance to the two chips on the V5.

There's more aggregate texture bandwidth available, each TMU can fetch a trilinear sample every clock cycle, intelligent texture cache with recursive texturing...

And? That's one theoretical advantage, while NV20 held its own theoretical advantages against it too. The only difference being that NV20's real-world numbers are extensively known, and you've nothing but theory to base anything on.

Wild imaginations = good for engineering.

Neither of us is an engineer, last time I checked. And 3dfx's "wild imagination" stopped at the V5 as far as real products go.
 
Hmm...

It seems to me that it can fairly be said that the Parhelia and R3xx cores are "256-bit chips", and that the associated highest spec products have "256-bit buses".

It also seems to me that two 128-bit chips can be said to be operating "as a 256-bit bus" if the 128-bit datapaths are utilized without effective redundancy in some type of data transfer. However, it can also be said that there is not a "256-bit bus" with the associated challenges of trace layout to terminate all the paths for the 256 bits at the same place at one "end" in this instance.

AFAICS, a lot of this discussion is a dispute over people meaning different things from the above, when there are several valid ways of looking at things. :?: Of course, correct me if I'm wrong.
 
AFAICS, a lot of this discussion is a dispute over people meaning different things from the above, when there are several valid ways of looking at things. Of course, correct me if I'm wrong.

I wouldn't say that it needs any corrections.
 
OpenGL guy said:
Tagrineth said:
Buh? It was said earlier that only P10 and Parhelia have true 256-bit memory.
And no one agrees with you.

Feh, what is this, the Attack Tagrineth thread?

I didn't say that. I said "IT WAS SAID", not "I said".

And I agree on the semantics comment.

Ailuros said:
That doesn't mean that Spectre wouldn't have been in an even better position with an advanced bandwidth saving technique on. Who knows, there might not even have been a need for a multichip config after all, hm? Would you still have seen the need for multichip configs in Fear, and if yes, why not?

Of course, Spectre would've benefitted greatly from more bandwidth savings. Anything would =)

And I think... after Rampage, multichip would of course be helpful and offer horrifying speed advantages, but I don't know if it would've been necessary anymore. It all depends, maybe nVidia would've been more aggressive with NV25 had Rampage appeared, resulting in 3dfx deciding to go ahead with dual-Fear? It all depends on how much performance 3dfx needed, and how much they got out of one chip. Personally, though, I believe that Fear would've been faster than GF4Ti as we know it, in single chip, and 3dfx probably would've passed on dual if that was the case.

While 4xAA on the V5, especially in q3a, was playable or what? Playable or not, it doesn't change the fact that in 32bpp the K2 managed to yield roughly similar performance to the two chips on the V5.

Of course. K2 is also a deferred renderer, though, while V5 is a very solid example of pure brute force. =) AFAIK V5 has few to no bandwidth savings of any kind. Oh, and yes, 4x AA at 800x600 in Q3A is totally playable on a V5.

And? That's one theoretical advantage, while NV20 held its own theoretical advantages against it too. The only difference being that NV20's real-world numbers are extensively known, and you've nothing but theory to base anything on.

Yeah, that's true enough. I suppose in practice, anything could've happened. There's also one more questionable element that we should consider, 3dfx's retarded-toward-the-end driver department... I wonder if they could've pulled their collective heads out of their asses in time to make a workable Rampage driver?

Neither of us is an engineer, last time I checked. And 3dfx's "wild imagination" stopped at the V5 as far as real products go.

True, but we're discussing engineering stuff and I said that in terms of 3dfx's designing the feature and possibly using it.
 
Personally, though, I believe that Fear would've been faster than GF4Ti as we know it, in single chip, and 3dfx probably would've passed on dual if that was the case.

Considering the prices and availability of RAM (always hypothetical) in the projected release period, I doubt it would have been clocked higher than 250-275MHz.

In terms of dx7 applications alone, I don't believe that even a Kyro3 at 250MHz would have been able to outperform a Ti4600. With a respin for 300MHz (since it was on 0.13 µm), probably yes. Those prototypes do exist, but it's more likely that an ImgTec employee will sell his mother to you than give you anything reliable on it.

I wonder if they could've pulled their collective heads out of their asses in time to make a workable Rampage driver?

Two choices: either release it with a completely bug-infested driver, or delay.

True, but we're discussing engineering stuff and I said that in terms of 3dfx's designing the feature and possibly using it.

What, that HSR thingy? It was a dumb idea from the very beginning. If it had been such a smart idea, I don't see a single reason why NVIDIA wouldn't have made use of it, along with all the other tidbits that made it into NV25 and beyond, since the gentleman who originally fooled around with it has actually worked there since 3dfx's demise.
 
Ailuros said:
Personally, though, I believe that Fear would've been faster than GF4Ti as we know it, in single chip, and 3dfx probably would've passed on dual if that was the case.

Considering the prices and availability of RAM (always hypothetical) in the projected release period, I doubt it would have been clocked higher than 250-275MHz.

In terms of dx7 applications alone, I don't believe that even a Kyro3 at 250MHz would have been able to outperform a Ti4600. With a respin for 300MHz (since it was on 0.13 µm), probably yes. Those prototypes do exist, but it's more likely that an ImgTec employee will sell his mother to you than give you anything reliable on it.

Why wouldn't a K3 beat a Ti4600? K2 at 175MHz was about level with GF2 GTS, sometimes even approaching the Ultra... that's two pipelines with SDR versus 4 pipelines with 2 TMU's each with DDR, all at a higher clock speed.
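
For what it's worth, the paper specs behind that comparison look like this (a sketch using the commonly cited clocks, K2 at 175MHz and GF2 GTS at 200MHz core):

Code:
def peaks(pipes, tmus_per_pipe, core_mhz):
    # theoretical peak Mpixels/s and Mtexels/s from the pipeline layout
    return pipes * core_mhz, pipes * tmus_per_pipe * core_mhz

k2_pix, k2_tex   = peaks(2, 1, 175)  # KYRO II: 2 pipes, 1 TMU each, SDR
gts_pix, gts_tex = peaks(4, 2, 200)  # GF2 GTS: 4 pipes, 2 TMUs each, DDR
print("KYRO II: %d Mpix/s, %d Mtex/s" % (k2_pix, k2_tex))
print("GF2 GTS: %d Mpix/s, %d Mtex/s" % (gts_pix, gts_tex))

With q3a's roughly 3x average overdraw (a figure Ailuros cites further down), a deferred renderer that only shades visible pixels closes much of that raw 350-vs-1600 gap.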

I wonder if they could've pulled their collective heads out of their asses in time to make a workable Rampage driver?

Two choices: either release it with a completely bug-infested driver, or delay.

Not necessarily, now that I think about it. The awful state of V5's driver could've been caused by focusing more on preparing Rampage's driver. But now we're getting to the "Baseless Speculation" point and I don't see a point in continuing this issue...

What, that HSR thingy? It was a dumb idea from the very beginning. If it had been such a smart idea, I don't see a single reason why NVIDIA wouldn't have made use of it, along with all the other tidbits that made it into NV25 and beyond, since the gentleman who originally fooled around with it has actually worked there since 3dfx's demise.

Simple, that HSR thingy could only get rid of ~30-50% tops, and that with a small visual error... any more removal and you get massive problems. People with V5's have -proved- that with relatively simple tweaking you can get no visible error and still gain significant performance with the feature as implemented in the 1.04.01 beta. nVidia never used it because I get the feeling GF3's Early Z Reject has similar removal without the errors.
 
Why wouldn't a K3 beat a Ti4600? K2 at 175MHz was about level with GF2 GTS, sometimes even approaching the Ultra... that's two pipelines with SDR versus 4 pipelines with 2 TMU's each with DDR, all at a higher clock speed.

Because you're way too optimistic about vaporware, and that's what this whole debate is about. The K2 was a GF2MX competitor, embarrassing higher GF2 models on occasion, yet apart from the MX it wasn't outperforming anything else overall. 1 TMU per pipe, by the way.

Simple, that HSR thingy could only get rid of ~30-50% tops, and that with a small visual error... any more removal and you get massive problems. People with V5's have -proved- that with relatively simple tweaking you can get no visible error and still gain significant performance with the feature as implemented in the 1.04.01 beta. nVidia never used it because I get the feeling GF3's Early Z Reject has similar removal without the errors.

a) Early Z works ideally only in applications that render a Z-only pass first. And that's not the only place where the NV2x gains performance; the combination of them all is what really makes the difference. Similar removal on NV2x? ROFL, ok, anything you say.... :LOL:

b) We're talking about an illegitimate method of dropping geometry here. Hasn't it puzzled you one bit why, say, PowerVR hasn't resorted all these years to similar stupidity, when they have full hardware order-independent HSR? Or does it escape your memory that the humble KYRO has a total of 32 Z units?

c) Yes, the "tweak" would be to cap the framerate at 60fps maximum, and that's what the idea was all about: when we can't maintain 60Hz because the rendering takes too long, let's drop some geometry. In other words, let's just violate the game's BSP, because we didn't have a sufficient method to counter the competition at the time... who are we kidding here? If Spectre was really going to be such an uber-card as you guys want us to believe, then why the heck did it need to resort to such methods in the first place? Either do it properly up front and add some sort of occlusion culling or hierarchical Z, or if you have to cheat then use clipping planes str8 away for the timedemos *shrugs*

Needless to say, if they had been so confident that they were that far ahead of NVIDIA, they wouldn't even have considered a dual-chip solution for Spectre in the first place; they intended to abandon multichip solutions past Spectre if all had gone according to plan. The primary reason being that multichip increases costs significantly and is far harder to sell, let alone scale down in pricing as time goes by.

***edit: I remember the "small visual errors" very well in some maps; "a mess" would describe it better, even with the 60fps cap and that conservative "tiling" (for heaven's sake...) on....
 
Simple, that HSR thingy could only get rid of ~30-50% tops, and that with a small visual error... any more removal and you get massive problems. People with V5's have -proved- that with relatively simple tweaking you can get no visible error and still gain significant performance with the feature as implemented in the 1.04.01 beta. nVidia never used it because I get the feeling GF3's Early Z Reject has similar removal without the errors.
Strange that people even try to talk this feature up, after the enormous bashing nVidia has received for 'cheating'. ;)
 
Ailuros said:
Why wouldn't a K3 beat a Ti4600? K2 at 175MHz was about level with GF2 GTS, sometimes even approaching the Ultra... that's two pipelines with SDR versus 4 pipelines with 2 TMU's each with DDR, all at a higher clock speed.

Because you're way too optimistic about vaporware, and that's what this whole debate is about. The K2 was a GF2MX competitor, embarrassing higher GF2 models on occasion, yet apart from the MX it wasn't outperforming anything else overall. 1 TMU per pipe, by the way.

I didn't say the K2 had two TMU's per pipe, but I should've explicitly stated it had one each like I did for GF2.

a) Early Z works ideally only in applications that render a Z-only pass first. And that's not the only place where the NV2x gains performance; the combination of them all is what really makes the difference. Similar removal on NV2x? ROFL, ok, anything you say.... :LOL:

Similar to 30-50%? Or are you suggesting GF3 doesn't even remove 30% of occluded pixels? Or even more than 50%?

b) We're talking about an illegitimate method of dropping geometry here. Hasn't it puzzled you one bit why, say, PowerVR hasn't resorted all these years to similar stupidity, when they have full hardware order-independent HSR? Or does it escape your memory that the humble KYRO has a total of 32 Z units?

You answered your own question wrt PowerVR. Their occlusion culling works perfectly more or less all the time. As a side note, K2 has another advantage in having trilinear TMU's...

c) Yes, the "tweak" would be to cap the framerate at 60fps maximum, and that's what the idea was all about: when we can't maintain 60Hz because the rendering takes too long, let's drop some geometry. In other words, let's just violate the game's BSP, because we didn't have a sufficient method to counter the competition at the time... who are we kidding here? If Spectre was really going to be such an uber-card as you guys want us to believe, then why the heck did it need to resort to such methods in the first place? Either do it properly up front and add some sort of occlusion culling or hierarchical Z, or if you have to cheat then use clipping planes str8 away for the timedemos *shrugs*

Amusingly enough though, it worked in controlled cases.

Needless to say, if they had been so confident that they were that far ahead of NVIDIA, they wouldn't even have considered a dual-chip solution for Spectre in the first place; they intended to abandon multichip solutions past Spectre if all had gone according to plan. The primary reason being that multichip increases costs significantly and is far harder to sell, let alone scale down in pricing as time goes by.

Now we're drifting into politics. Rampage's actual design philosophy was advanced features and quality over speed; that's why they were going to do dual-chip. Hell, it still only had one TMU per pipeline, though they were trilinear-capable.

***edit: I remember the "small visual errors" very well in some maps; "a mess" would describe it better, even with the 60fps cap and that conservative "tiling" (for heaven's sake...) on....

Personally, I could never get it to work at all on my V5 while I was still using it.

But people at x3dfx have definitely been able to go from "unplayable" to "playable" (say, 40fps to 60fps, or 25fps to 40fps) with conservative settings. I know you find it hard to believe, but it has been done.
 
I didn't say the K2 had two TMU's per pipe, but I should've explicitly stated it had one each like I did for GF2.

All architectures up to Series4 had just one TMU AFAIK. I meant KYROIII or STG5500 if you prefer.

Similar to 30-50%? Or are you suggesting GF3 doesn't even remove 30% of occluded pixels? Or even more than 50%?

Just early Z, in an application without a Z-only first pass, isn't even going to reach 30%. Q3a in question has an average overdraw of 3.39, which early Z alone reduces to 3.17. On NV2x and upwards you also have to account for the memory controller, caching and whatnot, since q3a benefits quite a bit from memory optimisations. Otherwise, riddle me why P4s with their quad-pumped bus are the q3a engine's favourites.
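
Those overdraw figures put a hard ceiling on what early Z alone buys there:

Code:
overdraw_plain  = 3.39  # average q3a overdraw, as quoted
overdraw_earlyz = 3.17  # with early Z alone

saved = 1 - overdraw_earlyz / overdraw_plain
print("early Z alone removes %.1f%% of the pixel work" % (100 * saved))  # ~6.5%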

You answered your own question wrt PowerVR. Their occlusion culling works perfectly more or less all the time. As a side note, K2 has another advantage in having trilinear TMU's...

Even a TBDR will be bound by the engine's BSP and its downsides. Granted, since the game is completely fillrate limited, a K2 benefits a lot in fillrate, but it can't completely overcome the idiotic behaviour of the BSP. There's a reason why more and more games resort to portal rendering these days, even if the manual implementation costs the developer quite a bit more time.

I don't recall KYRO having trilinear TMUs. Its performance hit with pure trilinear on, compared to bilinear, was quite large. However, with texture compression enabled it used to fetch 16 samples from one mipmap on the fly, which minimized the performance penalty over plain bilinear. I used bilinear/2xVertical for most games when I had it.


Amusingly enough though, it worked in controlled cases.

So do cheats also. Your point?

Now we're drifting into politics. Rampage's actual design philosophy was advanced features and quality over speed; that's why they were going to do dual-chip. Hell, it still only had one TMU per pipeline, though they were trilinear-capable.

What's the deal with the number of TMUs per pipe anyway? With loopback functions, a secondary TMU per pipe would have been redundant back then. Plus, the dual-chip model with 16 TMUs would most likely have run out of bandwidth, and the read efficiency would have dropped into the gutter.
But people at x3dfx have definitely been able to go from "unplayable" to "playable" (say, 40fps to 60fps, or 25fps to 40fps) with conservative settings. I know you find it hard to believe, but it has been done.

I tried it both on the V3 and the V5. Both (the second to a lesser degree) used to produce artifacts in certain spots; you just had to know where to look, and that with the most conservative settings. It's the same old story with just about everything; just like the K2 showed some annoying poly gaps with supersampling turned on, though not many were able to see them. They were mostly visible in dm7 of q3a and in a couple of racing sims. It's not a rare occurrence anyway; if the gaps are white they're just more apparent.
 
Ailuros said:
I didn't say the K2 had two TMU's per pipe, but I should've explicitly stated it had one each like I did for GF2.

All architectures up to Series4 had just one TMU AFAIK. I meant KYROIII or STG5500 if you prefer.
K2 had (still has) 2 texturing pipelines, each of which could apply one texture per clock.
I don't recall KYRO having trilinear TMUs.
Correct, in the normal sense. It did have a "pseudo-trilinear", for compressed textures only, that computed a lower MIP map level on-the-fly.
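
To make that concrete, here is a rough numpy sketch (my own illustration, not PowerVR's actual logic, and ignoring the compressed-texture aspect) of trilinear-style filtering from a single stored mip level. Real hardware would only box-filter the 2x2 blocks under the bilinear footprint rather than build the whole level, which is where Ailuros's "16 samples from one mipmap" figure above comes from.

Code:
import numpy as np

def bilinear(tex, u, v):
    # Bilinear lookup at continuous texel coordinates (u, v).
    h, w = tex.shape
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    fx, fy = u - x0, v - y0
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    top = tex[y0, x0] * (1 - fx) + tex[y0, x1] * fx
    bot = tex[y1, x0] * (1 - fx) + tex[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def pseudo_trilinear(mip, u, v, lod_frac):
    # Sample the stored level directly (4 texels)...
    upper = bilinear(mip, u, v)
    # ...then synthesize the next-lower level by 2x2 box filtering,
    # instead of reading a second stored mip.
    h, w = mip.shape
    lower_level = mip.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    lower = bilinear(lower_level, u / 2, v / 2)
    # Blend between the two levels by the fractional LOD, as trilinear does.
    return upper * (1 - lod_frac) + lower * lod_frac

tex = np.random.rand(8, 8).astype(np.float32)
print(pseudo_trilinear(tex, 3.3, 2.7, 0.5))
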
just like the K2 showed some annoying poly gaps with supersampling turned on, though not many were able to see them.
What? That doesn't make sense. The only way, AFAICS, that there could be gaps is if there were holes in the original model, in which case they'd also show up when AA was off (although only ~1/4 as often).
 
K2 had (still has) 2 texturing pipelines, each of which could apply one texture per clock.

2*1 for K2
4*1 for K3

Where's my mistake there? I said TMUs not pipelines.

Correct, in the normal sense. It did have a "pseudo-trilinear", for compressed textures only, that computed a lower MIP map level on-the-fly.

Whole paragraph from my former post:

I don't recall KYRO having trilinear TMUs. Its performance hit with pure trilinear on, compared to bilinear, was quite large. However, with texture compression enabled it used to fetch 16 samples from one mipmap on the fly, which minimized the performance penalty over plain bilinear. I used bilinear/2xVertical for most games when I had it.

Could be I phrased something badly in the bolded part, but where was I wrong there?

What? That doesn't make sense. The only way, AFAICS, that there could be gaps is if there were holes in the original model, in which case they'd also show up when AA was off (although only ~1/4 as often).

I don't think we mean the same thing. It randomly appeared/disappeared in only a very few cases, and only with either 2xVertical or 2*2; horizontal didn't expose those. Cases I remember off the top of my head were fine white parallel lines in the far clipping distance on road textures (only in a few maps) in F1 2001 and NFS PU (randomly appearing and disappearing), and little white parallel and horizontal lines in a couple of spots in dx7 q3a. As I said, there were no such side effects with 2x Horizontal.
 
Ailuros said:
K2 had (still has) 2 texturing pipelines, each of which could apply one texture per clock.
2*1 for K2

Where's my mistake there? I said TMUs not pipelines.
Err but the total is still 2. You said one TMU so I just wanted to make sure there was no confusion.

Could be I phrased something badly in the bolded part, but where was I wrong there?
I didn't say you were wrong! In fact, I distinctly said "correct"! :? All I was doing was adding some extra information for the benefit of other readers.

What? That doesn't make sense. The only way, AFAICS, that there could be gaps is if there were holes in the original model, in which case they'd also show up when AA was off (although only ~1/4 as often).

I don't think we mean the same thing. It randomly appeared/disappeared in only a very few cases, and only with either 2xVertical or 2*2; horizontal didn't expose those. Cases I remember off the top of my head were fine white parallel lines in the far clipping distance on road textures (only in a few maps) in F1 2001 and NFS PU (randomly appearing and disappearing), and little white parallel and horizontal lines in a couple of spots in dx7 q3a. As I said, there were no such side effects with 2x Horizontal.
Odd. Very odd. There is no fundamental difference between horizontal and vertical filtering. The only thing I can think of is that maybe there are either
  • some T-junctions in their models that happen to fall between samples when the image is rendered at a low enough resolution, or
  • a feature in their far clipping code that rounds the wrong way so that a minuscule gap occurs.
 
There is no fundamental difference between horizontal and vertical filtering.

Not that it's related at all, but I did have the feeling that Vertical offset the LOD by -0.5 (estimated value), while Horizontal didn't. I did ask back then, but no one ever answered that question, so I'm rather uncertain about it. Speaking of which, is there any particular reason why PowerVR offers, for two samples, exclusive modes for either x or y (tile size maybe being a reason)?

Err but the total is still 2. You said one TMU so I just wanted to make sure there was no confusion.

I meant 1 TMU per pipeline. I had clarified it one post before the one you noticed, and missed making it clearer in the next one; my apologies.
 