Outstanding performance of the NV40 = old school 3dfx mojo?

At least if they were called hexas, octs, dodecs, hexadecs or icosas, there'd be less confusion with a rarely used primitive type.
 
Dio said:
At least if they were called hexas, octs, dodecs, hexadecs or icosas, there'd be less confusion with a rarely used primitive type.
Good idea, except that "octopipe" might be confused with a James Bond movie :)

Anyway, the whole "increasing the size of your quads" thing reminds me of an old math joke:
[joke]Written on a crazed mathematician's chalk board: 1 + 1 = 3 for very large values of 1.[/joke]
My corollary is that it's also true for very small values of 3. ;)
 
Ailuros said:
I can see an NV3x being merely twice as fast as a K2 in Fablemark, yet then again the latter already has as many Z/stencil units as the NV40 should have.
Keep in mind that Fablemark is also not optimized for an NV3x-style architecture. It does ambient lighting at the same time as the first z-pass, which destroys the NV3x's ability to accelerate that first z-pass. This is not the case with DOOM3, which will be one of the few game engines to use stencil shadow volumes.

There's a chance that IHVs will, in the future, leave AA to the ISVs, which I don't disagree with. But MSAA stays similarly efficient up to how many samples exactly? 4x, maybe 8x in a very generous case?

Over 100MB of buffer consumption alone for 6x MSAA.
Huh? I really don't understand what you're trying to say in that first paragraph. Anyway, as for the buffer consumption, high triangle densities can require similar amounts of buffer space, and I would still rather have fixed resource requirements than variable ones.

The presupposition there is a z-only pass.
That's up to software developers. As shaders get longer, I expect initial z-passes to become common. This will have a twofold benefit (see the sketch below):
1. It reduces the need to order primitives for rendering (front-to-back ordering still helps during the initial z-pass itself, but it no longer matters for the rest of rendering).
2. As shader lengths grow, the cost of doing an initial z-pass becomes significantly less than the cost of shading pixels that would be overwritten without one.
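As a minimal illustration of such an initial z-pass, here's a rough sketch against Direct3D 9 (the API of the era); the render states are real D3D9 states, but DrawScene() is a hypothetical placeholder for submitting the opaque geometry:
[code]
// Sketch of a z-only prepass followed by the full shading pass.
// DrawScene() is a hypothetical stand-in for submitting the scene's opaque geometry.
#include <d3d9.h>

void DrawScene(IDirect3DDevice9* device); // hypothetical, defined elsewhere

void RenderWithZPrepass(IDirect3DDevice9* device)
{
    // Pass 1: depth only. Colour writes off, so only the Z-buffer gets filled;
    // this is the cheap pass that NV3x/NV40-style hardware can accelerate.
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    device->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_LESSEQUAL);
    DrawScene(device);

    // Pass 2: full shading. With the Z-buffer complete, the Z-test rejects occluded
    // pixels before their (potentially long) pixel shaders ever run.
    device->SetRenderState(D3DRS_COLORWRITEENABLE,
                           D3DCOLORWRITEENABLE_RED | D3DCOLORWRITEENABLE_GREEN |
                           D3DCOLORWRITEENABLE_BLUE | D3DCOLORWRITEENABLE_ALPHA);
    device->SetRenderState(D3DRS_ZWRITEENABLE, FALSE);
    device->SetRenderState(D3DRS_ZFUNC, D3DCMP_EQUAL);
    DrawScene(device);
}
[/code]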

  • IMRs also don't need to overwrite values. See above.
No, they never actually do. Who are you kidding anyway?
Well, I suppose IMRs will need to overwrite values for blending, but they won't need to in most circumstances. So yes, I suppose blending is one inherent benefit of TBDRs that just can't be matched by an IMR, but I'm not sure it's enough...

Why, on the other hand, is it necessary to unify grids in future APIs? Maybe, just maybe, because there has to be a more efficient way to handle very long shaders coupled with very short shaders at the same time?
I get the feeling that you were really tired (or something) when you made this post. Regardless, the benefit of total decoupling would be this: with dedicated units, if rendering is vertex-limited during one part of the frame and pixel-limited during another, you end up wasting both pixel and vertex power.

But this would become a non-issue if some future architecture used the same pipelines for both pixel and vertex processing, as you'd never have pixel pipelines waiting for vertex pipelines, or vice-versa, since that pixel pipeline would just start working on vertex data instead.

The sad part is that without any real hardware it is, and will remain, a moot point. Nor is it that IMRs in general are supposedly near "perfection"; it's just one particular brand.
Which I never said. They've just either solved or are on the road to solving most of the problems that TBDR solves, so I don't see much reason to go all the way to TBDR and add on other potential problems.
 
They've just either solved or are on the road to solving most of the problems that TBDR solves, so I don't see much reason to go all the way to TBDR and add on other potential problems.

The trouble is that you can't know whether those hypothetical potential problems have been solved or circumvented in the meantime.

I'll try an alternative, highly hypothetical and speculative approach:

Assume you get similar overall performance between an IMR and a TBDR, where the latter has N raw memory bandwidth while the former has 2N or more. That would amount to a bandwidth advantage for the TBDR, wouldn't it?

Huh? I really don't understand what you're trying to say in that first paragraph. Anyway, as for the buffer consumption, high triangle densities can require similar amounts of buffer space, and I would still rather have fixed resource requirements than variable ones.

Assume that in the foreseeable future we'll see partial SSAA used in games for parts of the scene where MSAA cannot effectively compensate (alpha textures or render-to-texture, for example); then it's a far better idea to leave anti-aliasing entirely to the game developer rather than force it through the driver control panel. That's what I meant.

Parameter bandwidth is IMHO just another illusory wall, just as the supposed bandwidth-wall myth for IMRs used to be in the past. I cannot imagine a requirement higher than 10% of total framebuffer memory (and that's being extremely generous), which would be ~25MB on accelerators packing 256MB. That's exactly a quarter of what you currently need for 6x MSAA + colour compression. Try 8x or 16x samples, or even more, instead.
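For reference, here's a back-of-the-envelope check of where a "100MB+ for 6x MSAA" figure can come from; the resolution and buffer formats are my assumptions (1600x1200, 32-bit colour and 32-bit Z/stencil per sample, no compression), not numbers given in the thread:
[code]
// Rough buffer-consumption estimate for 6x MSAA at an assumed 1600x1200.
#include <cstdio>

int main()
{
    const long long width = 1600, height = 1200, samples = 6;
    const long long bytesPerSample = 4 /* colour */ + 4 /* Z/stencil */;

    long long multisampled = width * height * samples * bytesPerSample; // ~88 MB
    long long resolved     = width * height * 4 * 2;                    // front + back buffer, ~15 MB

    std::printf("6x MSAA buffers: ~%.1f MB\n",
                (multisampled + resolved) / (1024.0 * 1024.0));         // ~102.5 MB
    return 0;
}
[/code]
Colour compression of that era reduces bandwidth rather than footprint, so the memory still has to be allocated.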

That's up to software developers.

Agreed.

Well, I suppose IMRs will need to overwrite values for blending, but they won't need to in most circumstances. So yes, I suppose blending is one inherent benefit of TBDRs that just can't be matched by an IMR, but I'm not sure it's enough...

How about blending with float render targets?
 
Chalnoth said:
Keep in mind that Fablemark is also not optimized for an NV3x-style architecture. It does ambient lighting at the same time as the first z-pass, which destroys the NV3x's ability to accelerate that first z-pass.
I think you need to think a little more clearly about the Fablemark case. If it did a Z-only pass first instead of combining it with the ambient lighting, it would still have to do a colour pass later for that ambient lighting. The Z-only pass would therefore be purely an additional consumer of time, assuming their geometry is reasonably well front-to-back sorted. So this case is not inefficient - it is more efficient without the Z-only pass on all hardware.

Z-only passes are quite inefficient unless you are doing stencil shadows with a global lighting algorithm, as Doom3 is. Or unless there's no front-back sorting, but that's just sheer laziness and inefficiency.
 
Ailuros, yeah, stop proving me wrong. Please? :p
Dio said:
Z-only passes are quite inefficient unless you are doing stencil shadows with a global lighting algorithm, as Doom3 is. Or unless there's no front-back sorting, but that's just sheer laziness and inefficiency.
Out of curiosity - what if you managed to efficiently batch TONS of DIP (DrawIndexedPrimitive) calls which would otherwise have to be separate? This could become more common if branching became more economical. You then couldn't do good front-to-back sorting, and if your pixel shaders were long, an early Z pass would become, let us say, more profitable.
Or am I wrong here? :)


Uttar
 
Batches are a diminishing-returns thing. Sure, 10 primitives per batch is bad, but 100 is tolerable and 500+ is "don't care". The cost of submitting them all twice would certainly be more ;). And branching certainly isn't much of a solution to batch-size issues (the real solution, and one that everyone's working on now, is to reduce state overheads).

(I do still get annoyed seeing apps render the sky first. Crikey, I was telling people not to do that when I was in devrel five years ago. It's not hard to avoid.)
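For what it's worth, the sort being described here is cheap to do on the CPU; a minimal sketch follows, with a hypothetical Object struct (not from any real engine) ordered by view-space depth, sky last:
[code]
// Minimal front-to-back submission order, sky last. Object is a hypothetical stand-in.
#include <algorithm>
#include <vector>

struct Object {
    float viewDepth;  // distance from the camera along the view direction
    bool  isSky;      // sky boxes/domes should be drawn last, never first
    // ... vertex/index buffers, materials, etc.
};

void SortForSubmission(std::vector<Object>& objects)
{
    std::sort(objects.begin(), objects.end(),
              [](const Object& a, const Object& b) {
                  if (a.isSky != b.isSky)
                      return !a.isSky;               // opaque scene before the sky
                  return a.viewDepth < b.viewDepth;  // then roughly front to back
              });
}
[/code]
Even a coarse per-object sort like this captures most of the early-Z benefit without needing per-triangle ordering.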
 
Dio said:
I think you need to think a little more clearly about the Fablemark case. If it did a Z-only pass first instead of combining it with the ambient lighting, it would still have to do a colour pass later for that ambient lighting. The Z-only pass would therefore be purely an additional consumer of time, assuming their geometry is reasonably well front-to-back sorted. So this case is not inefficient - it is more efficient without the Z-only pass on all hardware.
That's only because the program uses no shaders. With shaders you can do ambient lighting along with other light passes. This would be particularly true with a rendering technique that uses MRTs to do all lighting by drawing a single screenspace quad.
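A very rough sketch of that idea (what is now usually called deferred shading); every type and helper below is a hypothetical placeholder to show the pass structure only, not a real API:
[code]
// Pass structure only: fill G-buffer MRTs once, then light in screen space.
#include <vector>

struct RenderTarget {};
struct Object {};
struct Light {};

// Hypothetical helpers; a real implementation would sit on top of D3D or OpenGL.
void BindRenderTargets(const std::vector<RenderTarget*>& mrts);
void BindBackBuffer();
void DrawToGBuffer(const Object& obj);  // writes albedo/normal/depth, fills the Z-buffer
void DrawFullscreenLightQuad(const Light& light,
                             const std::vector<RenderTarget*>& gbuffer);

void RenderFrameDeferred(const std::vector<Object>& objects,
                         const std::vector<Light>& lights,
                         std::vector<RenderTarget*>& gbuffer)
{
    // Pass 1: a single geometry pass writes surface attributes into the MRTs
    // and completes the Z-buffer at the same time.
    BindRenderTargets(gbuffer);
    for (const Object& obj : objects)
        DrawToGBuffer(obj);

    // Pass 2: ambient and every light are just additive screenspace quads reading
    // the G-buffer, so scene geometry is never resubmitted per light.
    BindBackBuffer();
    for (const Light& light : lights)
        DrawFullscreenLightQuad(light, gbuffer);
}
[/code]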
 
Yes, and you can fill in your Z buffer at the same time, and if you're reasonably front to back sorted it will STILL be significantly faster on all hardware. That's even before you start using a complex vertex shader where Z-passes get a lot slower and/or more complicated.
 
Dio said:
Z-only passes are quite inefficient unless you are doing stencil shadows with a global lighting algorithm, as Doom3 is. Or unless there's no front-back sorting, but that's just sheer laziness and inefficiency.

Sorry if I'm disturbing your discussion here, but do you know whether, for example, the Unreal Engine 3 will use techniques similar to Doom 3's in this case?
 
Dio said:
Yes, and you can fill in your Z buffer at the same time, and if you're reasonably front to back sorted it will STILL be significantly faster on all hardware. That's even before you start using a complex vertex shader where Z-passes get a lot slower and/or more complicated.
Except any directional lighting will be done after shadow calculations. Why not bake ambient lighting into one of these passes?
 
Most of the brains of 3dfx had left for Nvidia and ATI way before the V3 came out.

Actually many left just after the V1 was released.

They had seen how horrible the management was there.
 
Chalnoth said:
Except any directional lighting will be done after shadow calculations. Why not bake ambient lighting into one of these passes?
sigh... if you've got a lighting algorithm that absolutely requires full Z information and you can't do anything else beforehand, then yes, a Z-only pass might make sense, in that you have to form the entire Z buffer before you can get to the screen buffer anyway.

In preformed-Z cases, on R300 it might or might not be faster to make use of the colour channel during the first pass. It depends on several complex factors (there are lots of interactions, e.g. you might need separate shaders with and without the ambient calculation).

I did acknowledge this in my original post, although I should have been a little clearer that it applies to anything that needs a complete Z-buffer before it can do lighting rather than just the Doom3 algorithm.
 