If NV30 uses tile-based rendering, will ATi convert too?

Chalnoth said:
And I guarantee you that this "scene manager" needs to have an external z-buffer to work. There's absolutely no way around it, not with the way the Kyro does things.

An external Z-buffer is not an additional requirement for a TBR. The Kyro already uses an external Z-buffer for games like Mercedes Benz Truck Racing; otherwise the shadows are not rendered correctly.
 
Teasy said:
Yes, there may need to be some sort of Z-buffer there (unless they do the sorting before sending to the chip, with the scene manager, which is perfectly possible) to connect the two pieces. But that is still not what you described earlier... unless I misunderstood what you meant by the comment I originally quoted?

Right. I'll see if I can re-explain my point of view on this issue:

Assumption:

The deferred rendering architecture in question does all sorting on the card. That is, on a software level, it acts exactly like an immediate-mode renderer. This is an absolute necessity for modern graphics cards, particularly with hardware T&L.

Given this, the hardware absolutely needs every single triangle in the frame to be passed to it before it begins rasterization, if it wants to act entirely as a deferred renderer. (The performance impact of this can be reduced by double-buffering the scene buffer, which the Kyro apparently uses a form of, though the scene buffer memory is freed once a tile is rasterized, so it doesn't need all of the memory for two frames.) Since the triangles can come in in any order, until the next triangle is sent to the graphics card, the hardware has no idea which tile it will go in or what its depth values will be. So if the scene buffer is overrun and the hardware compensates by doing a second rendering pass, there is absolutely no way around keeping a full external z-buffer.
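
Roughly, in code, the binning step that forces this requirement looks something like the sketch below (all structures and sizes are illustrative assumptions, not how the Kyro actually works):

```cpp
// Hypothetical sketch of a tiler's binning stage: every post-transform
// triangle has to be assigned to the tiles its screen-space bounding box
// touches before any tile can be rasterized. Names and sizes are
// illustrative only.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vertex   { float x, y, z, w; };   // post-transform, screen space
struct Triangle { Vertex v[3]; };

constexpr int kTileSize = 32;            // pixels per tile edge (assumed)
constexpr int kTilesX   = 1024 / kTileSize;
constexpr int kTilesY   = 768  / kTileSize;

struct SceneBuffer {
    std::vector<Triangle> bins[kTilesX * kTilesY];
    bool overflowed = false;
};

void binTriangle(SceneBuffer& sb, const Triangle& t, std::size_t budget)
{
    // Screen-space bounding box of the triangle, converted to tile indices.
    float minX = std::min({t.v[0].x, t.v[1].x, t.v[2].x});
    float maxX = std::max({t.v[0].x, t.v[1].x, t.v[2].x});
    float minY = std::min({t.v[0].y, t.v[1].y, t.v[2].y});
    float maxY = std::max({t.v[0].y, t.v[1].y, t.v[2].y});

    int tx0 = std::clamp(int(minX) / kTileSize, 0, kTilesX - 1);
    int tx1 = std::clamp(int(maxX) / kTileSize, 0, kTilesX - 1);
    int ty0 = std::clamp(int(minY) / kTileSize, 0, kTilesY - 1);
    int ty1 = std::clamp(int(maxY) / kTileSize, 0, kTilesY - 1);

    for (int ty = ty0; ty <= ty1; ++ty)
        for (int tx = tx0; tx <= tx1; ++tx) {
            auto& bin = sb.bins[ty * kTilesX + tx];
            if (bin.size() >= budget) {
                // Scene buffer overrun: the hardware would have to flush and
                // rasterize what it has, and keep a full external z-buffer so
                // the next partial pass can be merged correctly.
                sb.overflowed = true;
                return;
            }
            bin.push_back(t);
        }
}
```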

Another option for handling a scene buffer overrun would be to simply grow the scene buffer dynamically. This will obviously cause a performance stall, probably even larger than the one described above.

In the end, I really feel that while deferred rendering is not an utterly useless concept, the hardware should not attempt to always cache the entire scene. That way it will avoid massive framerate drops in specific, even if rare, scenarios.

It also seems apparent that there really isn't a need for that much more fillrate. What we need are more complex shaders (more computational power...not necessarily more memory bandwidth) and more geometry.
 
It is only absolutely necessary for a modern graphics card of today to have a semi-complete list of post-transform vertices if it uses tiling; a "post-modern" graphics card with a corresponding API would not necessarily need one.

For tilers as they stand, having a full external framebuffer is hardly an issue AFAICS ... it isn't memory which is expensive, it is bandwidth. Needing both the full external framebuffer plus the extra storage for the scene buffer (which you could easily pull across AGP, BTW) is not an issue; the storage poses almost no extra cost compared to the immediate-mode card. As for bandwidth, even in the worst case it would only consume moderately more bandwidth than immediate-mode rendering (assuming unique texturing).

Accurate lighting will be able to eat any fillrate you throw at it for quite a while yet; you can't determine shadowing (in the general case) in a pixel shader.
 
MfA said:
It is only absolutely necessary for a modern graphics card of today to have a semi-complete list of post-transform vertices if it uses tiling; a "post-modern" graphics card with a corresponding API would not necessarily need one.

Why? That doesn't make any sense. As long as the transformation is done on the card, all of the vertices need to be transformed, sorted, and binned before rasterization of the frame can begin, if you want to maintain deferred rendering. Otherwise there's nothing deferred about it.

For tilers as they stand, having a full external framebuffer is hardly an issue AFAICS ... it isn't memory which is expensive, it is bandwidth. Needing both the full external framebuffer plus the extra storage for the scene buffer (which you could easily pull across AGP, BTW) is not an issue; the storage poses almost no extra cost compared to the immediate-mode card. As for bandwidth, even in the worst case it would only consume moderately more bandwidth than immediate-mode rendering (assuming unique texturing).

Memory does start getting expensive once you add in FSAA modes and the like. It also stands to reason that eventually we'll be at the stage where micropolygons will be used (polys smaller than a pixel), where scene data storage will far outweigh framebuffer storage. It will be quite a long time before that happens, but moving to deferred rendering will only serve to delay it.

The primary point I'm trying to make is that the fillrate of today's high-end video cards, and therefore tomorrow's low-end, is very close to good enough. There's just no need to go for a deferred rendering approach. What we do need is for polycounts to increase significantly (Hopefully we'll have a primitive processor available soon...).

And as for other applications of improving fillrate, I was pretty certain that their memory bandwidth/computational power ratio would essentially always be equal to or less than that for "normal" rendering. For example, the shadows in DOOM3 look, from what I've read, to be more computational-intensive than memory bandwidth-intensive. The only exception would be floating-point buffers, but since those will only be used for advanced pixel shaders, I doubt they'll often need any more bandwidth than the additional computational power they will require.
 
Chalnoth said:
For example, the shadows in DOOM3 look, from what I've read, to be more computational-intensive than memory bandwidth-intensive.
The shadows in DOOM3 are computed via stencil ops on projected triangles. Is that what you consider "computational-intensive"?
 
I guess what I'm trying to say is that the primary benefit that I see from deferred rendering isn't actually total memory bandwidth used, but rather the ratio between memory bandwidth used and computational power.

I was under the impression that modern graphics processors were going to become more and more bound by their computational power, as it applies to fillrate. Modern graphics architectures can take care of pretty much any amount of overdraw, if necessary (example: do an initial z-pass across the entire scene), so that leaves most of the benefit of a deferred renderer in not having to deal with an external z-buffer and always writing sequentially to memory.
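
For reference, the z-first trick mentioned above looks roughly like this on an IMR (a sketch only; it assumes an active OpenGL context, and drawScene() is a placeholder for whatever submits the frame's opaque geometry):

```cpp
// Sketch of an initial z-only pass on an immediate-mode renderer.
#include <GL/gl.h>

void drawScene() { /* placeholder: submit all opaque geometry here */ }

void renderFrameWithZPrepass()
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    // Pass 1: lay down depth only. Colour writes are off, so the pass is
    // cheap; afterwards the z-buffer holds the nearest depth everywhere.
    glEnable(GL_DEPTH_TEST);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    drawScene();

    // Pass 2: shade. With GL_LEQUAL only the visible surface passes the
    // depth test, so expensive shading happens once per pixel regardless
    // of how much overdraw the scene contains.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);
    glDepthFunc(GL_LEQUAL);
    drawScene();
}
```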
 
OpenGL guy said:
The shadows in DOOM3 are computed via stencil ops on projected triangles. Is that what you consider "computational-intensive"?

No, not really. I was more alluding to the fact that the primary benefit that the NV30 (and presumably the R300) will get in DOOM3 is the ability to do 2-sided stencil ops in a single pass, as opposed to the two passes needed on most of today's hardware. That is, it actually has to compute less.
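
For context, this is the conventional two-pass stencil update that two-sided stencil hardware folds into a single pass (a sketch only: the depth-pass variant is shown for brevity, an active OpenGL context is assumed, and drawShadowVolumes() is a placeholder):

```cpp
// Classic two-pass stencil shadow-volume update on pre-NV30/R300 hardware.
#include <GL/gl.h>

void drawShadowVolumes() { /* placeholder: submit the shadow volume geometry */ }

void updateShadowStencilTwoPass()
{
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_FALSE);                               // no depth writes
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE); // stencil only
    glEnable(GL_STENCIL_TEST);
    glStencilFunc(GL_ALWAYS, 0, ~0u);
    glEnable(GL_CULL_FACE);

    // Pass 1: front faces increment the stencil where the depth test passes.
    glCullFace(GL_BACK);
    glStencilOp(GL_KEEP, GL_KEEP, GL_INCR);
    drawShadowVolumes();

    // Pass 2: back faces decrement. Two-sided stencil hardware does both
    // updates in one pass, so the volume geometry is transformed and set up
    // only once.
    glCullFace(GL_FRONT);
    glStencilOp(GL_KEEP, GL_KEEP, GL_DECR);
    drawShadowVolumes();

    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);
}
```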

As for what is more computational-intensive about DOOM3, the main reason is the initial z-pass, which should make the fillrate efficiency of IMRs quite high. I don't really see the multiple passes in DOOM3 needing more memory bandwidth per pixel per pass than you see in most of today's games.
 
Chalnoth said:
OpenGL guy said:
The shadows in DOOM3 are computed via stencil ops on projected triangles. Is that what you consider "computational-intensive"?

No, not really. I was more alluding to the fact that the primary benefit that the NV30 (and presumably the R300) will get in DOOM3 is the ability to do 2-sided stencil ops in a single pass, as opposed to the two passes needed on most of today's hardware. That is, it actually has to compute less.
Compute less what? It's the same number of stencil operations either way, correct? Maybe you don't have to transform the triangles, but I don't think that's the limiting factor here.
As for what is more computational-intensive about DOOM3, the main reason is the initial z-pass, which should make the fillrate efficiency of IMRs quite high.
No offense, but I can't make sense of this.
I don't really see the multiple passes in DOOM3 needing more memory bandwidth per pixel per pass than you see in most of today's games.
Have you actually computed the amount of bandwidth required for a typical pixel in DOOM3? How about for a typical pixel in Quake 3? How fast does Quake 3 run at 1024x768 on graphics card X? How fast do you think DOOM3 will run at the same settings on the same card? What's the bottleneck?
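
As an illustration of the kind of accounting being asked for (every number below is an assumption, not a measurement of either game):

```cpp
// Back-of-envelope per-pixel bandwidth bookkeeping. All costs are assumed
// purely for illustration; they are not DOOM3 or Quake 3 figures.
#include <cstdio>

int main()
{
    const int zRead  = 4, zWrite = 4;      // 24/8 z/stencil, bytes
    const int cRead  = 4, cWrite = 4;      // 32-bit colour, bytes
    const int texels = 2, texelBytes = 4;  // assumed post-cache fetches per pass

    const int perPass = zRead + zWrite + cRead + cWrite + texels * texelBytes;
    const int passes  = 4;                 // assumed: z-fill plus a few light passes

    std::printf("~%d bytes/pixel/pass, ~%d bytes/pixel/frame\n",
                perPass, perPass * passes);
    return 0;
}
```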
 
Chalnoth said:
MfA said:
It is only absolutely necessary for a modern graphics card of today to have a semi-complete list of post-transform vertices if it uses tiling; a "post-modern" graphics card with a corresponding API would not necessarily need one.

Why? That doesn't make any sense. As long as the transformation is done on the card, all of the vertices need to be transformed, sorted, and binned before rasterization of the frame can begin, if you want to maintain deferred rendering. Otherwise there's nothing deferred about it.

Well, the term deferred rendering only makes sense in the context of an immediate-mode (polygon-pushing) rendering API ... with a scenegraph API the hardware could access and render the scenegraph directly in (approximate) tile order, and there would indeed be little deferred about it (potentially apart from deferred shading). It would still be a tiler though.

Even in an immediate-mode API you could allow the developer to associate bounding volumes with subsets of his rendering commands. I hope you can see how this could be useful for a tiler ... hell, it could be useful for allowing the hardware to do a little more intelligent occlusion culling too, without needing feedback to the 3D engine, especially with display lists (in the OpenGL sense, so that is a stored list of rendering commands and not just references to geometry).
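
A purely hypothetical flavour of what that could look like (the bounding-volume call is invented for illustration and stubbed so the sketch compiles; the display-list calls are standard OpenGL):

```cpp
// Idea sketch: attach a bounding volume to a batch of stored rendering
// commands so the driver/hardware can bin or occlusion-cull the whole batch
// without transforming any of its vertices first.
#include <GL/gl.h>

struct Aabb { float min[3], max[3]; };

// Invented entry point, not a real extension; imagine it being recorded
// into the display list like any other command.
inline void glBatchBoundingVolumeHINT(const Aabb& /*bounds*/) {}

void drawCrate() { /* placeholder: the actual rendering commands */ }

GLuint buildCrateList(const Aabb& crateBounds)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    glBatchBoundingVolumeHINT(crateBounds);  // "everything in this list fits in here"
    drawCrate();
    glEndList();
    return list;
}

// Per frame the hardware could then bin the whole list into tiles, or reject
// it outright, from the bounding volume alone:
//   glCallList(list);
```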

For tilers as they stand, having a full external framebuffer is hardly an issue AFAICS ... it isn't memory which is expensive, it is bandwidth. Needing both the full external framebuffer plus the extra storage for the scene buffer (which you could easily pull across AGP, BTW) is not an issue; the storage poses almost no extra cost compared to the immediate-mode card. As for bandwidth, even in the worst case it would only consume moderately more bandwidth than immediate-mode rendering (assuming unique texturing).

Memory does start getting expensive once you add in FSAA modes and the like.

Yes, the full external back buffer removes one advantage from the tiler. I'm merely pointing out that it does not actually add any real disadvantage at the moment (the storage and bandwidth consumed by the scene buffer are too small to be of much importance, as I indicated earlier).

The primary point I'm trying to make is that the fillrate of today's high-end video cards, and therefore tomorrow's low-end, is very close to good enough. There's just no need to go for a deferred rendering approach. What we do need is for polycounts to increase significantly (Hopefully we'll have a primitive processor available soon...).

My point is that the information to be used in longer and longer shaders will have to come from sampling the environment (in the form of shadow information, environment maps, etc.); there is only so much you can do in a shader without more inputs ... unless you start using procedural texturing, but as I said before, I don't think that is terribly relevant.

And as for other applications of improving fillrate, I was pretty certain that their memory bandwidth/computational power ratio would essentially always be equal to or less than that for "normal" rendering. For example, the shadows in DOOM3 look, from what I've read, to be more computational-intensive than memory bandwidth-intensive. The only exception would be floating-point buffers, but since those will only be used for advanced pixel shaders, I doubt they'll often need any more bandwidth than the additional computational power they will require.

It will remain roughly the same as today; bandwidth is a problem today.

Marco
 
MfA said:
Well, the term deferred rendering only makes sense in the context of an immediate-mode (polygon-pushing) rendering API ... with a scenegraph API the hardware could access and render the scenegraph directly in (approximate) tile order, and there would indeed be little deferred about it (potentially apart from deferred shading). It would still be a tiler though.

Right. The only thing is, no scene graph is yet widely available as an API. It will be interesting to see if we ever begin to use them for realtime graphics. Yes, scene graphs would have potentially huge benefits for a variety of reasons, but right now I don't think it makes much sense to start talking about them; we don't even know if they'll ever come into widespread use.

Even in an immediate-mode API you could allow the developer to associate bounding volumes with subsets of his rendering commands. I hope you can see how this could be useful for a tiler ... hell, it could be useful for allowing the hardware to do a little more intelligent occlusion culling too, without needing feedback to the 3D engine, especially with display lists (in the OpenGL sense, so that is a stored list of rendering commands and not just references to geometry).

I would think that this would only be particularly useful if the information was sent before the geometry.

Yes, the full external back buffer removes one advantage from the tiler. I'm merely pointing out that it does not actually add any real disadvantage at the moment (the storage and bandwidth consumed by the scene buffer are too small to be of much importance, as I indicated earlier).

Too small to be of much importance yet. Average triangle sizes are currently too large to make it important.

My point is that the information to be used in longer and longer shaders will have to come from sampling the environment (in the form of shadow information, environment maps, etc.); there is only so much you can do in a shader without more inputs ... unless you start using procedural texturing, but as I said before, I don't think that is terribly relevant.

If you're talking about texture reads, then the tiler will need to do those just as much as an IMR. Additionally, there are many calculations that can be done without much additional input. A quick example would be a good Fresnel reflection/refraction implementation. Doing the full Fresnel calcs can be quite expensive.
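
To illustrate what "expensive" means here: the exact unpolarised Fresnel term next to Schlick's cheap approximation, written as plain C++ rather than shader code (the math is the same either way):

```cpp
#include <cmath>

// Exact unpolarised Fresnel reflectance for a dielectric, given the cosine
// of the incident angle and the indices of refraction on each side.
float fresnelExact(float cosI, float n1, float n2)
{
    float sinT2 = (n1 / n2) * (n1 / n2) * (1.0f - cosI * cosI);
    if (sinT2 > 1.0f) return 1.0f;             // total internal reflection
    float cosT = std::sqrt(1.0f - sinT2);
    float rs = (n1 * cosI - n2 * cosT) / (n1 * cosI + n2 * cosT);
    float rp = (n1 * cosT - n2 * cosI) / (n1 * cosT + n2 * cosI);
    return 0.5f * (rs * rs + rp * rp);         // divides and a sqrt per pixel
}

// Schlick's approximation: a handful of multiplies, no divides or sqrt.
float fresnelSchlick(float cosI, float r0)
{
    float m = 1.0f - cosI;
    return r0 + (1.0f - r0) * m * m * m * m * m;
}
```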

It will remain roughly the same as today; bandwidth is a problem today.

Marco

It's not that much of a problem, certainly not as much of a problem as it was back in the days of the GeForce2.

Take the GeForce4 Ti, for example. Given the very small performance hit from enabling 2x FSAA, these cards just aren't very memory bandwidth limited most of the time. You really need to go to 4x FSAA to turn memory bandwidth into a significant limitation. Future FSAA techniques will continue to improve this ratio (The Radeon 9700, for example, appears to be more efficient in FSAA memory bandwidth usage than the GF4...).
 
Scenegraphs are in widespread use; most non-trivial 3D engines use them internally in one form or another (a rose by any other name...).

Geometry is not sent very often as it is now, and it will almost never be sent in the future. Geometry will be referenced, or created on the fly by the GPU. If you know beforehand, from the bounding volume, which tile the geometry is going to end up in without creating and/or transforming every vertex of it, then you can defer the creation and/or transformation until you are rendering the tile (in which case, just like an immediate-mode renderer, you can use the result and throw it away). So your scene buffer just has to store the rendering commands and the geometry references, instead of the transformed vertices, which constitutes vastly less data.
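
Roughly, the difference being described is between these two scene-buffer layouts (all types and sizes below are illustrative assumptions):

```cpp
// Layout A holds post-transform vertices per tile; layout B holds only
// rendering commands plus geometry references and a bounding volume, with
// the vertices created/transformed later, per tile, and then discarded.
#include <cstdint>
#include <vector>

// Layout A: today's tiler. Every transformed vertex touching a tile is kept
// until that tile is rasterized.
struct TransformedVertex { float x, y, z, w, u, v; std::uint32_t color; }; // ~28 bytes each
struct TileBinA { std::vector<TransformedVertex> verts; };

// Layout B: the "reference only" scene buffer.
struct Aabb { float min[3], max[3]; };
struct BatchRef {
    Aabb          bounds;      // known up front, no per-vertex work needed
    std::uint32_t commandId;   // reference into a stored command/state list
    std::uint32_t geometryId;  // reference to static or GPU-generated geometry
};                             // ~32 bytes per *batch*, not per vertex
struct TileBinB { std::vector<BatchRef> batches; };
```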

If you want really complex functions with only a couple of inputs, you are better off using a LUT-based approach.

I was not talking about texture reads as such, more that a lot of those textures (environment/shadow/fog/etc. maps) would have to be created on the fly (that also goes for the stencil buffer in the case of stencil-buffer shadows, of course), sucking up fillrate.

NVIDIA's FSAA is a prime example: the quality of the anti-aliasing is limited by bandwidth ... the computational load of higher-quality anti-aliasing than 2X isn't the problem, the memory bandwidth is. The Radeon 9700 allows the quality to go a little higher by trading extra computation (compression) for bandwidth, but it doesn't manage to remove the probably mostly bandwidth-related hit for 4X ... and that trade is a sacrifice in itself, it takes transistors, which just goes further to show that there is never enough bandwidth.
 
MfA said:
NVIDIA's FSAA is a prime example: the quality of the anti-aliasing is limited by bandwidth ... the computational load of higher-quality anti-aliasing than 2X isn't the problem, the memory bandwidth is. The Radeon 9700 allows the quality to go a little higher by trading extra computation (compression) for bandwidth, but it doesn't manage to remove the probably mostly bandwidth-related hit for 4X ... and that trade is a sacrifice in itself, it takes transistors, which just goes further to show that there is never enough bandwidth.
I don't understand your argument. Everything takes transistors. Things that make operations go faster sometimes take more transistors: does that make it a bad thing? Also, resources should be conserved when possible. Bandwidth is a resource, so it's good to conserve it.

If you had an alternative form of AA that didn't use more bandwidth, chances are it would use more transistors.
 
One thing: the premise of this article is that ATi would be caught totally off guard by NVIDIA's next chip (assuming it used TBR). Even with NDAs, do you honestly believe that ATi and NVIDIA don't have a pretty good idea what's going on with each other? I'm sure corporate espionage is pretty common... and while this may not go so far as actually having "spies" and "moles" or whatever, things leak out even with NDAs. A lot of these engineers apparently know each other, and I'm sure everyone has a pretty good idea what everyone else is doing at all times.
 
It means it is sacrificing the potential to speed up rendering further, because it needs the bandwidth more; otherwise it would simply not be able to store the results fast enough ... but if bandwidth were not a problem to begin with, the actual rendering could be plain faster.
 
The deferred rendering architecture in question does all sorting on the card. That is, on a software level, it acts exactly like an immediate-mode renderer. This is an absolute necessity for modern graphics cards, particularly with hardware T&L.

I'd just like to mention that the card in question (Kyro) doesn't have HW T&L.

Kyro does all sorting on the card, yeah, but it doesn't have to AFAICS.
 
You want to do the tiling/binning (not sorting, which implies something else) in HW, as doing it in SW requires you to touch all your vertex data, which is bad for the processor's primary data cache, particularly on static geometry. This is less true in the case of HW that doesn't support HW vertex processing, but it still makes sense in the cases where titles are doing their own T&L.

In general, IF you run out of parameter space you have to render to free up memory, which means you need storage for a Z-buffer in those cases. But just to reiterate what I said some time ago (was it in this thread?), you still remove overdraw within each pass, which can still be a substantial saving. The tiler isn't really at a disadvantage in the overflow case, as the memory associated with the PB is small in comparison to the back buffer and Z-buffer themselves, which the IMR also needs. And there are the BW advantages that I mentioned before...

John.
 
MfA, re: tiling scene-graph nodes...

Is the following essentially what you mean?

node:
bounding geometry, and pointers to other nodes
leaf:
bounding geometry, actual geometry, references to vertex/pixel/tessellation programs (perhaps just a unified rendering program), references to textures required, and a set of leaf parameters.
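
Or, spelled out as data (hypothetical types, just to pin the question down):

```cpp
#include <cstdint>
#include <vector>

struct Aabb { float min[3], max[3]; };

struct Node {
    Aabb               bounds;     // bounding geometry
    std::vector<Node*> children;   // pointers to other nodes (or leaves)
};

struct Leaf : Node {
    std::uint32_t              geometryId;       // actual geometry
    std::uint32_t              renderProgramId;  // unified vertex/pixel/tessellation program
    std::vector<std::uint32_t> textureIds;       // textures required
    std::vector<float>         parameters;       // the set of leaf parameters
};
```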

Would the GPU control the hierarchy? Would the GPU update bounding volumes if positions or other leaf parameters change? Or are these things that the developer would have to handle each frame?

The main thing I don't understand with this approach: why not just tile leaf nodes and forget about the hierarchy? I'm not convinced that it buys you all that much. It's also something each developer is likely to have a slightly different approach to.

any reply much appreciated,
Serge
 
With hierarchy, the tiling and occlusion culling are mostly output-sensitive; without hierarchy, they are mostly input-sensitive. If you have, say, a million entities of which only around a thousand are visible at any one time, it is better to be output-sensitive.
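
In code, the output-sensitive case is just a traversal that rejects whole subtrees on one bounding-volume test (everything here is illustrative; the two placeholder functions stand in for a real frustum/HZ test and for handing a leaf to the binner):

```cpp
#include <vector>

struct Aabb { float min[3], max[3]; };

struct Node {
    Aabb               bounds;
    std::vector<Node*> children;        // empty for leaves
    int                geometryId = -1;
};

bool visible(const Aabb& /*bounds*/) { return true; } // placeholder: frustum / HZ test
void emitForTiling(int /*geometryId*/) {}             // placeholder: hand leaf to the binner

void cull(const Node& n)
{
    if (!visible(n.bounds))
        return;                 // a whole hidden subtree costs one test, not one per entity
    if (n.children.empty()) { emitForTiling(n.geometryId); return; }
    for (const Node* c : n.children)
        cull(*c);
}
```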

You only need to update the parts of the scenegraph which change. With a scenegraph API, the lower parts of the hierarchy would probably be under user control (say, the hierarchy of the bounding volumes of a single player model); higher parts (space subdivision) would most likely be handled below the scenegraph API, using only hints from the developer on how dynamic the entities are.

I think the concept of display lists (which have unfortunately become unpopular of late) and occlusion testing provides a way to put everything under the control of the developer without a scenegraph API, though.

What if under OpenGL you could just say "if Visible(boundingVolume) ..."? Together with display lists and an ability to declare additional state variables (for instance to pass time, for ongoing animations, without having to update each individual display list which needs it), you could quite comfortably maintain your own hierarchical scenegraph through OpenGL AFAICS. Immediate-mode 3D hardware would be able to do sensible occlusion culling with this, and as an added benefit tilers could parse the visibility checks to reorder the command stream for approximate tile-order rendering. Personally I think this is a patently obvious and damned good idea ... but I'm universally ignored every time I utter it :(
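
As a sketch of how that could read (Visible() is invented and stubbed here so it compiles; in the proposal the test would be resolved by the hardware inside the command stream, with no result ever read back to the CPU, so think of it as marking a conditional region rather than a true readback; the display-list call is standard OpenGL):

```cpp
#include <GL/gl.h>

struct Aabb { float min[3], max[3]; };

// Invented entry point for the "if Visible(boundingVolume)" idea.
bool Visible(const Aabb& /*bounds*/) { return true; }

void drawWorld(GLuint roomList, const Aabb& roomBounds,
               GLuint playerList, const Aabb& playerBounds)
{
    // Whole stored command lists are kept or culled as a unit. An IMR can use
    // the check for occlusion culling; a tiler could parse the same checks to
    // reorder the command stream into approximate tile order.
    if (Visible(roomBounds))   glCallList(roomList);
    if (Visible(playerBounds)) glCallList(playerList);
}
```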
 
MfA said:
With hierarchy, the tiling and occlusion culling are mostly output-sensitive; without hierarchy, they are mostly input-sensitive. If you have, say, a million entities of which only around a thousand are visible at any one time, it is better to be output-sensitive.

Erm, I'm not sure if I understood your post... This input and output sensitivity: do you mean sensitive to input/output order and/or size?

Are you saying: the GPU uses the hierarchy to get an object drawing order optimal for occlusion culling? i.e.:
render nodes/objects (or at least their z-values) which are close to the viewpoint first, thus updating an HZ buffer which is used to discard objects further along in the traversal of the hierarchy?

--

Note, I am not questioning whether or not scenegraphs are a good idea. A hierarchical bounding-volume approach is a great way to do frustum culling, and to obtain a rough front-to-back ordering of entities.

What I am questioning is whether the GPU is the place to do the scenegraph traversal/management.

Now, if I understand correctly, you are advocating doing it on the GPU (or with display lists) so that (1) developers can't screw it up, and (2) visibility-testing a node does not require synchronization of GPU and CPU activity?
 
MfA said:
Scenegraphs are in widespread use; most non-trivial 3D engines use them internally in one form or another (a rose by any other name...).

Right. I was thinking more along the lines of an API-type scene graph. That is, one that would be used, in its entirety, across a variety of applications. I'm not sure this will ever happen, though it could.

Geometry is not sent very often as it is now, and it will almost never be sent in the future. Geometry will be referenced, or created on the fly by the GPU. If you know beforehand, from the bounding volume, which tile the geometry is going to end up in without creating and/or transforming every vertex of it, then you can defer the creation and/or transformation until you are rendering the tile (in which case, just like an immediate-mode renderer, you can use the result and throw it away). So your scene buffer just has to store the rendering commands and the geometry references, instead of the transformed vertices, which constitutes vastly less data.

But to do this completely, you'd need to transform every vertex twice, except possibly in the case of higher-order surfaces (where you'd just need to transform the patch/control points twice, not the tessellated surface). Geometry is not yet created on the fly, except in the case of subdivision surfaces, and it is rarely duplicated (UT2k3 has the ability to reference the same geometry multiple times, but it turns out that it is slower to do it that way).

Again, you're talking about reducing memory bandwidth by using more transistors.
 