If NV30 uses tile-based rendering, will Ati convert too?

GraphixViolence said:
ATI's documentation claimed their HyperZ III virtually eliminated overdraw using an early Z-test. Are there any significant benefits to tile-based architectures other than overdraw reduction? Because if not, this debate might be moot.
With early Z you only guarantee that no pixel shading is done if the pixel being rendered is behind the current depth buffer value. If a scene is drawn back to front, there is still overdraw.
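A quick single-pixel sketch of what I mean (a toy Python example; the depths and the "shader invocation" counter are made up purely for illustration):

```
# Toy early-Z model for a single pixel: smaller depth = closer to the camera.
def render(fragments):
    # Run fragments through an early-Z test and count shader invocations.
    depth_buffer = float("inf")   # cleared depth value
    shaded = 0
    for depth in fragments:
        if depth < depth_buffer:  # early Z: shading is skipped only if occluded
            shaded += 1           # the pixel shader runs here
            depth_buffer = depth
    return shaded

print(render([1.0, 2.0, 3.0]))    # front to back: 1 invocation, no overdraw
print(render([3.0, 2.0, 1.0]))    # back to front: 3 invocations, full overdraw
```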
 
I'd just like to put my two cents in on the original topic of this thread.

Yes, if nVidia (or, to a lesser extent, ATI) puts out a product that uses a deferred rendering technique, ATI will need to follow to keep pace.

The primary reason, as I see it, is that the main objection to a deferred rendering implementation would be dealing with large amounts of scene data. That is, high polycounts and complex shaders take up lots of data, and therefore lots of memory bandwidth.

In other words, deferred rendering would probably be excellent for essentially every game in existence today. What I don't believe it would be good for is games heading into the next few years. This is contingent upon the idea that triangle processing rates will increase faster than pure fillrates going into the future (which appears to be the case currently, and will become even more obvious as higher-order surfaces (HOS) come into use, whenever that may be).

In essence, then, this is what I see as the problem. If nVidia (or ATI) puts out a deferred renderer, it shouldn't be that challenging to make that deferred renderer perform much better than a comparable immediate-mode renderer for today's games. Then the other company, as well as any smaller companies who want to try to take on the Big Boys, will need to move to deferred rendering to stand a chance. I believe that this will, in turn, hold back the progress of improving polycounts and complex fragment shaders.

After having said all of this, I do feel that there is still some merit to the ideas of deferred rendering. Primarily, I feel that it would be beneficial for GPUs not to convert entirely to deferred rendering, but instead to use a scene buffer with a set size limit that is not meant to capture all of the scene data, which means the need for an external z-buffer is not eliminated. This essentially eliminates the problem of scene buffer overflows that a full deferred renderer faces with complex scenes.
 
If bucket rendering was good enough for Toy Story, it is good enough for us (if tilers become the norm as far as hardware is concerned, random-order polygon pushing, whether generated on the fly or not is beside the point, won't be the norm for very much longer on the software side).
 
arjan de lumens said:
Tiled framebuffers (where you split the frame into small tiles and then render a triangle to one tile before proceeding to the next tile, continuing with the same triangle until you have rendered all affected tiles, and only then switch to the next triangle) have been present since Voodoo1 - AFAIK, all voodoo, geforce and radeon series cards support tiled framebuffers. Which, IMO, has nothing to do with tile-based rendering.

Not tiled framebuffer, possibly textures. The VoodooII was most certainly a scanline renderer. In SLI mode the cards would render every other scanline.
If it's rendering to tiles, then the rendering process is tile-based. If it bins and sorts the data, it's a deferred renderer. Now "tile-based" has come to mean deferred renderer in many people's minds though.
 
MfA said:
If bucket rendering was good enough for Toy Story, it is good enough for us (if tilers become the norm as far as hardware is concerned, random-order polygon pushing, whether generated on the fly or not is beside the point, won't be the norm for very much longer on the software side).

Realtime and offline rendering have different needs.

As an example, it was cheaper for the software rendering in Unreal to use a very robust visibility-detection algorithm rather than use a z-buffer. There's no reason to believe that things must be done in the same way in a future single-chip realtime 3D processor as they were done a few years ago in massive server farms.
 
It is just a tit-for-tat argument :) I put my more serious reply inside the parentheses. Your assertion that somehow HOS or an increased number of tris present a problem is based on assumptions which are more of a stretch than the ones needed for my simple "What is good for Toy Story is good for us" one, IMO.

With a "large enough" tile-size (larger than a pixel, smaller than a frame) an approximate hierarchical tiling of a scene introduces no "significant" overhead (talking percentages here). After such a process you can render a tile with no significant storage overhead, because tiling would have been done above the triangle level, and consequently a bandwith advantage (assuming unique texturing). If tilers became the norm this would be as relevant for realtime as it is for offline rendering ... since hierarchical tiling would become the norm too, either performed by the developers themselves or by the drivers/hardware with information gained through a scenegraph API.
 
Well, I'm not entirely certain what you're talking about as you describe that.

What I was describing, very specifically, is an architecture that defers rendering until all depth data is readily available.

One thing that you must realize is that for offline rendering, the drawbacks of tile-based rendering are relatively insignificant. That is, the primary drawback, as I stated, stems from the memory requirements of the scene buffer. The problem arises for realtime rendering when the scene buffer is overrun.

To illustrate this, I'll use what is essentially a worst-case scenario. Imagine a game that averages 4MB of scene buffer data per frame, but spikes as high as 9MB per frame. The tiler in question uses an 8MB scene buffer (note that these numbers are quite high compared to the Kyro and modern games, but still well below the memory bandwidth/size requirements of modern z-buffers). For this little thought exercise, let's say that the gamer is cruising along at 1600x1200x32 with 6x FSAA.

Because of the efficiency of the tiler, there is only a need for double buffering at 1600x1200x32 (no extra memory for FSAA needed). This means the video card only needs about 14.6MB of framebuffer data, plus the 8MB scene buffer, to store each frame. Obviously this will be better than an immediate-mode renderer.

That is, until the situation arises where the scene data spikes to 9MB in a single frame. Once this happens, big problems occur. Basically, there are three ways to handle this within the deferred renderer:

1. Just don't worry about it and let graphical errors creep into the scene. Obviously a big no-no.

2. Dynamically allocate more memory for the scene buffer. Again, this can be very bad, and would likely cause a very noticeable stall in performance.

3. Write out an external, full-size z-buffer and frame buffer, clear the scene buffer, and then write a new set of tiles. This is apparently what the Kyro line does when this particular problem occurs. In the situation illustrated above, this will require about 58.6MB of framebuffer data, plus the 8MB for the scene buffer (6x for the back buffer, 1x for the z-buffer, 1x for the front buffer... or even more if downsampling is not done on buffer swap).
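For reference, the 14.6MB and 58.6MB figures work out roughly as follows (a back-of-the-envelope check assuming 4 bytes per colour/Z sample and 1MB = 2^20 bytes):

```
# Rough check of the framebuffer figures in the scenario above.
MB = 2 ** 20
buffer_mb = 1600 * 1200 * 4 / MB          # one 1600x1200x32 buffer, ~7.3MB

normal_case = 2 * buffer_mb               # double-buffered tiler, ~14.6MB
spill_case = (6 + 1 + 1) * buffer_mb      # 6x back + z + front buffers, ~58.6MB

print(f"per buffer:  {buffer_mb:.1f}MB")
print(f"normal case: {normal_case:.1f}MB, spill case: {spill_case:.1f}MB")
```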

So, in the end, what we have is a video card that, due to the worst-case scenario of its method of rendering, will essentially always need to use more video memory for the frame buffer than an immediate-mode renderer, and will take a very significant memory bandwidth hit when it hits that worst-case scenario. It does remain true that with a scene buffer of this approximate size, the memory bandwidth required will still be less than that of the immediate-mode renderer, but that won't matter, as it's the change in required memory bandwidth that is of significance here.

This is also why I feel that a partial deferred renderer would be a good idea. Without attempting to buffer the entire scene, a partial deferred renderer wouldn't have the massive performance deltas that come from overrunning a specific limit built into either the drivers or hardware.
 
I understand that there isn't any viable paradigm other than the KYRO present as an existing representative of tilers, yet that still doesn't change the fact that it's a 1999 architecture/design.

I guess we'll be much the wiser if and when PVR finally comes out with a potential next-generation product.

In retrospect, though, I'm all for marrying the advantages of both IMRs and TBRs into one.
 
Humus said:
If it's rendering to tiles, then the rendering process is tile-based. If it bins and sorts the data, it's a deferred renderer. Now "tile-based" has come to mean deferred renderer in many people's minds though.

Dunno about that. In my case I just don't get the idea of tile-based rendering (sans deferred rendering!). As I wrote in my prior post (please revisit), I apparently make tile-based rendering more complex than it is. I'm not trying to be ignorant ;) but trying to get my facts straight.
 
arjan de lumens said:
So, Humus, given that e.g. all the Radeons render triangles one tile at a time, does that make them tile-based renderers?

Yes imho. But then "tile-based" is quite a vague term and when people talk about it they usually really mean deferred renderer, which the Radeons certainly aren't.
 
Chalnoth, the memory footprint associated with the intermediate parameter buffer required by a TBR is tiny in comparison to the size of the other buffers you describe. As such, in the situation you describe TBR memory usage is only marginally worse than an IMR's; however, the TBR's performance will still be superior to the IMR solution for a number of reasons... Back and Z buffer accesses are in tile bursts only, so they cause considerably fewer page breaks than IMR accesses to the same. Large chunks of the scene are still rendered in deferred style, so there is still a huge performance benefit gained from the relatively small parameter buffer. Read-modify-writes for translucency remain "BW free" due to the absence of an external memory access (OK, a small reduction in benefit is seen due to the small number of "passes" over the scene) - consider the effect of this when you're rendering to the new high-precision FP surfaces (up to 128BPP). I could go on...

Anyway, the "scene storage problem" that's always raised is considerably less of an issue than you'd think, particularily if you actually want "real time" rendering, think i've mentioned why in other threads.

Later,
John.
 
LeStoffer said:
Dunno about that. In my case I just don't get the idea of tile-based rendering (sans deferred rendering!). As I wrote in my prior post (please revisit), I apparently make tile-based rendering more complex than it is. I'm not trying to be ignorant ;) but trying to get my facts straight.

This is just a guess, but I'd say that using tiles with an IMR is slightly more efficient in terms of cache access. A triangle is more likely to approximate a square than a scanline, so the FB memory used is likely to have more contiguous chunks.

Do remember that there is a good deal of complexity in a deferred renderer having an on-chip tile buffer and bins for the geometry. An IMR tiler eliminates this, but also eliminates most of the advantages.
 
A few quick extracts from Stephen Morphet's patent, which I figure applies to what MfA meant (I'm not reposting the insanely long link to it):

In a preferred embodiment, this screen is divided up into a number of regions called macro-tiles, in which each of these consists of a rectangular region of the screen composed of a number of smaller tiles. Memory in the display list is then divided into blocks and these are listed in a free store list. Blocks from the free store are then allocated to the macro-tiles as required. The tiling operation stores polygon parameter data and object pointers for surfaces in each block associated with each macro-tile in which they are visible. When the memory for the parameters fills up, or reaches some predefined threshold, the system selects a macro-tile, performs a z/frame buffer load, and renders the contents of the macro-tile before saving it using a z/frame buffer store operation. Upon completion of such a render, the system frees any memory blocks associated with that macro-tile, thereby making them available for further parameter storage. The z/frame buffer load and store operations are restricted to the macro-tiles that are actually rendered rather than every tile on the screen as was previously the case. Tiling of either the remainder of the current frame or of the next frame to be displayed then continues in parallel with macro-tile renders and the allocation of blocks of memory from the same pool to further macro-tiles. It is not necessary to provide any double buffering of the parameter list and thus the memory requirements of the system are further reduced.

When tiling of a frame of image data is complete, the system can then begin to tile the next frame, even while the rendering of the previous frame is still in progress. The allocation blocks associated with macro-tiles from the new frame must be distinct from those associated with macro-tiles from the previous frame. This can be achieved by maintaining a second set of blocks independent of those used by the previous frame. This would be achieved by maintaining independently a second set of region headers. Allocation blocks are taken from the same free store for all frames, which means that it is not necessary to double the size of the display list. This gives a further saving in memory usage.

Of course it also describes possibilities such as display list upper thresholds, pointer modifications, Z-compression, yadda yadda - I just can't quote it all.
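Very roughly, the parameter memory management it describes might look something like this (a simplified Python toy of my own, not the actual hardware; the block counts, threshold and victim-selection policy are all made up):

```
# Toy model of macro-tile parameter memory: blocks come from a shared free
# store, and when the pool runs low a macro-tile is rendered and its blocks
# are returned. All sizes and policies here are invented for illustration.
import random

FREE_STORE_BLOCKS = 64    # parameter-memory blocks in the shared pool
BLOCK_CAPACITY = 32       # primitives that fit in one block
THRESHOLD = 8             # render early once this few free blocks remain

free_store = list(range(FREE_STORE_BLOCKS))
macro_tiles = {}          # macro-tile id -> allocated block ids
fill = {}                 # macro-tile id -> primitives in its newest block

def render_macro_tile(tile):
    # z/frame buffer load, render, z/frame buffer store... then free the blocks.
    free_store.extend(macro_tiles.pop(tile))
    fill.pop(tile, None)

def bin_primitive(tile):
    # Store one primitive's parameters in the macro-tile it falls into.
    if fill.get(tile, BLOCK_CAPACITY) >= BLOCK_CAPACITY:
        if len(free_store) <= THRESHOLD:
            # Pool nearly exhausted: render the macro-tile holding the most blocks.
            victim = max(macro_tiles, key=lambda t: len(macro_tiles[t]))
            render_macro_tile(victim)
        macro_tiles.setdefault(tile, []).append(free_store.pop())
        fill[tile] = 0
    fill[tile] += 1

random.seed(0)
for _ in range(10000):                      # a made-up stream of primitives
    bin_primitive(random.randrange(16))     # spread across 16 macro-tiles
print(len(free_store), "blocks free,", len(macro_tiles), "macro-tiles still binned")
```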
 
Humus said:
If it bins and sorts the data, it's a deferred renderer. Now "tile-based" has come to mean deferred renderer in many people's minds though.
I would say a deferred renderer is one in which certain 'expensive' parts of the rendering process are delayed if there is a good chance that doing those calculations would be a complete waste of time (i.e. texturing is done only after all possible hidden-surface determination is performed).

You can still tile/bin objects and then render each of them in an immediate mode. This certainly simplifies the rendering process, yet would still save a reasonable amount of data transfer (although not nearly as much as a deferred system).
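Something like this, in other words (a toy Python sketch of binning without deferral; the tile size and triangle bounding boxes are made up):

```
# Sort triangles into screen tiles by bounding box, then render each tile's
# bin in ordinary submission order. Framebuffer/Z traffic stays local to one
# tile at a time, but hidden surfaces still get shaded unless early Z (or a
# proper deferred pass) rejects them. Purely illustrative numbers throughout.
TILE = 32  # pixels per tile side

def bin_triangles(triangles, width, height):
    # triangles: list of (xmin, ymin, xmax, ymax) screen-space bounding boxes
    cols, rows = (width + TILE - 1) // TILE, (height + TILE - 1) // TILE
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for xmin, ymin, xmax, ymax in triangles:
        for ty in range(int(ymin) // TILE, int(ymax) // TILE + 1):
            for tx in range(int(xmin) // TILE, int(xmax) // TILE + 1):
                if (tx, ty) in bins:
                    bins[(tx, ty)].append((xmin, ymin, xmax, ymax))
    return bins

bins = bin_triangles([(0, 0, 40, 40), (10, 10, 20, 20), (100, 5, 120, 25)], 128, 96)
for (tx, ty), tris in sorted(bins.items()):
    if tris:
        print(f"tile ({tx},{ty}): {len(tris)} triangle(s) rendered immediately")
```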
 
If you assume 1 million dual-textured triangles, each with a normal stored as a 3D set of texture coordinates, and a vertex-to-triangle ratio of 1:1, how much space is required to store the scene?

My estimate, assuming 10 32-bit floats per vertex, 3 32-bit vertex pointers, and a 32-bit state identifier per tri:

10 * 4 * 1M = 40MB for vertices,

4 * 4 * 1M = 16MB for tile bins (most conservative estimate).
You could store strips and get away with much closer to 1 index per triangle, as well as 1 state id per strip, so best case
4 * 1M = 4MB for tile bins.

You could cut it down further by making the size of the vertex pointer smaller, say 3 bytes instead of 4...

Anyway let's say roughly 50MB.

With 32-bit Z+stencil and 32-bit color, that would be about the same size as a framebuffer for 1280x1024 with 4x supersampling.

So - things don't seem all that bad for tilers even with 1 million tris per frame (comparable to the number of visible pixels).
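For anyone who wants to check the arithmetic, the same estimate boiled down to a few lines (taking MB as 10^6 bytes to match the round numbers above):

```
# Rough check of the scene-storage estimate above (MB = 10**6 bytes).
TRIS = 1_000_000
VERTS = TRIS                       # 1:1 vertex-to-triangle ratio assumed

vertex_bytes = VERTS * 10 * 4      # 10 32-bit floats per vertex     -> 40MB
bin_worst = TRIS * (3 + 1) * 4     # 3 indices + 1 state id per tri  -> 16MB
bin_best = TRIS * 1 * 4            # ~1 index per tri using strips   ->  4MB

low = (vertex_bytes + bin_best) / 1e6
high = (vertex_bytes + bin_worst) / 1e6
print(f"scene data: {low:.0f}MB - {high:.0f}MB")       # 44 - 56, roughly 50MB

# 1280x1024, 4x supersampled, 32-bit colour + 32-bit Z/stencil per sample:
fb = 1280 * 1024 * 4 * (4 + 4) / 1e6
print(f"supersampled framebuffer: {fb:.0f}MB")         # ~42MB, comparable
```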
 
If you want to do "scene collection" in parallel with actual rendering of the tiles, you might need to double buffer the scene data...
 
Hyp-X, no you don't - I suggest you read the patent!

Psurge, not a bad finger-in-the-air calc. Other things to allow for: the vertex ptrs/indices can be very small as you only need to point at vertices in your immediate locality; compression; backface and small-object culling (i.e. sub-pixel, non-pixel-centre tris can be culled as the rasterisation rules dictate that they should never be drawn).

On a side note, I'd hope that developers are starting to use indexed triangle lists and not strips these days, as they can get you down to 0.5 vertices/triangle. And unlike strips, the mesh can be reordered (using something like the facilities provided by D3DX) to take full advantage of vertex caching. This is going to be important if you're expecting to push these types of poly counts through any HW.
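To illustrate the 0.5 figure (a small sketch, assuming a regular grid mesh and an ideal vertex cache, so each shared vertex is transformed only once):

```
# Vertex-to-triangle ratio for an N x N grid drawn as an indexed triangle list.
def grid_ratio(n):
    vertices = n * n
    triangles = 2 * (n - 1) ** 2   # each grid quad splits into two triangles
    return vertices / triangles

for n in (4, 16, 64, 256):
    print(f"{n:>3} x {n:<3} grid: {grid_ratio(n):.3f} vertices per triangle")
# tends toward 0.5 vertices/triangle as the mesh grows
```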

John.
 