If NV30 uses tile-based rendering, will ATI convert too?

JohnH,

I also see no reason why some form of HZ shouldn't be used by the binning process. Binning by itself is a rough form of rasterization with no overdraw reduction and storage for every tri (~fragment) touching a tile (~pixel). Perhaps this step could adopt some of the performance improvements of IMRs?

At the very simplest level, it could simply maintain a min/max Z per tile (updated by primitives which completely cover the tile), and then reject anything whose Z min is greater than the tile's Z max.

Perhaps the Z min/max could be stored at some smaller subdivision than that of the tile... or the tiling could be done in a completely hierarchical fashion.
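To make that concrete, here's a minimal sketch of the rejection test I have in mind (all the names, and the single-zMax simplification, are mine):

Code:
// Per-tile max Z, tightened only by full-tile occluders.
// Assumes smaller Z = nearer.
struct TileZ {
    float zMax = 1.0f;          // farthest Z that might still be visible
};

struct BinnedTri {
    float zMin, zMax;           // Z extent of the tri over this tile
    bool  coversTile;           // true if the tri fully covers the tile
};

// Returns false if the tri can be rejected during binning.
bool binTriangle(TileZ& tile, const BinnedTri& tri) {
    if (tri.zMin > tile.zMax)
        return false;           // entirely behind known occluders
    if (tri.coversTile && tri.zMax < tile.zMax)
        tile.zMax = tri.zMax;   // new full-tile occluder: tighten the bound
    return true;
}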

Serge
 
JohnH said:
On a side note, I'd hope that developers are starting to use indexed triangle lists rather than strips these days, as they can get you down to 0.5 vertices/triangle. And unlike strips, the mesh can be reordered (using something like the utility provided by D3DX) to take full advantage of vertex caching. This is going to be important if you're expecting to push these kinds of poly counts through any HW.

With cache limits generally on the order of 10 vertices, I'd really like to see an algorithm that could get anywhere near 0.5 vertices/triangle in practice.

Of course, I do know that a good model will have 0.5 vertices/triangle. I'm just not sure that with today's small caches you can get close to that ratio with smart ordering.
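One way to check would be to simulate the cache directly; a quick sketch of a FIFO post-transform cache (my own naming, nothing vendor-specific):

Code:
#include <algorithm>
#include <deque>
#include <vector>

// Count how many vertices actually get transformed for an indexed
// triangle list, given a FIFO post-transform cache of the given size.
double vertsPerTri(const std::vector<int>& indices, size_t cacheSize) {
    std::deque<int> cache;
    size_t transformed = 0;
    for (int idx : indices) {
        if (std::find(cache.begin(), cache.end(), idx) == cache.end()) {
            ++transformed;              // miss: vertex is (re)transformed
            cache.push_back(idx);
            if (cache.size() > cacheSize)
                cache.pop_front();      // FIFO eviction
        }
    }
    return double(transformed) / (indices.size() / 3.0);
}

Feed it a regular grid in different submission orders and you can see exactly how close a 10-entry cache gets to 0.5.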
 
I haven't sat down and worked out the best you can get out of a small cache, but it's easy to get a better ratio than strips. I'm trying to remember what the difference between strips and optimised indexed tri lists was on a 9700; if I recall correctly, the latter was substantially faster.

Psurge, we've done some research on early Z checks in the tiler HW. I can't really say too much about it, other than that the results for purely tile-based min/max values weren't particularly good.

JohnH
 
JohnH, interesting... thanks for the info!

Anyway, I had another idea. Instead of storing verts and pointers to them, make the tile size small, say 8x8 pixels. For each tri, store one "vert" which contains the parameters, the data necessary for parameter interpolation in x and y, and a state ID.

Then, in the tile bin, you store a fixed-size triangle record containing a pointer to the parameters, a tile-relative position (x, y), Z, stencil, Z slope, and a 64-bit tile coverage mask (set bits correspond to pixels that are completely covered by the tri).
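Something like this, say (field widths are just guesses):

Code:
#include <cstdint>

// Hypothetical fixed-size bin record for one tri in an 8x8 tile.
struct TriRecord {
    uint32_t paramPtr;     // pointer/index into the parameter buffer
    uint8_t  x, y;         // tile-relative position
    uint8_t  stencil;
    float    z;            // reference Z at (x, y)
    float    dzdx, dzdy;   // Z slope across the tile
    uint64_t coverage;     // one bit per pixel of the 8x8 tile;
                           // set = pixel completely covered by the tri
};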

Processing is as follows:

Say the geometry unit G outputs tris in the described fashion.

Let G write its output straight to main memory (or a tile cache).

- Adding a tri to a bin doesn't immediately involve anything beyond a write to the appropriate bin and to the parameter buffer.

Alongside G, you have a unit which grabs tile bin data whenever some spare memory bandwidth is available.

- Concurrently with G, one bin at a time, perform triangle culling given the triangle records for the bin (Z, Z slopes, and coverage masks).

To save some write bandwidth, you could accumulate some fixed number of triangle records instead of writing them directly to the bins, then do occlusion culling on the accumulated records before writing them out.
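Roughly like this, reusing the TriRecord sketch above (simplified: a single reference Z stands in for a proper zMin/zMax derived from the slopes, and the coverage mask is treated as the tri's whole footprint):

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

// Cull accumulated records against each other before they are written
// to the bins. A record is dropped when every pixel it covers is also
// covered by nearer records. Conservative, and opaque geometry only --
// blended tris would have to bypass this.
void cullBatch(std::vector<TriRecord>& batch) {
    for (TriRecord& t : batch) {
        uint64_t occluded = 0;
        for (const TriRecord& o : batch)
            if (&o != &t && o.z < t.z)      // o lies in front of t
                occluded |= o.coverage;
        if ((t.coverage & ~occluded) == 0)
            t.coverage = 0;                 // fully hidden: mark as culled
    }
    batch.erase(std::remove_if(batch.begin(), batch.end(),
                    [](const TriRecord& t) { return t.coverage == 0; }),
                batch.end());
}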

Basically it's an A-buffer approach at a granularity above that of a single pixel.
---------
Alternatively, you could store bin data normally but, for each bin, keep track of, say, the 4 biggest/nearest occluders. Cache these on chip for a large number of tiles, and cull each generated triangle against them before writing to main memory.

Serge
P.S. out of curiosity - who do you work for?
 
JohnH said:
I haven't sat down and worked out the best you can get out of a small cache, but it's easy to get a better ratio than strips. I'm trying to remember what the difference between strips and optimised indexed tri lists was on a 9700; if I recall correctly, the latter was substantially faster.

Well, the nice thing about strips is that you can guarantee a minimum level of cache coherency. You could also optimize the strips using degenerate triangles and a strip that sort of goes back and forth across a surface. Have you looked at an implementation like this? (Of course, an optimal situation would be one in which you can batch multiple small triangle strips together.)

Of course, the primary question is, why do this instead of triangle lists? The main reason that I can see is that it might be better given the smaller bus/CPU usage at render time.
 
Let's say the chip has a 10 vertex FIFO cache.
And let's assume we have a really large regular mesh, the most basic one, with six triangles meeting at each vertex except on the edges of the mesh.
Then it's possible to get <0.6 vertices/triangle.

The best limit I see is (N-1)/(2*N-4) for cache size N, and an "infinite" mesh of the type I described. With a favourable mesh you could actually get arbitrarily close to 0.5.
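Here's the counting behind that number, as I figure it: cut the mesh into horizontal bands N-2 quads tall and walk each band column by column. Each new column costs N-1 transformed vertices (the previous column is still sitting in the cache) and completes 2(N-2) triangles, so

\[
\frac{\text{vertices}}{\text{triangles}} = \frac{N-1}{2(N-2)} = \frac{N-1}{2N-4} \longrightarrow \frac{1}{2} \quad (N \to \infty).
\]

For N = 10 that's 9/16 ≈ 0.56, which is where the "<0.6" above comes from.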

This is of course possible with both strips and triangle lists.

As far as I can see, the most efficient primitive would be a triangle list with a "primitive restart" index. Or even better, "primitive restart" indices that also say what kind of primitive follows.
 
FWIW -- NV2x does triangle setup faster on strips than it does on tri lists. It also has special-case logic to handle degenerate tris; assuming two degenerates to tie strips together, it's actually faster to submit the degenerates than it is to restart another strip.
 
Guess you can make triangle strips perform as well as indexed tri lists by inserting degenerates every time you want to turn a corner (I assume it's 2 to maintain winding order). For a smallish cache that's probably an awful lot of corners, so you end up throwing away the advantage of fewer indices per tri. Basically, indexed tri lists seem a whole lot simpler to use and easier to get the best possible performance out of.

ERP, interesting. Does it spot the degenerate triangles algorithmically, or do you have to flag them in some way? Very surprised that NV2x wouldn't attempt to take advantage of its vertex cache with indexed tri lists; seems plain stupid to me.

PSurge, have a guess, there's not too many companies out there that are pro tiling...

John.
 
JohnH said:
PSurge, have a guess, there's not too many companies out there that are pro tiling...

Well, NVIDIA's GigaPixel engineers, 3Dlabs, PowerVR, and the likely pool of people who've floated in and out of those - who isn't pro-tiling these days! ;)

(Of course, I know who you work for! :)).
 
JohnH said:
ERP, interesting. Does it spot the degenerate triangles algorithmically, or do you have to flag them in some way?
Sidenote: you can <flag> those on the PS2 HW and gain something... unfortunately the PS2 GS still sucks ;)

ciao,
Marco
 
Write an external, full-size z-buffer and frame buffer, clearing the scene buffer, and then writing a new set of tiles. This is apparently what the Kyro line does if this particular problem occurs.

AFAIK that's not what happens with Kyro when this problem occurs.

From what I heard it goes like this.

- Fill the bin.
- Render that part of the frame and write it to video ram.
- Clear the bin.
- Fill the bin with the remaining geometry from that frame.
- Render that part of the frame and write it to video ram.
- Combine the two pieces in video ram.

It's called the scene manager.

What you described can be done with Kyro, but it's just an option in the Kyro settings (enable external Z and framebuffer) and is not done automatically AFAIK.
 
Teasy said:
Write an external, full-size z-buffer and frame buffer, clearing the scene buffer, and then writing a new set of tiles. This is apparently what the Kyro line does if this particular problem occurs.

AFAIK that's not what happens with Kyro when this problem occurs.

From what I heard it goes like this.

- Fill the bin.
- Render that part of the frame and write it to video ram.
- Clear the bin.
- Fill the bin with the remaining geometry from that frame.
- Render that part of the frame and write it to video ram.
- Combine the two pieces in video ram.

It's called the scene manager.

What you described can be done with Kyro, but it's just an option in the Kyro settings (enable external Z and framebuffer) and is not done automatically AFAIK.

It's impossible to combine two external framebuffers without also having an external z-buffer. And the Kyro couldn't possibly depth-sort the entire scene while working with only part of a bin without the geometry being sent twice. What you're describing will also have significant problems whenever any sort of blending takes place.
 
It's impossible to combine two external framebuffers without also having an external z-buffer. And the Kyro couldn't possibly depth-sort the entire scene while working with only part of a bin without the geometry being sent twice. What you're describing will also have significant problems whenever any sort of blending takes place.

I'll leave any discussion of what's possible and what isn't to someone at IMGTEC. But that is more or less how the scene manager was described, AFAIR, and that is what Kyro is supposed to do when the bin overflows.
 
ERP, interesting. Does it spot the degenerate triangles algorithmically, or do you have to flag them in some way? Very surprised that NV2x wouldn't attempt to take advantage of its vertex cache with indexed tri lists; seems plain stupid to me.

It's automatic; I assume it looks for shared vertex positions.
It still takes advantage of the vertex cache with indexed tri lists, in fact with any indexed primitive.

Strips just have a higher potential upside: if the tris are small and the shader relatively simple, strips will be faster. For a long time we believed that strips would provide no net gain in speed, since the extra indices are not significant bandwidth and setup is rarely a bottleneck in anything but a benchmark; in fact on NV20 (which has somewhat different setup timing) our tri-list code outperformed our tri-strip code in game by 5-6%.

NV2A, and I assume NV25, basically invert that performance trend: in game, strips outperform tri lists.

There is another issue which has to do with CPU usage: if you're submitting indices to the pushbuffer with the CPU (and you have no choice but to do this on a PC), then the 3x index count (more like 2.5x in reality) will most likely cause you to become CPU bound when drawing high-density meshes.
 
Degenerate triangles can be detected by just comparing vertex indices, with just a handful of gates.
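In code the whole test is just three compares (this catches index-level degenerates, the kind used to stitch strips; spotting coincident positions under different indices would need an actual position compare):

Code:
// A tri is degenerate (zero area) whenever two of its indices match.
bool isDegenerate(unsigned i0, unsigned i1, unsigned i2) {
    return i0 == i1 || i1 == i2 || i0 == i2;
}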

But having optimizations for degenerate triangles doesn't mean the hardware can't handle tri-lists efficiently.

It's up to the application writer to use it as well as possible.
A) Really sloppy - random triangles (bad wrt caching and indices)
B) Some sense - long strips (use at least 2 out of 3 vertices from cache)
C) Getting good - cache optimized tri-list (optimal for cache)
D) Best - cache optimized tri-strip (likely as good as 'C' for cache, better for indices)

I agree that it's likely easier to write the program that generates 'C' than the program that generates 'D'. Nvidia have a program that stripifies meshes with their cache in mind (it at least tries to do 'D'). I don't know if they actually hit the optimum. But if the IHVs provide the software to optimize the mesh, it's probably decent. It should of course be possible to call the optimization at run/install-time to optimize for the card you have.

'E' would be if you had primitive restart indices that could say things like "start a left-winded fan", "start a right-winded fan", "start a tri-strip that begins as left-winded", ...
You could even have an index saying "start a tri-list" if there actually is a difficult part of the mesh that works best with those.
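A hypothetical decode loop for 'E' might look like this (every token value and name here is invented; no current hardware does anything like it):

Code:
#include <cstdint>
#include <vector>

// Reserved index values acting as typed restart tokens.
constexpr uint32_t RESTART_LIST  = 0xFFFFFFFDu;  // "start a tri-list"
constexpr uint32_t RESTART_FAN   = 0xFFFFFFFEu;  // "start a fan"
constexpr uint32_t RESTART_STRIP = 0xFFFFFFFFu;  // "start a tri-strip"

enum class Mode { Strip, Fan, List };

void decode(const std::vector<uint32_t>& stream) {
    Mode mode = Mode::Strip;
    std::vector<uint32_t> v;                 // vertices since the last restart
    for (uint32_t i : stream) {
        if (i >= RESTART_LIST) {             // one compare spots every token
            mode = (i == RESTART_STRIP) ? Mode::Strip
                 : (i == RESTART_FAN)   ? Mode::Fan : Mode::List;
            v.clear();
            continue;
        }
        v.push_back(i);
        size_t n = v.size();
        if (n < 3) continue;
        switch (mode) {
        case Mode::Strip: /* emit v[n-3], v[n-2], v[n-1]; winding alternates */ break;
        case Mode::Fan:   /* emit v[0], v[n-2], v[n-1] */                       break;
        case Mode::List:  if (n % 3 == 0) { /* emit last three */ }             break;
        }
    }
}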

[Edit]
Dang, ERP beat me to it.

One sidenote:
What happens to cache efficiency if you apply tessellation to your optimized tri-list/tri-strip? Let's say you've optimized it to close to 0.5 vertices/tri, and then TruForm it "one notch", giving you four times as many triangles/vertices. I assume that the order inside each TruFormed triangle is cache optimized, but it will probably break the optimizations you've done, giving you some 0.75-1.5 vertices/tri depending on how badly it breaks them.
OK, maybe it might be possible to keep it closer to 0.75 verts/tri most of the time, so maybe it's not much to talk about.
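The range falls out of how much midpoint sharing survives, by my count. One notch of 1:4 subdivision adds three edge-midpoint vertices per input triangle, and a large regular mesh has E ≈ 3V edges and T ≈ 2V triangles, so:

\[
\underbrace{\frac{6}{4} = 1.5}_{\text{each patch fully isolated}} \qquad
\underbrace{\frac{3}{4} = 0.75}_{\text{corners cached, midpoints not shared}} \qquad
\underbrace{\frac{V+E}{4T} \approx \frac{4V}{8V} = 0.5}_{\text{perfect sharing}}
\]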
 
Teasy said:
I'll leave any discussion of what's possible and what isn't to someone at IMGTEC. But that is more or less how the scene manager was described, AFAIR, and that is what Kyro is supposed to do when the bin overflows.

And I guarantee you that this "scene manager" needs to have an external z-buffer to work. There's absolutely no way around it, not with the way the Kyro does things.

Basically, the only way that the Kyro might be able to get by without using the z-buffer during scene buffer overflow would be if the triangles were sorted before being sent to the video card. With the Kyro, all sorting/binning is done on-chip. Any future chip with hardware T&L would absolutely need to do all of the sorting/binning on-chip.
 
Basic said:
One sidenote:
What happens to cache efficiency if you apply tessellation to your optimized tri-list/tri-strip? Let's say you've optimized it to close to 0.5 vertices/tri, and then TruForm it "one notch", giving you four times as many triangles/vertices. I assume that the order inside each TruFormed triangle is cache optimized, but it will probably break the optimizations you've done, giving you some 0.75-1.5 vertices/tri depending on how badly it breaks them.
OK, maybe it might be possible to keep it closer to 0.75 verts/tri most of the time, so maybe it's not much to talk about.

That's one thing that I've been wondering too. Of course, it should be possible to optimize rendering order within a tessellated triangle for tri strips, so that the hardware attempts to take a tri strip and turn it into another tri strip with many more triangles (obviously, if the original strip is straight, the tessellated strip will sort of turn back on itself over and over).

I also wonder if there's been any research done on where to put the tessellation. It seems to me that it doesn't necessarily need to go before transformation/vertex programs.
 
I suspect the tessellation step present in current 3D pipelines was placed just before T&L because, that way, it doesn't need to modify the operation of any other unit in any major programmer-visible way. An example: to do tessellation of N-patches, you need the per-vertex normal vectors. Placing an N-patch tessellator after T&L would require the T&L to pass the (transformed) normal vector on to later stages of the pipeline along with the other data. Also, the projection matrix transform would probably need to be spliced out into a separate stage after tessellation, or else the normal vector won't make much sense. Finally, you would need to somehow redo lighting for the polygons produced by the tessellation.
 
arjan de lumens said:
I suspect the tessellation step present in current 3D pipelines was placed just before T&L because, that way, it doesn't need to modify the operation of any other unit in any major programmer-visible way. An example: to do tessellation of N-patches, you need the per-vertex normal vectors. Placing an N-patch tessellator after T&L would require the T&L to pass the (transformed) normal vector on to later stages of the pipeline along with the other data. Also, the projection matrix transform would probably need to be spliced out into a separate stage after tessellation, or else the normal vector won't make much sense. Finally, you would need to somehow redo lighting for the polygons produced by the tessellation.

Which is why it would make sense to place the tessellation between transform and lighting, for just basic T&L calcs. With a vertex program, it gets much more challenging. If we do see anything but pre-T&L tessellation, it will probably be as a set of new instructions for programmable hardware.

For example, the current two programs (fragment/vertex) could be split up into four:

1. Pre-tessellation vertex program
2. Tessellation program
3. Post-tessellation vertex program
4. Fragment program

Obviously every conceivable operation could be done just by executing programs 2-4, but the pre-tessellation vertex program may help by doing some of the calculations on fewer vertices.
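As plain signatures, the split might look like this (all types and names invented for illustration; it's the proposal above, not any real API):

Code:
#include <vector>

struct ControlVertex { /* position, normal, user attributes */ };
struct Patch         { /* a few ControlVertex plus tessellation factors */ };
struct Vertex        { /* fully transformed/lit vertex for rasterization */ };
struct Fragment      { /* interpolated per-pixel inputs */ };
struct Color         { float r, g, b, a; };

// 1. Runs once per control vertex: do the cheap work while counts are low.
ControlVertex preTessVS(const ControlVertex& in);

// 2. Expands a patch into many generated control vertices.
std::vector<ControlVertex> tessellate(const Patch& p);

// 3. Runs on every generated vertex: transform, lighting, etc.
Vertex postTessVS(const ControlVertex& in);

// 4. Per-fragment shading, as today.
Color fragment(const Fragment& f);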
 
And I guarantee you that this "scene manager" needs to have an external z-buffer to work.

Yes, there may need to be some sort of Z-buffer there (unless they do sort before sending to the chip with the scene manager, which is perfectly possible) to connect the two pieces. But that is still not what you described earlier... unless I misunderstood what you meant by the comment I originally quoted?
 