TBR bandwidth vs Immediate EZR


3dcgi
12-Mar-2002, 01:29
Does a tile based renderer like the Kyro reduce texture bandwidth more than an immediate mode renderer with an early z reject unit or does it just save z bandwidth?

I'm thinking that a TBR doesn't have any extra texture benefit in this case, but I haven't really thought about it too hard yet.

Dave
12-Mar-2002, 02:48
The only way that the two can be comparable is if the scene has a strict front-to-back rendering order, and even then a deferred architecture will have a slight benefit.
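
To make the ordering point concrete, here is a minimal Python sketch. It assumes an idealized early Z reject unit that skips texturing for any fragment that fails the depth test, an idealized deferred renderer that textures only the final visible fragment, and opaque surfaces only. The function names are made up for illustration and do not describe any real hardware.

```python
def shaded_fragments_early_z(depths):
    """Count fragments textured at one pixel by an immediate mode renderer
    with early Z reject. `depths` lists fragment depths in submission order
    (smaller means closer). Only fragments that pass the test get textured."""
    nearest = float("inf")
    shaded = 0
    for z in depths:
        if z < nearest:      # passes the depth test, so it is textured
            nearest = z
            shaded += 1
    return shaded


def shaded_fragments_deferred(depths):
    """An idealized deferred renderer resolves visibility first, so it
    textures one fragment per covered pixel regardless of submission order."""
    return 1 if depths else 0


front_to_back = [1.0, 2.0, 3.0, 4.0]
back_to_front = [4.0, 3.0, 2.0, 1.0]
mixed = [3.0, 1.0, 4.0, 2.0]

for name, order in [("front-to-back", front_to_back),
                    ("back-to-front", back_to_front),
                    ("mixed", mixed)]:
    print(name, shaded_fragments_early_z(order), shaded_fragments_deferred(order))
```

Under these assumptions the two only match for the strict front-to-back case; any other order makes the early Z renderer fetch textures for fragments that are later overwritten.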

MfA
12-Mar-2002, 07:34
For framebuffer bandwidth it will always win out, ignoring very small tiled textures ... though personally I would not like hardware for which performance would break down as soon as you have lots of individual texturing, so you need the bandwidth anyway.

Bandwidth needed for geometry is shifting ... increasing against the tiler's favour (since it can triple it). With immediate mode APIs tiling has its greatest advantage with relatively low poly counts and relatively simple pixel operations with multi-pass rendering, lots of back-to-front overdraw and with anti-aliasing. Or in other words their greatest advantage would have been in the past (they fucked up royal).

The time is running out for them, they have to hit it big soon and introduce some major API extensions to be able to even try to compete in the future. (For instance, if developers start heavily using NVIDIA's occlusion culling support where will that leave IMG? They would need to include in the API the ability to associate bounding volumes with geometry to be able to do the same thing as developers do in software with immediate mode rendering and feedback ... and when I say API I of course mean D3D, so they need leverage with m$, this would also nicely solve the geometry problem since it could just tile bounding volumes.)

Entropy
12-Mar-2002, 14:58
MfA wrote:
"Bandwith needed for geometry is shifting ... increasing against the tilers favour (since it can tripple it). With immediate mode API's tiling has its greatest advantage with relatively low poly counts and relatively simple pixel operation with multi-pass rendering, lots of back to front overdraw and with anti-aliasing. Or in other words their greatest advantage would have been in the past (they fucked up royal). "

As far as I can see your list of pros and cons seems OK.
But while poly counts most definitely are rising, I can also see multi-pass rendering increasing, overdraw increasing as complex outdoor areas and lots of mobile players/creatures/objects gain more widespread use, and anti-aliasing that gives good framerates is on most everybody's wish list.

So by my limited understanding, some of the benefits of TBRs will continue to be valued and perhaps even increase a bit in importance.

Not that it matters one whit if no one brings compelling hardware to the market.

Entropy

elimc
13-Mar-2002, 01:37
So will TBR renderers actually slow down games in the future if they have high enough polygon counts?

3dcgi
13-Mar-2002, 03:23
The only way that the two can be comparable is if the scene has a strict front-to-back rendering order, and even then a deferred architecture will have a slight benefit.

I guess a flaw in my logic was that I was automatically thinking of strict front-to-back ordering, but of course this isn't the case in the real world.

I've been thinking of a feature that deferred architectures could implement; maybe you all have an idea whether it would work or not, or whether it can already be done.

Currently deferred renderers transform all polygons to screen space before rasterization. Could this data be read by the CPU so host effects could be performed, with the results being written back to the graphics card? I'm sure developers could think of something to do with this kind of flexibility. Of course, an argument against this is to just make pixel shaders flexible enough to do everything someone might think of to do.

Nexus
13-Mar-2002, 04:09
So will TBR renderers actually slow down games in the future if they have high enough polygon counts?

No, because just like an IMR it can be built to satisfy future high poly needs. The two disadvantages a TBR has with high poly counts are storage space for the polys (several MB, not a problem with today's 64MB+ cards) and more work for the hidden surface removal unit (ISP). The latter does highly parallel work, so you can easily throw more transistors at it to give it the power to cope with more polys.

You may find Kristof's PowerVR article interesting:
http://216.12.218.25/domain/www.beyond3d.com//articles/tilebasedrendering/index1.php

fremin
15-Mar-2002, 00:07
storage space for the polys (several MB, not a problem with today's 64MB+ cards)

This isn't a problem now, but it will probably become one in the future if PVR doesn't address the issue. Don't get me wrong, I don't think it will affect their currently available boards, or boards set for the immediate future for that matter, but the fact is that if they want to survive in this market they will eventually need to fix this problem, since poly counts will increase far more than memory in the future (once we hit 128MB or 256MB I foresee us sticking with it for a while, or ditching it altogether for a different memory architecture). I don't think this will be a problem for years, I just believe that it will have to be addressed sooner or later. (I heard 3dfx/Gigapixel had a solution to this. Does anyone know if there is any validity to this claim?)

elimc
15-Mar-2002, 02:22
What about when we get to DX9? Won't it introduce more HOS? This would help reduce the polys by quite a bit, wouldn't it?

mboeller
15-Mar-2002, 10:35
Based on the article from Kristof, the scene buffer requirements seem really high, so I made a crude calculation:

20 million vertices/sec @ 60fps => ~334000 vertices/frame
One vertex is around 64 bytes
One pointer is a 32-bit number

=>

2 x 64byte x 334000 + (32/8)byte x 334000 = ~ 42 Mbyte !! :o

2 x 42Mbyte @ 60fps (read + write ) = ~ 5 GB/sec bandwidth !! :o
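
The arithmetic above can be re-run with a short Python sketch. All inputs are the post's own round figures; the leading factor of 2 on the vertex data is taken from the post as-is, and reading it as double buffering of the scene data would be an assumption. The constant and function names are made up for illustration.

```python
VERTS_PER_SEC = 20_000_000   # T&L throughput assumed in the post
FPS = 60
BYTES_PER_VERTEX = 64
BYTES_PER_POINTER = 4        # one 32-bit pointer per vertex
VERTEX_COPIES = 2            # the "2 x" factor from the post (meaning assumed)


def scene_buffer_bytes(verts_per_frame):
    """Scene buffer size for one frame, following the post's formula."""
    per_vertex = VERTEX_COPIES * BYTES_PER_VERTEX + BYTES_PER_POINTER
    return per_vertex * verts_per_frame


verts_per_frame = VERTS_PER_SEC // FPS
size = scene_buffer_bytes(verts_per_frame)
bandwidth = 2 * size * FPS   # written once and read once per frame

print("vertices per frame:", verts_per_frame)
print("scene buffer (MiB):", round(size / 2**20, 1))
print("bandwidth (GB/s):  ", round(bandwidth / 1e9, 1))
```

The result lands at roughly 42 MiB and a little over 5 GB/s, so the figures follow from the stated inputs; whether those inputs are realistic is a separate question.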


These numbers are rather high. Could this be correct??

Based on these numbers (if correct!) I can understand that m$ uses an IMR in the XBox, because the bandwidth demand for the scene buffer alone would exceed the useful bandwidth of the XBox. And the scene buffer would need most of the memory of the XBox, so only a small amount would be left for the game and content.


Manfred

Dave Baumann
15-Mar-2002, 10:47
What about when we get to DX9? Won't it introduce more HOS? This would help reduce the polys by quite a bit, wouldn't it?

The problem is – what will they store in the bin? Will the bin store post tessellated polygon information, or the pre tessellated algorithm?

Storing the pre-tessellated algorithm data in the bin will reduce poly load for the bin, but then you have the question of what to do when you actually start rendering the tiles. If the HOS data covers many tiles then you either have to tessellate all the HOS data for all the tiles it covers once you meet the first tile that contains some HOS information, or you re-tessellate several times for each individual tile, which could increase the poly load on the T&L.

Of course, this assumes you have an architecture that can tessellate and transform HOS in hardware; KYRO, for instance, could only ever store the post-tessellated information in the bin, as the CPU has to tessellate and transform everything.

Kristof
15-Mar-2002, 11:36
These numbers are rather high. Could this be correct??


You're forgetting quite a few things like clipping, backface culling, actual storage technique/format, realistic throughput of today's and tomorrow's TnL engines etc...

K-

mboeller
15-Mar-2002, 12:28
These numbers are rather high. Could this be correct??


You're forgetting quite a few things like clipping, backface culling, actual storage technique/format, realistic throughput of today's and tomorrow's TnL engines etc...

K-

I thought so myself.

Can you give a better, more accurate example (based on the 20 million vertices/sec)? Or is this proprietary knowledge? A real example would help to see the possibilities and drawbacks of TBRs better. I think quite a lot of people will use simplified examples like mine above to calculate the storage and bandwidth demands of a TBR and come to the same wrong conclusions.


Manfred

MfA
15-Mar-2002, 16:06
For a new console it would be trivial to shift the burden from storage to computation ... just let the developer send stuff in tile order, no storage needed, and hierarchical frustum culling ain't that expensive.

Roger Kohli
15-Mar-2002, 19:48
Would that work if you wanted to use the GPU's T&L unit?

MfA
15-Mar-2002, 21:34
Yes.

Humus
15-Mar-2002, 21:45
Exactly what is "just let the developer send stuff in tile order" supposed to mean?

mboeller
15-Mar-2002, 22:12
For a new console it would be trivial to shift the burden from storage to computation ... just let the developer send stuff in tile order, no storage needed, and hierarchical frustum culling ain't that expensive.

OK; my example was not good, because on a closed platform you could optimise for the specific chip; but how can you work around this on the PC? Is it possible to do it in drivers (I suppose not)?

Roger Kohli
15-Mar-2002, 23:26
MfA: Yes.

How do you know which tile something will appear in before you have applied the transformations?

MfA
16-Mar-2002, 00:22
You don't, but the developer can make a conservative guess ... that's what frustum culling is about. A conservative guess is all you need to make it work.

To make it work well you need a good enough guess of course. I said nothing about overhead, I don't want to get into it really ... I just wanted to say that it's easy to transform it into a (fairly tractable) computational problem instead of a storage one for a closed platform. I think computational cost will drop fast enough with increasing tile size to make it an attractive option; I'm sure others will feel otherwise.

arjan de lumens
16-Mar-2002, 04:42
Having the developer pass polygon data in tile order doesn't really sound like a good idea. The problem is that the exact tile set covered by each polygon cannot be computed until after at least the transform part of T&L is performed. You can do a conservative approximation by computing, for each object in a scene, the tile set covered by its bounding box/sphere/whatever, and using that data to perform object-level binning, but the result will be that you pass every polygon in the object for every tile covered by the object's bounding box - which is incredibly wasteful in terms of memory bandwidth and T&L (as you keep reading in and T&L-ing the same polygons over and over again) once you get objects covering more than about 2 tiles; at that point, it would be cheaper wrt memory bandwidth usage to just do traditional post-T&L polygon binning.

Of course, you can split objects into smaller sub-objects to get around the efficiency problem, but then you need either really small sub-objects or really large tile buffers (so that the average sub-object dimensions are less than 1/2 of the average tile dimensions). Small sub-objects will cause large memory requirements, and large tile-buffers are expensive. Also, you still have to do a lot of software transforms to get the bounding boxes for all the sub-objects.
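
A minimal Python sketch of the object-level binning described above, assuming a screen-space axis-aligned bounding box per object and the 32x16 tile size quoted for the Kyro elsewhere in this thread. The screen size and all function names are hypothetical, chosen only to illustrate the cost argument.

```python
TILE_W, TILE_H = 32, 16   # Kyro tile size as quoted in this thread


def tiles_covered(xmin, ymin, xmax, ymax, screen_w=1024, screen_h=768):
    """Return the (tx, ty) tiles touched by a screen-space bounding box.
    Conservative: every tile the box overlaps is reported, even if no
    polygon of the object actually lands in it."""
    # Clamp the box to the screen; a fully off-screen box covers no tiles.
    xmin, ymin = max(xmin, 0), max(ymin, 0)
    xmax, ymax = min(xmax, screen_w - 1), min(ymax, screen_h - 1)
    if xmin > xmax or ymin > ymax:
        return []
    return [(tx, ty)
            for ty in range(int(ymin) // TILE_H, int(ymax) // TILE_H + 1)
            for tx in range(int(xmin) // TILE_W, int(xmax) // TILE_W + 1)]


def resubmission_factor(bbox):
    """How many times the object's polygons are sent and T&L-ed if the
    whole object is resubmitted for every tile its box touches."""
    return len(tiles_covered(*bbox))


small_object = (4, 4, 20, 12)      # fits inside one tile
larger_object = (10, 10, 70, 20)   # spans 3 tiles across, 2 down

print(resubmission_factor(small_object))
print(resubmission_factor(larger_object))
```

Even a modest 60x10 pixel box straddles six tiles here, which is the repeated read-and-transform cost the post objects to; smaller sub-objects or larger tiles pull the factor back toward 1.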

MfA
16-Mar-2002, 07:47
You would use hierarchies, and memory requirements would always be swamped by the lowest level of the hierarchy (the actual vertices), so I don't quite see how that's an issue.

And you don't have to transform stuff multiple times, you can always store what belongs to other tiles till you render them if it makes sense. Storage cost would be minimal compared to a full screen display list.

Dave B(TotalVR)
16-Mar-2002, 14:07
I suppose this would be a good point to turn people to my 'Tilers and High poly Counts' article at www.powervr.org.uk.

Well, I would but me bloody site has run out of bandwidth again:( Should be up soon.

Dave

arjan de lumens
17-Mar-2002, 02:43
Some comments to the article (which I hadn't read before):

12 bytes per vertex sounds awfully little when you have to keep gouraud colors and texture coordinates around on a per-vertex basis. I'd expect backface culling to cull a little more than 50% of all polygons and thus a little less than 50% of all vertices, so you could more reasonably assume 24 bytes per vertex (and half of them culled - this would probably apply equally to the 3dmark2001 test and 'gloom3'.) You don't seem to take into account that the buffers for vertex data have to be written to as well as just read from, which doubles the figure from 1.4 GBytes/sec to 2.8, which still does not take into account vertex pointers and vertices read more than once (which could double the number again). Also: 1600 * 1200pixels * 32 BYTES per pixel * 60 fps * 2 = 7 GBytes/sec matches the number you state for framebuffer traffic, but shouldn't that be 32 BITS rather than 32 BYTES ...? Same applies for the texture bandwidth number as well.
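
The bits-versus-bytes question can be checked with a few lines of Python. The formula is the one written out above; `passes` is just a made-up name for its trailing factor of 2.

```python
def framebuffer_bandwidth(width, height, bytes_per_pixel, fps, passes=2):
    """Bytes per second moved for the framebuffer, following the
    width * height * pixel size * fps * 2 formula from the post."""
    return width * height * bytes_per_pixel * fps * passes


with_32_bytes = framebuffer_bandwidth(1600, 1200, 32, 60)  # 32 BYTES per pixel
with_32_bits = framebuffer_bandwidth(1600, 1200, 4, 60)    # 32 BITS = 4 bytes

print(with_32_bytes / 1e9, "GB/s if a pixel were 32 bytes")
print(with_32_bits / 1e6, "MB/s with 32-bit pixels")
```

With 32-bit pixels the same formula gives a bit over 900 MB/s, an eighth of the 7 GB/s figure.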

And to MfA:
Actually, when I think about it, doing binning of objects larger than single polygons may be a rather good idea - you could then defer most of T&L (all of it except T for bounding boxes for each 'object') until you actually start to render each tile, much as in IMRs. This would reduce memory usage and traffic substantially, as you would no longer need to buffer all T&Led vertices in off-chip memory all the time. Bin sizes would be much smaller too. With some caching of vertices (before or after T&L), you may even get near-IMR level geometry performance, even in the case of memory bandwidth being the main bottleneck. And it might not require developer support either - the driver could very well process vertex arrays into suitable 'object' hierarchies.

A variant of this method would be to defer tessellation of Higher-Order Surfaces until after binning, till just before rendering; such a scheme may be needed to keep TBRs from choking on the huge polygon counts that HOS tessellation tends to produce.

MfA
17-Mar-2002, 02:59
Without the developer being able to give bounding volume hints you aren't going to be able to deduce where HOS's or vertex buffers will end up on the screen on the fly; analyzing the vertex shader to see what the hell it does is too much work :(

arjan de lumens
17-Mar-2002, 03:59
Analyzing a vertex shader program to determine whether a vertex is passed through only a simple transform for xyz data is doable, but a bit difficult. If the vertex shader program does anything more fancy (e.g. matrix palette skinning) then computing a bounding box without transforming every vertex is not really doable without said hinting anymore (it's still possible to defer the parts of the program that do not affect the vertex xyz coordinates).

mboeller
17-Mar-2002, 11:40
IMHO;

the best would be to have a transparent form of deferred polygon T&L engine built in hardware (transparent for all DX and OpenGL apps).

With such an engine the deferred portion would go one step up from the pixels to the polygons, and so maybe this could solve the problems deferred renderers will face in the future.
Something like the KAGE-engine or the REVI-engine from Fluidstudios. Fluidstudios even advertised their engine as transparent and able to be put into hardware. But the information about the REVI-engine is gone. :(


Manfred

MfA
17-Mar-2002, 12:00
Even just visible polygons present too big a load if you get a polygon per pixel.

Dave B(TotalVR)
17-Mar-2002, 12:28
"12 bytes per vertex sounds awfully little when you have to keep gouraud colors and texture coordinates around on a per-vertex basis."

It's what I calculated from real world results.

"I'd expect backface culling to cull a little more than 50% of all polygons and thus a little less than 50% of all vertices, so you could more reasonably assume 24 bytes per vertex (and half of them culled - this would probably apply equally to the 3dmark2001 test and 'gloom3'.)"

Why? The 3Dmark 2001 test is a synthetic polygon throughput test. OK, there will undoubtedly be backfacing polygons, but I can't see that being an enormous percentage. I never did manage to find out a figure for that though.


"You don't seem to take into account that the buffers for vertex data have to be written to as well as just read from, which doubles the figure from 1.4 GBytes/sec to 2.8,"


IMRs have to do that write too, don't they? Because they buffer the vertex information when doing TnL, I thought.

"which still does not take into account vertex pointers and vertices read more than once (which could double the number again)."

The 6MB buffer is actually two 3MB buffers, one for vertices and one for vertex pointers. Also, why would you need to read the same vertex twice?

"Also: 1600 * 1200pixels * 32 BYTES per pixel * 60 fps * 2 = 7 GBytes/sec matches the number you state for framebuffer traffic, but shouldn't that be 32 BITS rather than 32 BYTES ...? Same applies for the texture bandwidth number as well."

I'm not entirely sure where you are referring to here (I just got up :evil: )

Dave

arjan de lumens
17-Mar-2002, 14:02
"12 bytes per vertex sounds awfully little when you have to keep gouraud colors and texture coordinates around on a per-vertex basis."

It's what I calculated from real world results.

Still sounds very little - you can't really stuff enough per-vertex data (x,y,w, 2 rgba colors, 1 or more sets of texture coordinates) into just 12 bytes. I'm still guessing that backface culling is taking place, affecting the numbers.


"I'd expect backface culling to cull a little more than 50% of all polygons and thus a little less than 50% of all vertices, so you could more reasonably assume 24 bytes per vertex (and half of them culled - this would probably apply equally to the 3dmark2001 test and 'gloom3'.)"

Why? The 3Dmark 2001 test is a synthetic polygon throughput test. OK, there will undoubtedly be backfacing polygons, but I can't see that being an enormous percentage. I never did manage to find out a figure for that though.


Took a look at the 3dmark2001 high-polygon test - it looks to me like a scene where it would be difficult for the application to avoid sending polygon data for surfaces that would be backface culled. So I will still estimate ~50% polygons backface culled. (And, the way I see it, the ideal synthetic test for T&L performance would be a test which resulted in 100% backface culling, giving the renderer zero work. Would be awfully vulnerable to driver-level cheats, though.)

"You don't seem to take into account that the buffers for vertex data have to be written to as well as just read from, which doubles the figure from 1.4 GBytes/sec to 2.8,"

IMRs have to do that write too, don't they? Because they buffer the vertex information when doing TnL, I thought.


AFAIK, there is no reason why an IMR T&L unit would ever need to write transformed vertices to off-chip memory - instead, it just writes the vertices to a small on-chip buffer, where the renderer part can come and pick them up as needed. Whereas a regular tiler with T&L would need to write all transformed vertices to memory and then read them again later.

"which still does not take into account vertex pointers and vertices read more than once (which could double the number again)."

The 6MB buffer is actually two 3MB buffers, one for vertices and one for vertex pointers. Also, why would you need to read the same vertex twice?

You would need to read a vertex multiple times whenever a vertex is shared between multiple polygons (although this should be cached for triangle strips and fans) or the polygons that it belongs to are split across multiple tiles.


"Also: 1600 * 1200pixels * 32 BYTES per pixel * 60 fps * 2 = 7 GBytes/sec matches the number you state for framebuffer traffic, but shouldn't that be 32 BITS rather than 32 BYTES ...? Same applies for the texture bandwidth number as well."

I'm not entirely sure where you are referring to here (I just got up)

On page 3 of your article you state "Now, at 60 FPS and in a resolution of 1600x1200x32 thats 7GB/s of bandwidth for the framebuffer + RAMDAC". I'm just trying to figure out where you could possibly get the 7 GB/s figure from.

Dave B(TotalVR)
18-Mar-2002, 14:06
Still sounds very little - you can't really stuff enough per-vertex data (x,y,w, 2 rgba colors, 1 or more sets of texture coordinates) into just 12 bytes. I'm still guessing that backface culling is taking place, affecting the numbers.

Well, if you think about it, it would be very easy to have simple compression methods for this data. For instance, you know what tile it is in and each tile is only 32x16 pixels in size. How accurate do you really need to be? I am quite sure the vertices and vertex pointers are stored in a unique fashion (probably a patented fashion).


Took a look at the 3dmark2001 high-polygon test - it looks to me like a scene where it would be difficult for the application to avoid sending polygon data for surfaces that would be backface culled. So I will still estimate ~50% polygons backface culled. (And, the way I see it, the ideal synthetic test for T&L performance would be a test which resulted in 100% backface culling, giving the renderer zero work. Would be awfully vulnerable to driver-level cheats, though.)

Well, that is a possibility.

AFAIK, there is no reason why an IMR T&L unit would ever need to write transformed vertices to off-chip memory - instead, it just writes the vertices to a small on-chip buffer, where the renderer part can come and pick them up as needed. Whereas a regular tiler with T&L would need to write all transformed vertices to memory and then read them again later.

If it's writing to a small on-chip buffer then it's more a case of vertex caching then! ;)

a) You would need to read a vertex multiple times whenever a vertex is shared between multiple polygons (although this should be cached for triangle strips and fans) b) or the polygons that it belongs to are split across multiple tiles.

a) you would never need to read multiple times for that reason as you could just load the entire contents of that tile's vertex information into cache anyway.

b) this is not very significant, especially with the sub-pixel polygons you get nowadays, this issue will only ever become less significant too because polygons only keep getting smaller. In older programs it was a problem, because it started increasing the poly load by like 50% which is probably a reason why Talisman was canned all those years ago.


On page 3 of your article you state "Now, at 60 FPS and in a resolution of 1600x1200x32 thats 7GB/s of bandwidth for the framebuffer + RAMDAC". I'm just trying to figure out where you could possibly get the 7 GB/s figure from.

I dunno, I'll look into it.

Dave

arjan de lumens
18-Mar-2002, 16:33
Quote:
Still sounds very little - you can't really stuff enough per-vertex data (x,y,w, 2 rgba colors, 1 or more sets of texture coordinates) into just 12 bytes. I'm still guessing that backface culling is taking place, affecting the numbers.

Well, if you think about it, it would be very easy to have simple compression methods for this data. For instance, you know what tile it is in and each tile is only 32x16 pixels in size. How accurate do you really need to be? I am quite sure the vertices and vertex pointers are stored in a unique fashion (probably a patented fashion).

Umm, the way I understood it, only the vertex pointers are stored on a per-tile basis - the vertices themselves are stored in another buffer with no regard for what tiles they might belong to. You could even have a polygon covering every tile with all its vertices offscreen and thus not belonging to any tile at all. So I don't see how the tile-based nature of the design can help vertex compression any.

OTOH, the 3dmark2001 high-polygon test looks like it uses untextured polygons only, so you will need only 1 rgba color and 0 sets of texture coordinates per vertex, making the 12 bytes per vertex figure sound somewhat less unrealistic.


Quote:
AFAIK, there is no reason why an IMR T&L unit would ever need to write transformed vertices to off-chip memory - instead, it just writes the vertices to a small on-chip buffer, where the renderer part can come and pick them up as needed. Whereas a regular tiler with T&L would need to write all transformed vertices to memory and then read them again later.

If it's writing to a small on-chip buffer then it's more a case of vertex caching then!

No. The T&L unit could well pass the vertices directly to the triangle setup unit, which would in turn (obviously) need to buffer 3 vertices in order to be able to do the triangle setup. The way I see it, a 'cache' is a buffer that holds local copies of pieces of off-chip memory - I don't see how the on-chip buffer qualifies as a cache, since there is never any need to make off-chip copies of the T&Led data.


a) you would never need to read multiple times for that reason as you could just load the entire contents of that tile's vertex information into cache anyway.

Should be true >95% of the time, so it's not really very much of an issue (although caching 1000+ vertices does start to get expensive)


b) this is not very significant, especially with the sub-pixel polygons you get nowadays, this issue will only ever become less significant too because polygons only keep getting smaller. In older programs it was a problem, because it started increasing the poly load by like 50% which is probably a reason why Talisman was canned all those years ago.

OK, let's assume that polygons have an average width and height of 2 pixels - this would give an average polygon size of about 1 pixel (remember, they are triangles, so the area is much less than 2x2 full pixels). Then consider that the Kyro's tile height is 16 pixels. Now, if the average polygon is less than 2 pixels above the lower edge of the tile, it must be processed twice, as it covers 2 different tiles. Now, if all polygons have this average size, you get as a result that most vertices (all vertices, if we have a triangle mesh covering the entire tile) that lie less than 2 full pixels above or below the tile edge belong to at least 1 straddling polygon. 2 pixels above and 2 pixels below gives a span of 4 pixels height per tile where most vertices must be read twice - this would seem to affect nearly 25% of all vertices. This does not take into consideration the left/right tile edges either, so the figure would be closer to ~35% - for polygons barely larger than 1 pixel each.
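
The 25% and ~35% figures can be reproduced with a small Python sketch. It assumes vertices are spread uniformly over the tile, so the fraction of vertices near an edge equals the fraction of tile area near an edge. The function and parameter names are made up for illustration.

```python
def edge_band_fraction(tile_w, tile_h, band, top_bottom_only=False):
    """Fraction of a tile's area lying within `band` pixels of a tile edge.
    With top_bottom_only=True only the upper and lower edges are counted."""
    inner_h = max(tile_h - 2 * band, 0)
    inner_w = tile_w if top_bottom_only else max(tile_w - 2 * band, 0)
    return 1 - (inner_w * inner_h) / (tile_w * tile_h)


# Kyro tile of 32x16 pixels, polygons about 2 pixels wide and high.
print(edge_band_fraction(32, 16, 2, top_bottom_only=True))  # upper/lower edges
print(edge_band_fraction(32, 16, 2))                        # all four edges
```

The first value is exactly one quarter and the second is a little over a third, matching the estimate above under the stated uniformity assumption.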


Quote:
On page 3 of your article you state "Now, at 60 FPS and in a resolution of 1600x1200x32 thats 7GB/s of bandwidth for the framebuffer + RAMDAC". I'm just trying to figure out where you could possibly get the 7 GB/s figure from.

I dunno, I'll look into it.

OK. The result I get is 7 gigabits = ~900 megabytes per second...

Dave B(TotalVR)
19-Mar-2002, 18:22
Umm, the way I understood it, only the vertex pointers are stored on a per-tile basis - the vertices themselves are stored in another buffer with no regard for what tiles they might belong to. You could even have a polygon covering every tile with all its vertices offscreen and thus not belonging to any tile at all. So I don't see how the tile-based nature of the design can help vertex compression any.

Well, exactly how the vertex information is stored is a bit of a mystery. Try as I might I could not get that information from anywhere, only that there are two buffers for geometry information.

No. The T&L unit could well pass the vertices directly to the triangle setup unit, which would in turn (obviously) need to buffer 3 vertices in order to be able to do the triangle setup. The way I see it, a 'cache' is a buffer that holds local copies of pieces of off-chip memory - I don't see how the on-chip buffer qualifies as a cache, since there is never any need to make off-chip copies of the T&Led data.

Well, the vertices will already be stored in local memory as a player model or whatever, and you send them to be transformed, so that's why it would be like a cache...

OK, let's assume that polygons have an average width and height of 2 pixels - this would give an average polygon size of about 1 pixel (remember, they are triangles, so the area is much less than 2x2 full pixels). Then consider that the Kyro's tile height is 16 pixels. Now, if the average polygon is less than 2 pixels above the lower edge of the tile, it must be processed twice, as it covers 2 different tiles. Now, if all polygons have this average size, you get as a result that most vertices (all vertices, if we have a triangle mesh covering the entire tile) that lie less than 2 full pixels above or below the tile edge belong to at least 1 straddling polygon. 2 pixels above and 2 pixels below gives a span of 4 pixels height per tile where most vertices must be read twice - this would seem to affect nearly 25% of all vertices. This does not take into consideration the left/right tile edges either, so the figure would be closer to ~35% - for polygons barely larger than 1 pixel each.

The way I see it, if you have a polygon of the size you described then you have triangles that take up two pixels (2x2 divided by 2 is 2 :). So for each vertex that lies within 2 pixels of a tile edge there is a chance that another vertex from that triangle is in another tile. 33% never are, because 33% of the time the vertex is the left or uppermost vertex of the triangle and hence the rest are in this tile. Another 33% almost certainly are (not definitely) because it is the lowermost vertex of a triangle close to the top of the tile. The remaining 33% are somewhere in between, so it could be roughly guessed that 50% of the vertices in the 2 pixel boundaries of the tile belong to a polygon which spans tiles.

With this in mind, let's say there is 1 vertex per pixel. That gives 512 vertices, of which 88 ((32x4 + (16-4)x4)x50%=88) are read twice; 88/512 is 17%.
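
The same sum as a Python sketch, using the post's assumptions of one vertex per pixel, a 32x16 tile and a 2 pixel band along every edge. The straddle probability is left as a parameter because it is the part in dispute; the names are made up for illustration.

```python
TILE_W, TILE_H, BAND = 32, 16, 2


def read_twice_fraction(straddle_probability):
    """Fraction of a tile's vertices read twice, given the chance that a
    vertex inside the edge band belongs to a tile-straddling polygon."""
    vertices_per_tile = TILE_W * TILE_H                # 1 vertex per pixel
    top_and_bottom = TILE_W * 2 * BAND                 # 32 x 4
    left_and_right = (TILE_H - 2 * BAND) * 2 * BAND    # (16 - 4) x 4
    band_vertices = top_and_bottom + left_and_right
    return band_vertices * straddle_probability / vertices_per_tile


print(read_twice_fraction(0.5))   # the 50% guess used in this post
print(read_twice_fraction(1.0))   # every band vertex straddles
```

At 50% this gives about 17%, and at 100% it gives the roughly 35% figure quoted earlier, so the two estimates differ only in the assumed straddle probability.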

OK, some vertices in the corners may need to be read 3 or 4 times, but that is going to be an infinitesimal percentage.

OK. The result I get is 7 gigabits = ~900 megabytes per second...

Yes, that's what I keep getting. Strange, as I double checked all my calcs; maybe I left the whole thing in bits....

Dave

arjan de lumens
19-Mar-2002, 21:39
Well, the vertices will already be stored in local memory as a player model or whatever, and you send them to be transformed, so that's why it would be like a cache...

Ummm ... that sounds more like using the onboard (offchip) memory as a cache for (untransformed) vertex arrays from AGP memory, which would be a rather different issue..? This would be the obvious way to avoid the AGP bandwidth bottleneck for a T&L unit regardless of whether it works for an IMR or a tiler.


The way I see it, if you have a polygon of the size you described then you have triangles that take up two pixels (2x2 divided by 2 is 2)

It's not quite that simple. For a triangle with length and height 2 pixels, an area of 2 pixels is just the maximum possible area. Consider e.g. a triangle with its vertices at pixel XY coordinates (0,0), (2,2) and (0.99, 1.0). Do the math, and you will find that this triangle has a length of 2 pixels, a height of 2 pixels and a total area of 0.01 pixels. So obviously the average area of a bunch of 2x2 polygons would be less than 2 pixels - I'd estimate about 1 pixel for arbitrary 2x2 polygons.
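
The area claim is easy to check with the standard cross-product (shoelace) formula for a triangle. A short Python sketch, with made-up helper names:

```python
def triangle_area(a, b, c):
    """Area of a triangle from its three (x, y) vertices."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    return abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2


def bounding_size(a, b, c):
    """Width and height of the triangle's axis-aligned bounding box."""
    xs, ys = (a[0], b[0], c[0]), (a[1], b[1], c[1])
    return max(xs) - min(xs), max(ys) - min(ys)


sliver = ((0, 0), (2, 2), (0.99, 1.0))   # the example from this post
right_angle = ((0, 0), (2, 0), (0, 2))   # half of a 2x2 square

for tri in (sliver, right_angle):
    print(bounding_size(*tri), triangle_area(*tri))
```

Both triangles are 2 pixels wide and 2 pixels high, but the sliver covers about 0.01 pixels while the right-angled one covers the full 2, which is the maximum for that bounding box.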


So for each vertex that lies within 2 pixels of a tile edge there is a chance that another vertex from that triangle is in another tile. 33% never are, because 33% of the time the vertex is the left or uppermost vertex of the triangle and hence the rest are in this tile. Another 33% almost certainly are (not definitely) because it is the lowermost vertex of a triangle close to the top of the tile. The remaining 33% are somewhere in between, so it could be roughly guessed that 50% of the vertices in the 2 pixel boundaries of the tile belong to a polygon which spans tiles.

Umm, what about vertex sharing between polygons? A vertex that forms e.g. the topmost vertex of one triangle would frequently be the middle or bottom vertex of another triangle (or even multiple other triangles) - in this case, both triangles must be fully within the tile in order to keep us from having to read the vertex twice. This would in particular be true for triangle meshes. So for the mesh of 2x2 triangles covering entire tiles, I'd still say that you get that nearly 100% of all vertices less than 2 pixels from any tile edges will be read twice, which amounts to a little less than 35% of all vertices for the 32x16 tile size.