New 3DLabs card announced... (2)

pascal

Veteran
OK, I am in fact continuing this thread
http://64.246.22.60/~admin61/forum/viewtopic.php?t=862&start=0
The problem is that the forum database is not updating when we have a new post on this thread. Quoting myself:

A few things:
- The 256-bit bus will help with high sustained fillrate in next-generation games like Doom3. IIRC the GF3 was only capable of about 30 fps (no aniso, compressed textures); my guess is that means only about 600 MTexels/s sustained. This new 3DLabs chip has the potential to be three times as fast.
- This new chip doesn't need a 4-way crossbar because it uses an 8x8 tile, increasing the 2D spatial locality hit rate. There is no need for 64-bit access to the framebuffer. With clever cache design a 2-way crossbar could be enough (see the sketch below).
- It may also be used as an advanced image processor (it has a digital input).

It is like life: every day something new happens, and this is good news. This card has special potential because of the many professional applications it may have: CAD, architecture, medicine, etc.
Kudos to 3DLabs for their hard work.
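
To make the locality point concrete, here is a minimal sketch of how an 8x8-tiled framebuffer address could be computed. This is just my guess at a plausible layout, not 3DLabs' actual one.
Code:
#include <cstdint>
#include <cstdio>

// Hypothetical 8x8-tiled framebuffer addressing (not 3DLabs' real layout).
// Pixels inside one 8x8 tile are stored contiguously, so a burst that
// fetches a tile row (or the whole tile) hits many nearby pixels at once.
uint32_t tiled_address(uint32_t x, uint32_t y, uint32_t width_in_tiles)
{
    const uint32_t tile_x = x >> 3;           // which tile column
    const uint32_t tile_y = y >> 3;           // which tile row
    const uint32_t in_x   = x & 7;            // position inside the tile
    const uint32_t in_y   = y & 7;
    const uint32_t tile_index = tile_y * width_in_tiles + tile_x;
    return tile_index * 64 + in_y * 8 + in_x; // 64 pixels per tile
}

int main()
{
    // Two horizontally adjacent pixels land in the same 64-pixel tile,
    // so they share the same memory page/burst.
    printf("%u %u\n", tiled_address(10, 5, 80), tiled_address(11, 5, 80));
    return 0;
}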
 
I think a segmented 4-way memory bus can still provide better efficiency. Although pixels are ordered in 8x8 tiles, texture access can be chaotic in some situations, especially with anisotropic filtering enabled or heavy dependent texture reads.

Of course, I am still looking forward to this card. I waited for Permedia 3 for more than a year, and finally turned to a TNT2. I am very glad to see 3Dlabs come back :D
 
64-bit access was not always used; IIRC even NVIDIA used 256-bit texture accesses on the GF3 (info from nAo). But the smaller the granularity, the more efficient the access.
 
How much of a performance increase do you think one would get from going to a 32-bit granular interface with 8 crossbar memory controllers?
 
There are two things to notice about this issue: the footprint of a triangle (or part of a triangle) on the texture, and the way textures are stored.

When anisotropic filtering is used, the shape of the footprint can be very strange. Since textures are normally swizzled, near-rectangular accesses give the best results. For bilinear/trilinear filtering with mipmaps, a texture cache can give very good efficiency even with a single memory bus. However, the same cannot be said for high-degree anisotropic filtering or dependent texture reads.
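
As a concrete (and simplified) illustration of the swizzling I mean, here is a Morton/Z-order sketch. Real chips' layouts differ; this is only to show why a compact footprint caches well while a long anisotropic footprint does not.
Code:
#include <cstdint>
#include <cstdio>

// Interleave the bits of x and y (Morton / Z-order). Texels that are close
// in 2D end up close in memory, which is why a near-rectangular footprint
// (bilinear/trilinear) caches so well, while a long skewed anisotropic
// footprint can touch many separate cache lines.
uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t result = 0;
    for (int bit = 0; bit < 16; ++bit) {
        result |= ((x >> bit) & 1u) << (2 * bit);
        result |= ((y >> bit) & 1u) << (2 * bit + 1);
    }
    return result;
}

int main()
{
    // A 2x2 bilinear footprint at (4,4): four addresses, all within a span of 4.
    printf("%u %u %u %u\n",
           morton2d(4, 4), morton2d(5, 4), morton2d(4, 5), morton2d(5, 5));
    return 0;
}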

I expect that a single 256-bit memory bus is very good for running at high resolutions or with multisample FSAA. With good texture cache control it can still be quite good at dependent texture reads. However, I think a segmented bus can still be beneficial.

For some performance figures, I think a pure texture benchmark comparing the GF2 and GF3 may show some hints: a texture benchmark on a steep slope, to mimic the effect of wall and ground textures, with anisotropic filtering enabled.
 
I expect that a single 256-bit memory bus is very good for running at high resolutions or with multisample FSAA. With good texture cache control it can still be quite good at dependent texture reads. However, I think a segmented bus can still be beneficial.

For the framebuffer, with 8x8 tiles the improvement is probably none.

So what you mean is that smaller granularity is good for textures, and I agree. The question is how much? Any guess?
What about vertex data?

How much of a performance increase do you think one would get from going to a 32-bit granular interface with 8 crossbar memory controllers?
Without tiles the increase should probably be relatively high, my guess is 10% or 20% compared to a 2-way crossbar.
 
8x8 tiles can improve locality, but I don't think it is that much. Note that the GeForce already renders in 2x2 tiles. However, you can still observe some improvement going from a single 128-bit memory bus to a 4x32-bit one. A larger tile can also result in more waste for smaller triangles.

Vertex data are generally streamed and don't need fine granularity. For example, GeForces like 32-byte vertices, followed by 24-byte and 64-byte ones. Therefore, fine granularity is not too useful for vertex data. On the other hand, vertex data are not the most bandwidth-demanding data.
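
Just to put some back-of-the-envelope numbers on the granularity question (the "two useful texels per access" figure below is purely an assumption for illustration, not a measurement):
Code:
#include <cstdio>

// Back-of-the-envelope sketch: if a scattered texture footprint only uses
// 'used_texels_per_access' texels out of every memory burst, how much raw
// bandwidth is wasted for a given access granularity? Pure illustration;
// the real numbers depend on caches, swizzling and the actual footprints.
int main()
{
    const int    texel_bits = 32;
    const int    granularities[] = { 256, 128, 64, 32 }; // bits per access
    const double used_texels_per_access = 2.0;           // assumed, worst-ish case

    for (int g : granularities) {
        double fetched = g / double(texel_bits);
        double waste = 1.0 - used_texels_per_access / fetched;
        if (waste < 0.0) waste = 0.0;
        printf("%3d-bit access: %.0f%% of fetched texels wasted\n", g, waste * 100.0);
    }
    return 0;
}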
 
8x8 tiles can improve locality, but I don't think it is that much. Note that the GeForce already renders in 2x2 tiles. However, you can still observe some improvement going from a single 128-bit memory bus to a 4x32-bit one. A larger tile can also result in more waste for smaller triangles.
Do you mean a 2x2 framebuffer tile?

If I understand correctly, 3DLabs uses an 8x8 framebuffer tile (I suppose it is an ordered square tile). You only render the triangles that fall inside this tile, so the framebuffer traffic will probably not improve with smaller framebuffer granularity: just write back the entire tile when you finish it. But I agree that texture traffic could be improved with smaller texture granularity. See this link: http://www.anandtech.com/video/showdoc.html?i=1614&p=6

Maybe I did not understand it correctly.
 
I believe that kind of tiling is done by any modern 3D chip, and most not-so-modern ones too. There was actually some talk about an nvidia patent regarding that here some time ago, and I believe the tile size mentioned was 8x8. It's just something they don't make any fuss about. But I don't think any chip (including the P10) must read/write the whole tile; that would be too much of a waste. So even if the P10 works in tiles, it should be able to access fractions of a tile efficiently.
 
Every single pixel will need to be written at least once for every frame.

But I agree that reading and writing a whole tile in a single pass will keep the rendering pipeline idle for some time. An 8x8 tile at 32 bits per pixel is 2048 bits, so if it reads a row at a time (256 bits) it will need up to 8 reads or 8 writes.

Question: are we talking about the same tiling technique? Or are you talking about a cache block with a tile format? Cache blocks will be read from and written to main memory many times during a single frame; real tiles will be written only once to main memory.

If you are talking about a cache block with a tile format then you are right.
 
Well, here's where I am confused. On the block diagram, right after the "visibility" stage, notice there is a "Store" stage. The only other "Store" stage is at the end of the pixel processor. What I want to know is: is this a store to main memory? If so, what is being stored?
Is it just visibility information for the tile (z-values, zmin, zmax)?
Or is something like information on which primitives touch the tile (allowing for partial or full deferred rendering)?

I know some sites have said it is not a deferred architecture, so what exactly do people think is going on in that stage?

Regards,
Serge
 
I assumed rendering by tiles in this case meant triangles are processed a tile at a time. Small triangles only require one tile, but large triangles are broken up into these 8x8-pixel tiles. Only visible pixels in these tiles would actually be changed; unchanged pixels may or may not need to be written back to memory. By doing a tile at a time you can guarantee an efficient cache if it is organized the same way (see the toy loop at the end of this post).

I got confused by the previous posts, so I'm not sure if someone else already described it this way.
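
Something like this toy loop is what I have in mind; it's my own sketch, not what the hardware actually does:
Code:
#include <algorithm>
#include <cstdio>

// Sketch of "process a triangle one 8x8 tile at a time". The triangle's
// bounding box is walked in tile steps; small triangles touch one tile,
// large ones are split across many.
struct Tri { float x[3], y[3]; };

void rasterize_by_tiles(const Tri& t)
{
    const int kTile = 8;
    int minx = (int)std::min({t.x[0], t.x[1], t.x[2]});
    int maxx = (int)std::max({t.x[0], t.x[1], t.x[2]});
    int miny = (int)std::min({t.y[0], t.y[1], t.y[2]});
    int maxy = (int)std::max({t.y[0], t.y[1], t.y[2]});

    for (int ty = miny / kTile; ty <= maxy / kTile; ++ty)
        for (int tx = minx / kTile; tx <= maxx / kTile; ++tx) {
            // Here the hardware would test/shade only the covered pixels
            // of tile (tx,ty); unchanged pixels need not be written back.
            printf("triangle touches tile (%d,%d)\n", tx, ty);
        }
}

int main()
{
    Tri small = {{1, 5, 3}, {1, 2, 6}};      // fits in one tile
    Tri large = {{0, 30, 10}, {0, 4, 25}};   // spans several tiles
    rasterize_by_tiles(small);
    rasterize_by_tiles(large);
    return 0;
}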
 
I assume the same thing.
For every frame:
- First the triangles are sorted by tile (tile processor)
- Then each tile is processed for occlusion culling and pixel rendering

I am confused too.
 
pascal said:
- First the triangles are sorted by tile (tile processor)
- Then each tile is processed for occlusion culling and pixel rendering

I am confused too.
What you describe is something like a deferred renderer. It seems this is not the case.
I don't know what they are doing. Probably they just subdivide a primitive and process it per tile in a small on-chip memory that caches the frame and Z buffers. In this way the hardware may rasterize a group of primitives on chip and then write the tile off chip when it has to flush the corresponding cache tile. Even so, it could be self-defeating, wasting memory bandwidth reading and writing on-chip cache tiles in situations where they are poorly used.
A primitive assembly engine that can analyze groups of primitives and store useful info about them (frame/Z-buffer coverage) would be a nice option in this case... IMHO

ciao,
Marco
 
pascal:
..., real tiles will be written only once to main memory.
You're implying that they have an on-chip memory that can hold all those "real" tiles. There's nothing that supports that idea.

The tiles I was talking about were simpler: render a tile and, when it's finished, store it (or the parts that need to be stored).

The best you could hope for is actually a tile cache. I don't know if cards today use a frame buffer tile cache; I remember some talk about ATI doing it a long time ago (and if they did it then, they're likely still doing it). That way you could delay tile writes and hope that some other nearby triangle will fill in the empty parts of the tile. With small triangles that's quite possible, since they often come from strips (good locality along a strip). And to make the most out of the vertex cache, the strips are likely not too long but arranged next to each other. That gives good locality between strips too.
So with this trick, most tiles could be written out completely without wasting bandwidth.
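
A minimal sketch of that kind of write-back tile cache (the names, sizes and the single-tile simplification are all my own assumptions):
Code:
#include <cstdint>
#include <cstdio>

// Hypothetical write-back cache of one 8x8 framebuffer tile.
// Real hardware would keep several tiles and finer dirty tracking;
// this just shows the "delay the write and hope neighbours fill it" idea.
struct TileCache {
    static const int kTile = 8;
    uint32_t pixels[kTile * kTile];
    int tile_x = -1, tile_y = -1;   // which tile is currently cached
    uint64_t dirty_mask = 0;        // one bit per pixel written

    void write_pixel(int x, int y, uint32_t color) {
        int tx = x / kTile, ty = y / kTile;
        if (tx != tile_x || ty != tile_y) {
            flush();                 // evict the old tile before switching
            tile_x = tx; tile_y = ty;
        }
        int idx = (y % kTile) * kTile + (x % kTile);
        pixels[idx] = color;
        dirty_mask |= 1ull << idx;
    }

    void flush() {
        if (dirty_mask == 0) return;
        int dirty = 0;
        for (int i = 0; i < kTile * kTile; ++i)
            if (dirty_mask & (1ull << i)) ++dirty;
        // A full tile can go out as a few large bursts; a sparse tile
        // wastes part of the burst (or needs finer-grained writes).
        printf("flush tile (%d,%d): %d of 64 pixels dirty\n", tile_x, tile_y, dirty);
        dirty_mask = 0;
    }
};

int main() {
    TileCache cache;
    for (int x = 0; x < 8; ++x) cache.write_pixel(x, 3, 0xffffffffu); // one span
    cache.write_pixel(9, 3, 0xff0000ffu);  // next tile forces a flush
    cache.flush();
    return 0;
}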

Another benefit of a frame buffer tile cache is that you could split the pixel program into pieces and run each piece for all pixels in the cache, kind of remaking the pixel program into an internal multi-pass algorithm. That way you could reduce the hit from page-break thrashing when multitexturing, since each internal pass could be made with just one or two textures.
This might however not work so well for the P10 if the pixel processors can do things like dynamic branches and loops, since different pixels might then need access to different textures.
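
A rough sketch of that internal multi-pass idea (again just my interpretation; the texture-fetch stubs are stand-ins, not any real API):
Code:
// Two loop orders over a cached tile of pixels. The "internal multi-pass"
// order touches texture A for all pixels, then texture B, which reduces
// page-break thrashing between the two textures. Purely illustrative.
const int kPixelsPerTile = 64;

struct Pixel { float colorA, colorB, result; };

float sampleTextureA(int p) { return p * 0.01f; }        // stub fetch
float sampleTextureB(int p) { return 1.0f - p * 0.01f; } // stub fetch

// Straight per-pixel order: A and B fetches interleave for every pixel.
void shade_per_pixel(Pixel* tile) {
    for (int p = 0; p < kPixelsPerTile; ++p) {
        tile[p].colorA = sampleTextureA(p);
        tile[p].colorB = sampleTextureB(p);
        tile[p].result = tile[p].colorA * tile[p].colorB;
    }
}

// Internal multi-pass order: one texture per pass over the whole tile.
void shade_multipass(Pixel* tile) {
    for (int p = 0; p < kPixelsPerTile; ++p) tile[p].colorA = sampleTextureA(p);
    for (int p = 0; p < kPixelsPerTile; ++p) tile[p].colorB = sampleTextureB(p);
    for (int p = 0; p < kPixelsPerTile; ++p) tile[p].result = tile[p].colorA * tile[p].colorB;
}

int main() {
    Pixel tile[kPixelsPerTile];
    shade_per_pixel(tile);
    shade_multipass(tile);
    return 0;
}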

But there's nothing in the info I've seen that hints that the P10 has a frame buffer tile cache. In fact, there are hints that say the opposite. http://www.anandtech.com/video/show...ech.com/reviews/video/3dlabs/p10/pipeline.gif shows the tile processing as one step between T&L and the visibility checks. If it were a cache, it should also be mentioned at the end of the pipe.


psurge:
Yep, the load=>visibility=>store pipe gives a strong hint that it's just stencil- and z-store.

3dcgi:
I agree that that's a likely way. I hope you didn't get confused by my post.
 
There were some new posts while I was writing the last one, so here's some more:

pascal:
Now I see how you're thinking. It has been explicitly said that it's not a deferred renderer. Note the difference:
Code:
// Deferred, like PowerVR
for( all tiles )
{
  for( all polys in tile )
  {
    for( all pixels in poly )
    {
      if( pixel visible )
      {
        store z 
        store poly reference
      }
    }
  }
  for( all pixels in tile )   // shade using the stored poly references
  {
    calc and store pixel color
  }

  Store tile to main memory.
}

// IMR that renders in tiles
for( all polys )
{
  for( all tiles poly hits )
  {
    for( all pixels in tile and poly )
    {
      if( pixel visible )
      {
        store z
        calc and store pixel color
      }
    }
  }
  Store changed pixels in tile to main memory.
}
I didn't say anything here about what happens if you're doing transparencies, or if you have a tile cache.

Note the big difference though:
Deferred: for all tiles=>polys=>pixels
Immediate: for all polys=>tiles=>pixels


nAo:
Agree
 
Thanks basic :)

Now I understand.

I was thinking of something like:
Code:
for( all tiles ) 
{ 
  for( all polys in tile ) 
  { 
    for( all pixels in poly ) 
    { 
      if( pixel visible ) 
      { 
        store z 
        calc and store pixel color 
      } 
    } 
  }
}
It uses tiles but doesn't do deferred rendering. It could save bandwidth, but it still spends fillrate on hidden pixels.
 
pascal,

Such a system requires storing all polygons in the scene and binning them into tiles. Furthermore, it will have a hard time handling frame buffer copies, just like normal tilers. And what it saves is only frame buffer bandwidth, nothing in texture bandwidth. Not very economical IMHO.
 
I agree, PCChen.
That's why I said that now I understand.

So the tiling 3DLabs is talking about is just the way it renders the polygons and maximizes cache hits. Then I agree that a smaller-granularity crossbar will help.
 