X1800/7800GT AA comparisons

Kombatant said:
If you notice the GPGPU people in the right column of the site, you will see that some of them are students doing their PhD thesis :)

Ack, I hadn't had time yet to visit the site! :oops:

Some of the research papers on that site look positively delicious! :devilish: The real-time global illumination work looks especially interesting. I need to get my home computer set up again so I can start working on building a raytracer again. I was going to try to implement KD-Trees, but it sounds like, at least as far as GPUs are concerned, the bounding volume hierarchy traversal technique is better...

So much to read, so little time!!! UGGH!!!! I need those 256 hour days Rys was talking about...

Nite_Hawk
 
RoOoBo said:
Then you could assign the ROP in a pipeline to an MC, if you had enough of them, providing something like semi-dedicated bw per pipeline, which may or may not be a good idea. Each pipeline's ROP would be accessing its own separate portion of memory, and mostly (at least as far as the ROPs themselves are concerned) having its own separate share of bw, reducing conflicts and similar stuff.
The primary reason why I suspect this kind of organisation is not implemented in R520 is textures. Since MC channels enforce a tiling scheme in memory, and since textures will be spread randomly across tiles, a large portion of all memory accesses will not be channel-coherent.

Indeed, memory accesses are best spread across as many channels as possible for texture reads, as far as I can tell.

Additionally, R520's Ring Bus is designed explicitly to support a many-to-many relationship between memory clients and memory tiles.

R520's 32-bit channels make the minimum memory access half the size of that on previous GPUs. But so far we have no clear description of the specific benefits that affords. Particularly as there is a general push (seemingly) within R520 to "bulk up" memory accesses to make the best use of the latency-tolerant clients (texture pipes, RBEs, Hierarchical-Z ...). There must be certain kinds of tasks that thrive on these very small memory accesses, but what are they :?:

At the heart of that question is how memory-tiling is utilised. If you take the 8 bytes of a typical pixel (4 bytes colour, 3 bytes Z, 1 byte stencil) it looks like it's possible to tile pixels "one byte" at a time - so a single quad of pixels is spread across all 8 memory channels, 4 bytes at a time per memory tile. 8 separate tiles all together.
  • Tile 1 - one byte of red for four pixels
  • Tile 2 - one byte of green for four pixels
  • ...
  • Tile 5 - byte 1 (out of 3) of Z for four pixels
  • ...
  • Tile 8 - stencil byte for four pixels
But that's just a guess (though a recent ATI patent suggests something pretty similar as far as I can tell)... (How does memory tiling work with AA? And FP16 HDR?...)
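To make the guess concrete, here's a small sketch (the layout and numbers are entirely hypothetical, just restating the scheme above as code) of how one quad's 32 bytes would land, one byte-plane per channel:

Code:
# Entirely hypothetical sketch of the "one byte-plane per channel" guess above.
# A quad = 4 pixels, each 8 bytes: R, G, B, A, Z0, Z1, Z2, stencil.
# Byte-plane p of all four pixels goes to channel p, 4 bytes per channel.

PLANES = ["R", "G", "B", "A", "Z0", "Z1", "Z2", "S"]   # 8 byte-planes

def quad_to_channels(pixels):
    """pixels: list of 4 dicts, each mapping a plane name to one byte."""
    return {ch: bytes(px[PLANES[ch]] for px in pixels) for ch in range(8)}

# Example: four identical pixels, Z = 0x123456, stencil = 0
quad = [dict(R=0xFF, G=0x10, B=0x10, A=0xFF, Z0=0x12, Z1=0x34, Z2=0x56, S=0x00)] * 4
for ch, data in quad_to_channels(quad).items():
    print(f"channel {ch} ({PLANES[ch]:>2}): {data.hex()}")   # 4 bytes per channel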

Jawed
 
RoOoBo said:
In my opinion (and I don't really know much about rasterization, and I had more than enough pain implementing it in the simulator) only Triangle Setup, if anything, may be shared by all the quad pipes in ATI designs. As they have explained a number of times, they bin triangles to each pipe based on their tile distribution algorithm and then they seem to implement a fragment generator (I wonder if it would be better named a tile generator; what do they send to the HZ test, fragments or a single tile that may later be further divided into quads?) per pipeline. So they have N rasterizers, each 'walking' its own assigned triangles.
I agree with all that, rasterisation/interpolation can be multi-threaded, following directly on from set-up.

The only demerit I can think of is that texture data needs to be duplicated in texture caches multiple times when the same textures are being used across multiple screen-tiles, simultaneously. I expect that happens quite a lot.

I expect there's a protocol in the MC to handle that case, so that texture reads can be aggregated if multiple pixel units (quads) request the same textures (presumably within a short timespan).
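Just as a thought experiment (none of this reflects any known ATI mechanism, and all the names are made up), such aggregation could be as simple as a table of in-flight reads that later requests for the same memory line attach to:

Code:
# Hypothetical sketch of coalescing duplicate texture-line reads issued by
# different quads; invented data structures, not any real MC protocol.

class ReadCoalescer:
    def __init__(self):
        self.in_flight = {}            # line address -> list of waiting quads

    def request(self, quad_id, line_addr):
        if line_addr in self.in_flight:
            # another quad already asked for this line: piggy-back on it
            self.in_flight[line_addr].append(quad_id)
            return "merged"
        self.in_flight[line_addr] = [quad_id]
        return "issued to DRAM"

    def complete(self, line_addr, data):
        # one DRAM burst satisfies every quad that was waiting on the line
        waiters = self.in_flight.pop(line_addr)
        return {q: data for q in waiters}

mc = ReadCoalescer()
print(mc.request(quad_id=0, line_addr=0x4000))   # issued to DRAM
print(mc.request(quad_id=3, line_addr=0x4000))   # merged
print(mc.complete(0x4000, data=b"texels"))       # both quads get the line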

Jawed
 
Jawed said:
The only demerit I can think of is that texture data needs to be duplicated in texture caches multiple times when the same textures are being used across multiple screen-tiles, simultaneously. I expect that happens quite a lot.

I expect there's a protocol in the MC to handle that case, so that texture reads can be aggregated if multiple pixel units (quads) request the same textures (presumably within a short timespan).
That's what the L2 cache does, on NV4x.
 
Bob said:
That's what the L2 cache does, on NV4x.
And what's interesting is that ATI haven't bothered with an L1/L2 design since R300, so there must be some secret sauce we're missing.

Jawed
 
Jawed said:
The primary reason why I suspect this kind of organisation is not implemented in R520 is textures. Since MC channels enforce a tiling scheme in memory, and since textures will be spread randomly across tiles, a large portion of all memory accesses will not be channel-coherent.
ROPs being tied to MCs doesn't mean TMUs are. I'm not sure which disadvantage you see here.

There must be certain kinds of tasks that thrive on these very small memory accesses, but what are they :?:
Compressed texture tiles (64 bits per 4x4 tile for DXT1 and 3Dc).
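The DXT1 arithmetic, for reference (standard S3TC figures, nothing ATI-specific):

Code:
# DXT1 block-size arithmetic.
texels_per_block = 4 * 4
bits_per_block   = 2 * 16 + 16 * 2         # two RGB565 endpoints + a 2-bit index per texel
print(bits_per_block)                      # 64 bits = 8 bytes per 4x4 block
print(bits_per_block / texels_per_block)   # 4 bits per texel

# On a 32-bit channel that's one burst-of-2 access; on a wider channel the
# same block would fall below the minimum access size.
print(bits_per_block / 32)                 # 2.0 channel words per block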

At the heart of that question is how memory-tiling is utilised. If you take the 8 bytes of a typical pixel (4 bytes colour, 3 bytes Z, 1 byte stencil) it looks like it's possible to tile pixels "one byte" at a time - so a single quad of pixels is spread across all 8 memory channels, 4 bytes at a time per memory tile. 8 separate tiles all together.
  • Tile 1 - one byte of red for four pixels
  • Tile 2 - one byte of green for four pixels
  • ...
  • Tile 5 - byte 1 (out of 3) of Z for four pixels
  • ...
  • Tile 8 - stencil byte for four pixels
But that's just a guess (though a recent ATI patent suggests something pretty similar as far as I can tell)... (How does memory tiling work with AA? And FP16 HDR?...)

Jawed
Then why multiple channels at all? I think this distribution would be a very bad idea: it increases granularity, and every framebuffer access means switching pages on every memory channel.
 
Jawed said:
At the heart of that question is how memory-tiling is utilised. If you take the 8 bytes of a typical pixel (4 bytes colour, 3 bytes Z, 1 byte stencil) it looks like it's possible to tile pixels "one byte" at a time - so a single quad of pixels is spread across all 8 memory channels, 4 bytes at a time per memory tile. 8 separate tiles all together.
  • Tile 1 - one byte of red for four pixels
  • Tile 2 - one byte of green for four pixels
  • ...
  • Tile 5 - byte 1 (out of 3) of Z for four pixels
  • ...
  • Tile 8 - stencil byte for four pixels
But that's just a guess (though a recent ATI patent suggests something pretty similar as far as I can tell)... (How does memory tiling work with AA? And FP16 HDR?...)

Jawed

Jawed,

Forgive me as I'm not as well versed in this as you. Any mistakes I make are quite unintentional. ;)

Well, the arrangement you describe seems like it would work very well for 8-byte pixels, as everything lines up nicely. For something like FP16 pixels (same amount of z/stencil data?), it seems the downside would be that your tiles would no longer align: all of a sudden you'd have partially empty channels with z/stencil information because you are waiting on RGB data.

So perhaps this kind of arrangement doesn't make sense for FP16 pixels. Perhaps instead you'd want a packed format, where you are breaking up your pixels first into 4-byte boundaries to send via separate channels, and then into 32-byte boundaries. So 8 pixels in 3 transfers rather than 4 pixels in one...
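To put rough numbers on that (all my own guesses: 8 bytes of FP16 colour plus 4 bytes of z/stencil per pixel, and one transfer = 8 channels x 4 bytes = 32 bytes):

Code:
# Rough arithmetic behind "8 pixels in 3 transfers rather than 4 in one".
transfer_bytes = 8 * 4                  # 32 bytes moved per transfer (guess)

rgba8_pixel = 4 + 3 + 1                 # colour + Z + stencil, as in the earlier 8-byte example
print(transfer_bytes / rgba8_pixel)     # 4.0 pixels per transfer

fp16_pixel = 8 + 4                      # 8 bytes FP16 colour + 4 bytes z/stencil (guess)
print(8 * fp16_pixel / transfer_bytes)  # 3.0 transfers for a group of 8 pixels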

Nite_Hawk
 
Xmas said:
ROPs being tied to MCs doesn't mean TMUs are. I'm not sure which disadvantage you see here.
No, I was only saying that texture reads, being a large proportion of all memory accesses, wouldn't benefit from a design where all textures "fit" into a single tile. They can't, anyway, obviously.

The consensus, as far as I can tell, is that the best performance with large textures comes from spreading them across all memory tiles fairly evenly.

Compressed texture tiles (64 bits per 4x4 tile for DXT1 and 3Dc).
Sadly I've got no idea what proportion of a game's textures (or workload, if you prefer) would consist of such small textures :oops:

Then why multiple channels at all? I think this distribution would be a very bad idea: it increases granularity, and every framebuffer access means switching pages on every memory channel.
That's what's puzzling me - it looks nice to start with, but is such fine granularity in the back-buffer ever desirable? I don't think so either... It contradicts my earlier points about the CBC being used to deal with large areas of pixels instead of the RBE working on quads.

What are the chances we'll find out from ATI how render targets are packed into memory tiles?...

Jawed
 
Nite_Hawk said:
Forgive me as I'm not as well versed in this as you. Any mistakes I make are quite unintentional. ;)

Well, the arrangement you describe seems like it would work very well for 8-byte pixels, as everything lines up nicely. For something like FP16 pixels (same amount of z/stencil data?), it seems the downside would be that your tiles would no longer align: all of a sudden you'd have partially empty channels with z/stencil information because you are waiting on RGB data.

So perhaps this kind of arrangement doesn't make sense for FP16 pixels. Perhaps instead you'd want a packed format, where you are breaking up your pixels first into 4-byte boundaries to send via separate channels, and then into 32-byte boundaries. So 8 pixels in 3 transfers rather than 4 pixels in one...
It was a "loaded-guess" :devilish: designed to prompt a bit of discussion.

The problem we've got is we're way out in the dark. Back-buffer memory-tiling is highly relevant to ATI's OGL AA performance gains, but I think we're out of luck as far as definitive answers are concerned.

But yes, you're right, FP16 isn't a neat fit.

And AA looks awkward, too, since AA comes in 2xAA lumps (each AA sample is 8 bytes, so that's 16 bytes per lump), which means 2xAA and 6xAA get messy.

One solution to the mess, naturally, is an asymmetric packing - where 8 or 16 or more pixels in a block solve the "page" problem for non-RGBA8 back-buffers. In which case CBC comes into its own. Erm... :p
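Putting numbers on the messiness (assuming 8 bytes per AA sample and a 32-byte transfer, as guessed earlier):

Code:
# Bytes per pixel for various AA levels, assuming 8 bytes per sample and a
# 32-byte transfer (8 channels x 4 bytes) - both figures are guesses from
# earlier in the thread.
bytes_per_sample = 8
transfer_bytes   = 32

for samples in (1, 2, 4, 6):
    total = samples * bytes_per_sample
    print(f"{samples}xAA: {total} bytes/pixel -> {total / transfer_bytes} transfers")
# 2xAA (16 bytes) and 6xAA (48 bytes) don't divide evenly into 32-byte
# transfers; 4xAA (32 bytes) does, which is why 2x and 6x look messy.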

Jawed
 
Jawed said:
And AA looks awkward, too, since AA comes in 2xAA lumps (each AA sample is 8 bytes, so that's 16 bytes per lump), which means 2xAA and 6xAA get messy.

One solution to the mess, naturally, is an asymmetric packing - where 8 or 16 or more pixels in a block solve the "page" problem for non-RGBA8 back-buffers. In which case CBC comes into its own. Erm... :p

This may be related to why there is no difference at 2x and 6xAA in ATI's OGL-AA tweak:

http://www.guru3d.com/news.html#3182

:devilish:

Jawed
 
Jawed said:
No, I was only saying that texture reads, being a large proportion of all memory accesses, wouldn't benefit from a design where all textures "fit" into a single tile. They can't, anyway, obviously.

The consensus, as far as I can tell, is that the best performance with large textures comes from spreading them across all memory tiles fairly evenly.
I guess the optimal distribution of textures across channels also depends on how multiple textures of different size are used together in shaders.

But how is texturing related to having ROPs tied to MCs? And I'm not sure about what you consider a "memory tile".

Sadly I've got no idea what proportion of a game's textures (or workload, if you prefer) would consist of such small textures :oops:
Small textures? DXT1/3Dc compressed textures consist of 4x4 texel tiles that are encoded in a 64 bit block. That doesn't mean the textures are small.

That's what's puzzling me - it looks nice to start with, but is such fine granularity in the back-buffer ever desirable? I don't think so either... It contradicts my earlier points about the CBC being used to deal with large areas of pixels instead of the RBE working on quads.

What are the chances we'll find out from ATI how render targets are packed into memory tiles?...
ATI claims a best-case 24:1 compression ratio for Z data (with 6xAA enabled, so this means 8 bit per pixel). This compression is block- or tile-based. With tiles of 4x4 pixels, we get 128 bit per Z-tile, which is 32 bits times a burst length of 4.
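The arithmetic behind those figures (assuming 32 bits of Z per sample):

Code:
# Best-case Z compression arithmetic, assuming 32 bits of Z per sample.
z_bits_per_sample = 32
samples           = 6                                # 6xAA
uncompressed      = z_bits_per_sample * samples      # 192 bits per pixel
compressed        = uncompressed / 24                # 24:1 -> 8 bits per pixel
print(compressed)                                    # 8.0

tile_pixels = 4 * 4
tile_bits   = compressed * tile_pixels               # bits per 4x4 Z-tile
print(tile_bits, tile_bits / 32)                     # 128.0 bits = 32-bit channel x burst of 4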
 
Jawed said:
And what's interesting is that ATI haven't bothered with an L1/L2 design since R300, so there must be some secret sauce we're missing.
There's no point. With NV4x/G70's dispatch the quads are likely to be working on regions very close to one another, so an L1/L2 cache design works here - ATI's quads are working on completely different regions, making it much less likely that they'll need to share much texture data between one another.
 
A couple of months ago the distribution algorithm for quads to shader units implemented in the simulator was basically round robin (skipping fully occupied shader units) on a per-quad basis. Not surprisingly, when I finally decided to check what that was doing, the bandwidth consumed for textures was N times the texture data footprint, N being the number of texture units (or shader units, as I was testing the usual 1:1 arrangement).

If you use that kind of random/round-robin distribution (not on a single-quad basis, as that also reduces the hit rate a lot) and there is a lot of texture data being accessed by many of the texture units, an L2 cache makes a lot of sense. In the case, which I think may not be that frequent, that you can keep the 'current' texture working set in the L2 (not for a whole frame for sure, but maybe for similar batches that use the same texture data but can't fit in the small L1 texture caches), this second level would also help to further reduce texture bandwidth. I would like to test the L2 arrangement but I don't have the time now.

Related to ATI: when I was testing how their texture caches worked in the R350 (and they do some really funny things that I can't fully explain), I discovered something like three bandwidth steps, the first being at 8 KB (the texture cache size); the next two (I would have to search that data, as I don't remember at which sizes they happened) were like a gradual reduction towards the available memory bandwidth, which made me wonder whether they really implemented an L2 or not. Of course there was a fourth step when you hit AGP bandwidth.
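A toy model of that effect (all sizes invented; it just shows why per-quad round robin tends towards N times the footprint even with ideal per-unit caches):

Code:
# Toy model: with per-quad round-robin distribution, even a perfect
# (infinitely large) L1 per unit still has to fetch nearly every line of
# the texture at least once, so total traffic approaches N x footprint.
# All sizes here are invented.
import random

N_UNITS = 4              # texture/shader units
LINES   = 1024           # texture footprint, in cache lines
QUADS   = 32768          # quads rendered; each samples one line (simplified)

random.seed(0)
fetched = [set() for _ in range(N_UNITS)]    # distinct lines fetched per unit

for q in range(QUADS):
    unit = q % N_UNITS                       # round robin, per quad
    line = random.randrange(LINES)           # texture line this quad needs
    fetched[unit].add(line)

traffic = sum(len(f) for f in fetched)
print(traffic, "line fetches vs a footprint of", LINES, "lines")   # ~N x footprint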

With 16x16 tiles, texture units only share data at the borders, at a much reduced rate. Now, when I test 8x8 tiles as the distribution unit, it reduces the excess texture bandwidth consumed.

However, even if tile-based shader work distribution can help with data access (and removes the requirement for a crossbar to route the quads back to their proper ROPs or MCs), random/round-robin distribution is better for load balancing in the shader units (the checkerboard case, for example :)). The tile algorithm has the danger, when queues aren't large enough, of imbalance between the pipelines, or of a slow start for some of the pipelines if the first fragments/triangles miss them. Which is better? It depends.
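The checkerboard case in toy form (tile grid, pipeline count and costs are all invented):

Code:
# Toy illustration of the checkerboard load-balancing hazard: with a static
# screen-tile -> pipeline assignment, a workload where alternate tiles are
# expensive lands all of the heavy tiles on half of the pipelines, while a
# round-robin distribution spreads them evenly. All numbers invented.
W = H = 8                  # 8x8 grid of screen tiles
PIPES = 4

def cost(x, y):
    return 100 if (x + y) % 2 == 0 else 1    # checkerboard: heavy/light tiles

static = [0] * PIPES       # fixed 2x2 interleave of tiles over the pipelines
rrobin = [0] * PIPES       # round robin over tiles, ignoring position
tiles = [(x, y) for y in range(H) for x in range(W)]

for i, (x, y) in enumerate(tiles):
    static[(x % 2) + 2 * (y % 2)] += cost(x, y)
    rrobin[i % PIPES] += cost(x, y)

print("static tiling:", static)    # [1600, 16, 16, 1600] - badly skewed
print("round robin  :", rrobin)    # [808, 808, 808, 808] - balanced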
 
Xmas said:
I guess the optimal distribution of textures across channels also depends on how multiple textures of different size are used together in shaders.

I'm basing my ideas on this:

http://www.graphicshardware.org/presentations/bando-hexagonal-gh05.pdf

It refers to both texture storage and framebuffer organisation across memory channels.

But how is texturing related to having ROPs tied to MCs? And I'm not sure about what you consider a "memory tile".

I am suggesting that if texturing is "multi-tile" and a large consumer of bandwidth, optimisations for ROPs tied to MCs would prolly be sub-optimal for the same kinds of reasons that single-tiled texturing wouldn't make sense (though that's impossible, anyway).

A tile is a contiguous region of memory in one channel, corresponding to one or more units of burst. So a tile might equal the minimum burst-length in a channel, or a multiple of that.
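For concreteness, the tile sizes that definition implies (the channel widths and burst lengths below are just ones that have come up in this thread, not confirmed R520 figures):

Code:
# Tile sizes implied by "tile = channel width x one or more bursts".
for channel_bits in (32, 64):
    for burst in (4, 8):
        tile_bytes = channel_bits // 8 * burst
        print(f"{channel_bits}-bit channel, burst {burst}: {tile_bytes}-byte minimum tile")
# e.g. a 32-bit channel with a burst length of 8 gives a 32-byte minimum tile.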

I'm not sure. I don't know enough about this subject, or about the typical constraints on memory access: banking, paging, bursting etc. :cry:

Small textures? DXT1/3Dc compressed textures consist of 4x4 texel tiles that are encoded in a 64 bit block. That doesn't mean the textures are small.
:oops:

ATI claims a best-case 24:1 compression ratio for Z data (with 6xAA enabled, so this means 8 bit per pixel). This compression is block- or tile-based. With tiles of 4x4 pixels, we get 128 bit per Z-tile, which is 32 bits times a burst length of 4.
Earlier I was forgetting that these memory devices have a burst length of 8, I think.

It would be so much easier if Eric would explain it all :!:

Jawed
 
Dave Baumann said:
There's no point. With NV4x/G70's dispatch the quads are likely to be working on regions very close to one another, so an L1/L2 cache design works here - ATI's quads are working on completely different regions, making it much less likely that they'll need to share much texture data between one another.
What about ground textures (e.g. repeating cobbles) - won't there be multiple instances in the texture caches of ATI GPUs?

Jawed
 
Jawed said:
What about ground textures (e.g. repeating cobbles) - won't there be multiple instances in the texture caches of ATI GPUs?

Jawed

But no penalty for this for R5XX series, right? Don't know about previous gen.
 
RoOoBo said:
However, even if tile-based shader work distribution can help with data access (and removes the requirement for a crossbar to route the quads back to their proper ROPs or MCs), random/round-robin distribution is better for load balancing in the shader units (the checkerboard case, for example :)). The tile algorithm has the danger, when queues aren't large enough, of imbalance between the pipelines, or of a slow start for some of the pipelines if the first fragments/triangles miss them. Which is better? It depends.
I think the other missing ingredient in this discussion is how the NVidia and ATI architectures construct batches.

If a triangle is too small to fill a batch (i.e. there are fewer quads in the triangle than the nominal batch size for the architecture), does the GPU fill the batch with more triangles (e.g. the succeeding triangles in a mesh)? Or is the empty space in the batch just entirely lost cycles?

Jawed
 
ERK said:
But no penalty for this for R5XX series, right? Don't know about previous gen.
I'm not sure how you conclude that, since R5xx has the same per-quad texture cache organisation as R3xx...R4xx.

The difference in R5xx is that the caches are larger (I think) and fully associative.

Jawed
 
Jawed said:
I think the other missing ingredient in this discussion is how the NVidia and ATI architectures construct batches.

If a triangle is too small to fill a batch (i.e. there are fewer quads in the triangle than the nominal batch size for the architecture), does the GPU fill the batch with more triangles (e.g. the succeeding triangles in a mesh)? Or is the empty space in the batch just entirely lost cycles?

Jawed

Seems like it would be incredibly wasteful to not fill the batch, but again, I suppose it depends on how hard it is to fill the batch with triangles from the succeeding mesh. It probably also depends on how much space is left. Do you worry about it if you can only cram one more triangle in?

Speaking of which, how big are the triangle batches? Do we know?

Nite_Hawk
 