Recent Radeon X1K Memory Controller Improvements in OpenGL with AA

krychek said:
There has been some work already on GPGPU-friendly libraries built on top of the graphics APIs (Brook, Sh). But these only make the GPGPU programmer's job easier - they don't expose any special features of a particular chip (as you mentioned). Once the functionality of a GPU becomes almost fixed, we could definitely do with a high-level non-graphics API. But until then, I don't see how a common API can be designed that offers all the capabilities of NV and ATI. ATI's GPU has support for scatter this generation but NV's does not - how will this be handled? An extension? This will just slow down or dumb down the API. OGL doesn't change fast enough because it has to be backward compatible, and hence only really consistent extensions make it into the core.

The reason for the low-level APIs is to immediately provide low-level access to an architecture without caring about backward compatibility. This takes the burden off the IHVs too, and it's up to the community/academia to come up with an API on top of this that is hardware-architecture independent. In this scenario, there is no room for IHVs to disagree with each other or worry about any backward compatibility (other than the graphics APIs), and they can just focus on the hardware.

If it happens that all IHVs agree on the architecture then obviously we would ask for a high-level vendor-agnostic API :D.
I agree with you, there are many things still to be done. I was discussing this very subject with Sireric in Ibiza, and he said that there were many things to take into consideration for this sort of application; for instance, one parameter to consider is the fact that, currently, the graphics driver works in a certain way because it is designed with game-playing in mind. So certain optimizations, although very valid for games, tend to give totally different results when used by a GPGPU-type application - and if that's not enough, think about different driver revisions which introduce new optimizations, or alter existing ones :D

The reassuring part is that ATI seems very committed to this whole thing, and I believe that, with academia actively working on this, we will have some very pleasant surprises within 2006 :)
 
Kombatant said:
The reassuring part is that ATI seems very committed to this whole thing, and I believe that, with academia actively working on this, we will have some very pleasant surprises within 2006 :)

Who in academia is working on this? Any links?

Nite_Hawk
 
Nite_Hawk said:
Who in academia is working on this? Any links?

Nite_Hawk
If you notice the GPGPU people in the right column of the site, you will see that some of them are students doing their PhD theses :)
 
RoOoBo said:
(I wonder if it would be better named as tile generator, what do they send to HZ test? fragments or a single tile that later may be further divided into quads?)
Each quad actually contains its own HiZ (which is why its effectiveness at higher resolutions actually scales with the number of quads), so each HiZ will only see the tiles its quad is operating on, AFAIK.
 
RoOoBo said:
On another matter, Jawed seems to be trying to explain that if you assign work on a framebuffer-tile basis to completely separated quad pixel pipelines you get some interesting benefits. The first one being no need for cache coherence between the pipelines, as each pipeline never ever touches a single bit of a framebuffer region belonging to another pipeline. Then you could assign the ROP in a pipeline to a MC, if you had enough of them, providing something like semi-dedicated bw per pipeline, which may or may not be a good idea. Each pipeline's ROP would be accessing its own separate portion of memory, and mostly (at least as far as the ROPs are concerned) having its own separate share of bw, reducing conflicts and similar stuff. One possible problem would be load imbalance if the queues around the pipeline aren't large enough, or if someone with very bad intentions only renders to the regions assigned to a single pipeline ;) (but what is the point of chess-like rendering?).
This isn't a problem specific to having ROPs assigned to MCs. Even if the ROPs have free access to any MC, that doesn't help if all tiles that need to be written belong to one MC.
However, having no fixed ROP-MC relation potentially gives you more freedom in optimizing the framebuffer layout/distribution across multiple channels for a specific bandwidth requirement mix.
 
Kombatant said:
If you notice the GPGPU people in the right column of the site, you will see that some of them are students doing their PhD theses :)

Ack, I hadn't had time yet to visit the site! :oops:

Some of the research papers on that site look positively delicious! :devilish: The real-time global illumination work looks especially interesting. I need to get my home computer set up again so I can start working on a raytracer again. I was going to try to implement KD-trees, but it sounds like, at least as far as GPUs are concerned, the bounding volume hierarchy traversal technique is better...

So much to read, so little time!!! UGGH!!!! I need those 256 hour days Rys was talking about...

Nite_Hawk
 
RoOoBo said:
Then you could assign the ROP in a pipeline to a MC, if you had enough of them, providing something like semi-dedicated bw per pipeline, which may or may not be a good idea. Each pipeline's ROP would be accessing its own separate portion of memory, and mostly (at least as far as the ROPs are concerned) having its own separate share of bw, reducing conflicts and similar stuff.
The primary reason why I suspect this kind of organisation is not implemented in R520 is textures. Since MC channels enforce a tiling scheme in memory, and since textures will be spread randomly across tiles, a large portion of all memory accesses will not be channel-coherent.

Indeed, memory accesses are best spread across as many channels as possible for texture reads, as far as I can tell.

Additionally, R520's Ring Bus is designed explicitly to support a many-to-many relationship between memory clients and memory tiles.

R520's 32-bit channels make the minimum memory access half the size of previous GPUs'. But so far we have no clear description of the specific benefits that affords. Particularly as there seems to be a general push within R520 to "bulk up" memory accesses to make the best use of the latency-tolerant clients (texture pipes, RBEs, Hierarchical-Z ...). There must be certain kinds of tasks that thrive on these very small memory accesses, but what are they :?:

At the heart of that question is how memory-tiling is utilised. If you take the 8 bytes of a typical pixel (4 bytes colour, 3 bytes Z, 1 byte stencil), it looks like it's possible to tile pixels "one byte" at a time - so a single quad of pixels is spread across all 8 memory channels, 4 bytes at a time per memory tile. 8 separate tiles altogether.
  • Tile 1 - one byte of red for four pixels
  • Tile 2 - one byte of green for four pixels
  • ...
  • Tile 5 - byte 1 (out of 3) of Z for four pixels
  • ...
  • Tile 8 - stencil byte for four pixels
But that's just a guess (though a recent ATI patent suggests something pretty similar as far as I can tell)... (How does memory tiling work with AA? And FP16 HDR?...)
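
To make the guess concrete, here's a trivial sketch of the mapping I'm imagining (the function and constants are pure invention on my part, nothing ATI has described):

Code:
#include <stdio.h>

/* Hypothetical byte-plane tiling as guessed above: byte b (0..7) of
 * every pixel goes to channel b, so one 32-byte quad of pixels is
 * spread 4 bytes per channel across all 8 channels. */
#define NUM_CHANNELS 8  /* 8 x 32-bit channels */
#define QUAD_PIXELS  4

/* bytes 0-3 = RGBA colour, bytes 4-6 = Z, byte 7 = stencil */
static void map_quad_byte(int pixel, int byte, int *channel, int *offset)
{
    *channel = byte;   /* one byte-plane per channel        */
    *offset  = pixel;  /* 4 bytes per channel for each quad */
}

int main(void)
{
    for (int p = 0; p < QUAD_PIXELS; p++)
        for (int b = 0; b < NUM_CHANNELS; b++) {
            int ch, off;
            map_quad_byte(p, b, &ch, &off);
            printf("pixel %d, byte %d -> channel %d, offset %d\n", p, b, ch, off);
        }
    return 0;
}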

Jawed
 
RoOoBo said:
In my opinion (and I don't really know much about rasterization and I had more than enough pain implementing it on the simulator) only Triangle Setup, if anything, may be shared for all the quad pipes in ATI designs. As they have explained a number of times they bin triangles to each pipe based on their tile distribution algorithm and then they seem to implement a fragment generator (I wonder if it would be better named as tile generator, what do they send to HZ test? fragments or a single tile that later may be further divided into quads?) per pipeline. So they have N rasterizers 'walking' each their own assigned triangles.
I agree with all that, rasterisation/interpolation can be multi-threaded, following directly on from set-up.

The only demerit I can think of is that texture data needs to be duplicated in texture caches multiple times when the same textures are being used across multiple screen-tiles simultaneously. I expect that happens quite a lot.

I expect there's a protocol in the MC to handle that case, so that texture reads can be aggregated if multiple pixel units (quads) request the same textures (presumably within a short timespan).
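
Something like this toy sketch is what I have in mind by "aggregated" (entirely hypothetical, all names are mine): if a second quad asks for a texture line that's already being fetched, the MC merges the requests rather than reading memory twice.

Code:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING 16

typedef struct {
    uint64_t line_addr;  /* texture cache line being fetched */
    unsigned waiters;    /* how many quads share this fetch  */
    bool     valid;
} pending_fetch_t;

static pending_fetch_t pending[MAX_PENDING];

/* Returns true if the request was merged with an in-flight fetch. */
static bool request_line(uint64_t line_addr)
{
    for (int i = 0; i < MAX_PENDING; i++)
        if (pending[i].valid && pending[i].line_addr == line_addr) {
            pending[i].waiters++;
            return true;   /* merged: no extra memory read */
        }
    for (int i = 0; i < MAX_PENDING; i++)
        if (!pending[i].valid) {
            pending[i] = (pending_fetch_t){ line_addr, 1, true };
            return false;  /* new fetch issued */
        }
    return false;          /* table full: issue anyway */
}

int main(void)
{
    printf("quad A merged: %d\n", request_line(0x1000)); /* 0 - new fetch */
    printf("quad B merged: %d\n", request_line(0x1000)); /* 1 - merged    */
    return 0;
}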

Jawed
 
Jawed said:
The only demerit I can think of is that texture data needs to be duplicated in texture caches multiple times when the same textures are being used across multiple screen-tiles simultaneously. I expect that happens quite a lot.

I expect there's a protocol in the MC to handle that case, so that texture reads can be aggregated if multiple pixel units (quads) request the same textures (presumably within a short timespan).
That's what the L2 cache does, on NV4x.
 
Bob said:
That's what the L2 cache does, on NV4x.
And what's interesting is that ATI haven't bothered with an L1/L2 design since R300, so there must be some secret sauce we're missing.

Jawed
 
Jawed said:
The primary reason why I suspect this kind of organisation is not implemented in R520 is textures. Since MC channels enforce a tiling scheme in memory, and since textures will be spread randomly across tiles, a large portion of all memory accesses will not be channel-coherent.
ROPs being tied to MCs doesn't mean TMUs are. I'm not sure which disadvantage you see here.

There must be certain kinds of tasks that thrive on these very small memory accesses, but what are they :?:
Compressed texture tiles (64 bits per 4x4 tile for DXT1 and 3Dc).
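For reference, a sketch of how one DXT1 block packs a 4x4 texel tile into 64 bits (field names are mine; the layout is the published S3TC format):

Code:
#include <stdint.h>

/* One DXT1 (S3TC) block: a 4x4 texel tile in 64 bits. */
typedef struct {
    uint16_t color0;   /* endpoint colour 0, RGB565       */
    uint16_t color1;   /* endpoint colour 1, RGB565       */
    uint32_t indices;  /* 16 texels x 2-bit palette index */
} dxt1_block_t;        /* 8 bytes == 64 bits per 4x4 tile */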

At the heart of that question is how memory-tiling is utilised. If you take the 8 bytes of a typical pixel (4 bytes colour, 3 bytes Z, 1 byte stencil), it looks like it's possible to tile pixels "one byte" at a time - so a single quad of pixels is spread across all 8 memory channels, 4 bytes at a time per memory tile. 8 separate tiles altogether.
  • Tile 1 - one byte of red for four pixels
  • Tile 2 - one byte of green for four pixels
  • ...
  • Tile 5 - byte 1 (out of 3) of Z for four pixels
  • ...
  • Tile 8 - stencil byte for four pixels
But that's just a guess (though a recent ATI patent suggests something pretty similar as far as I can tell)... (How does memory tiling work with AA? And FP16 HDR?...)

Jawed
Then why multiple channels at all? I think this distribution would be a very bad idea: it increases granularity, and every framebuffer access means switching pages on every memory channel.
 
Jawed said:
At the heart of that question is how memory-tiling is utilised. If you take the 8 bytes of a typical pixel (4 bytes colour, 3 bytes Z, 1 byte stencil), it looks like it's possible to tile pixels "one byte" at a time - so a single quad of pixels is spread across all 8 memory channels, 4 bytes at a time per memory tile. 8 separate tiles altogether.
  • Tile 1 - one byte of red for four pixels
  • Tile 2 - one byte of green for four pixels
  • ...
  • Tile 5 - byte 1 (out of 3) of Z for four pixels
  • ...
  • Tile 8 - stencil byte for four pixels
But that's just a guess (though a recent ATI patent suggests something pretty similar as far as I can tell)... (How does memory tiling work with AA? And FP16 HDR?...)

Jawed

Jawed,

Forgive me as I'm not as well versed in this as you. Any mistakes I make are quite unintentional. ;)

Well, the arrangement you demonstrate seems like it would be very good for 8-byte pixels, as everything lines up nicely. For something like FP16 pixels (same amount of Z/stencil data?), it seems the downside would be that your tiles would no longer align; all of a sudden you'd have partially empty channels with Z/stencil information because you are waiting on RGB data.

So perhaps this kind of arrangement doesn't make sense for FP16 pixels. Perhaps instead you'd want a packed format, where you break your pixels up first into 4-byte boundaries to send via separate channels, and then into 32-byte boundaries. So 8 pixels in 3 transfers rather than 4 pixels in one...
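
Just to check my own arithmetic, a quick sketch of how many whole 32-byte rows (8 channels x 4 bytes each) a block of pixels fills in each case (all numbers follow from the guesses in this thread, nothing official):

Code:
#include <stdio.h>

#define ROW_BYTES 32  /* 8 channels x 4 bytes */

static void fit(const char *fmt, int bytes_per_pixel, int pixels)
{
    int total = bytes_per_pixel * pixels;
    printf("%s: %d pixels = %3d bytes = %d x 32B rows, remainder %d\n",
           fmt, pixels, total, total / ROW_BYTES, total % ROW_BYTES);
}

int main(void)
{
    fit("RGBA8 + Z/stencil (8B/pixel) ", 8, 4);  /* a quad fills 1 row exactly */
    fit("FP16  + Z/stencil (12B/pixel)", 12, 4); /* a quad leaves 16B spare    */
    fit("FP16  + Z/stencil (12B/pixel)", 12, 8); /* 8 pixels fill 3 rows       */
    return 0;
}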

Nite_Hawk
 
Xmas said:
ROPs being tied to MCs doesn't mean TMUs are. I'm not sure which disadvantage you see here.
No, I was only saying that texture reads, being a large proportion of all memory accesses, wouldn't benefit from a design where all textures "fit" into a single tile. They can't, anyway, obviously.

The consensus, as far as I can tell, is that the best performance with large textures is for them to be spread across all memory tiles fairly evenly.

Compressed texture tiles (64 bits per 4x4 tile for DXT1 and 3Dc).
Sadly I've got no idea what proportion of a game's textures (or workload, if you prefer) would consist of such small textures :oops:

Then why multiple channels at all? I think this distribution would be a very bad idea: it increases granularity, and every framebuffer access means switching pages on every memory channel.
That's what's puzzling me - it looks nice to start with, but is such fine granularity in the back-buffer ever desirable? I don't think so either... It contradicts my earlier points about the CBC being used to deal with large areas of pixels instead of the RBE working on quads.

What are the chances we'll find out from ATI how render targets are packed into memory tiles?...

Jawed
 
Nite_Hawk said:
Forgive me as I'm not as well versed in this as you. Any mistakes I make are quite unintentional. ;)

Well, the arrangement you demonstrate seems like it would be very good for 8-byte pixels, as everything lines up nicely. For something like FP16 pixels (same amount of Z/stencil data?), it seems the downside would be that your tiles would no longer align; all of a sudden you'd have partially empty channels with Z/stencil information because you are waiting on RGB data.

So perhaps this kind of arrangement doesn't make sense for FP16 pixels. Perhaps instead you'd want a packed format, where you break your pixels up first into 4-byte boundaries to send via separate channels, and then into 32-byte boundaries. So 8 pixels in 3 transfers rather than 4 pixels in one...
It was a "loaded guess" :devilish: designed to prompt a bit of discussion.

The problem we've got is that we're way out in the dark. Back-buffer memory-tiling is highly relevant to ATI's OGL AA performance gains, but I think we're out of luck as far as definitive answers are concerned.

But yes, you're right, FP16 isn't a neat fit.

And AA looks awkward too: AA comes in 2xAA lumps (each AA sample is 8 bytes, so that's 16 bytes), so 2xAA and 6xAA get messy.

One solution to the mess, naturally, is an asymmetric packing - where 8 or 16 or more pixels in a block solve the "page" problem for non-RGBA8 back-buffers. In which case CBC comes into its own. Erm... :p
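
The same back-of-envelope arithmetic for AA, assuming 8 bytes per sample as above (speculative, of course):

Code:
#include <stdio.h>

int main(void)
{
    const int samples[] = { 1, 2, 4, 6 };
    for (int i = 0; i < 4; i++) {
        int bytes = samples[i] * 8;  /* 8 bytes per AA sample */
        printf("%dxAA: %2d bytes/pixel = %d x 32B rows, remainder %d\n",
               samples[i], bytes, bytes / 32, bytes % 32);
    }
    return 0;
}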

Jawed
 
Jawed said:
And AA looks awkward too: AA comes in 2xAA lumps (each AA sample is 8 bytes, so that's 16 bytes), so 2xAA and 6xAA get messy.

One solution to the mess, naturally, is an asymmetric packing - where 8 or 16 or more pixels in a block solve the "page" problem for non-RGBA8 back-buffers. In which case CBC comes into its own. Erm... :p

This may be related to why there is no difference at 2x and 6xAA in ATI's OGL-AA tweak:

http://www.guru3d.com/news.html#3182

:devilish:

Jawed
 
Jawed said:
No, I was only saying that texture reads, being a large proportion of all memory accesses, wouldn't benefit from a design where all textures "fit" into a single tile. They can't, anyway, obviously.

The consensus, as far as I can tell, is that the best performance with large textures is for them to be spread across all memory tiles fairly evenly.
I guess the optimal distribution of textures across channels also depends on how multiple textures of different size are used together in shaders.

But how is texturing related to having ROPs tied to MCs? And I'm not sure what you consider a "memory tile".

Sadly I've got no idea what proportion of a game's textures (or workload, if you prefer) would consist of such small textures :oops:
Small textures? DXT1/3Dc compressed textures consist of 4x4 texel tiles that are encoded in a 64-bit block. That doesn't mean the textures are small.

That's what's puzzling me - it looks nice to start with, but is such fine granularity in the back-buffer ever desirable? I don't think so either... It contradicts my earlier points about the CBC being used to deal with large areas of pixels instead of the RBE working on quads.

What are the chances we'll find out from ATI how render targets are packed into memory tiles?...
ATI claims a best-case 24:1 compression ratio for Z data (with 6xAA enabled, so this means 8 bits per pixel). This compression is block- or tile-based. With tiles of 4x4 pixels, we get 128 bits per Z-tile, which is 32 bits times a burst length of 4.
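
Working through those numbers (just re-deriving the arithmetic above):

Code:
#include <stdio.h>

int main(void)
{
    int uncompressed = 6 * 32;             /* 6xAA Z: 192 bits per pixel   */
    int compressed   = uncompressed / 24;  /* 24:1 best case: 8 bits/pixel */
    int tile_bits    = compressed * 4 * 4; /* 4x4 tile: 128 bits           */
    printf("%d -> %d bits/pixel, %d bits/tile = %d x 32-bit bursts\n",
           uncompressed, compressed, tile_bits, tile_bits / 32);
    return 0;
}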
 
Jawed said:
And what's interesting is that ATI haven't bothered with an L1/L2 design since R300, so there must be some secret sauce we're missing.
There's no point. With NV4x/G70's dispatch, the quads are likely to be working in regions very close to one another, so an L1/L2 cache design works there - ATI's quads are working on completely different regions, making it much less likely that they'll need to share much texture data with one another.
 
A couple of months ago the distribution algorithm for quads to shader units implemented in the simulator was basically round robin (skipping fully occupied shader units) on a per-quad basis. Not surprisingly, when I finally decided to check what that was doing, the bandwidth consumed for textures was N times the texture data footprint, N being the number of texture units (or shader units, as I was testing the usual 1:1 arrangement). If you used that kind of random/round-robin distribution (not on a single-quad basis, as that also reduces the hit rate a lot) and there is a lot of texture data being accessed by many of the texture units, an L2 cache makes a lot of sense. In the case, which I think may not be that frequent, that you can keep the 'current' texture working set in the L2 (not for a whole frame for sure, but maybe for similar batches that use the same texture data but that can't be stored in the small L1 texture caches), this second level would also help to further reduce texture bandwidth. I would like to test the L2 arrangement but I don't have the time now.

Related to ATI: when I was testing how their texture caches worked in the R350 (and they do some really funny things that I can't fully explain), I discovered something like three bandwidth steps, the first being at 8 KB (the texture cache size); the next two (I would have to search for that data as I don't remember at which sizes they happened) were like a gradual reduction towards the available memory bandwidth, which made me wonder whether they really implemented an L2 or not. Of course there was a fourth step when you hit AGP bandwidth.

With 16x16 tiles, texture units only share data at the tile borders, and at a much reduced rate. When I test 8x8 tiles as the distribution unit, the excess texture bandwidth consumed is reduced.

However, even if tile-based shader work distribution can help with data access (and removes the requirement for a crossbar to route the quads back to their proper ROPs or MCs), random/round-robin distribution gives better load balancing in the shader units (the chessboard case, for example :)). The tile algorithm has the danger, when queues aren't large enough, of imbalance between the pipelines, or of a slow start for some of the pipelines if the first fragments/triangles miss them. Which is better? It depends.
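
A toy sketch of the two policies, just for illustration (the names and tile size are my own invention, not the simulator's actual code):

Code:
#include <stdio.h>

#define NUM_PIPES 4
#define TILE_SIZE 16  /* 16x16 pixel screen tiles */

/* Round robin: quad n goes to pipe n mod NUM_PIPES. */
static int assign_round_robin(int quad_index)
{
    return quad_index % NUM_PIPES;
}

/* Tiled: the screen tile containing the quad picks the pipe via a
 * simple interleave, so a pipe never touches another pipe's tiles. */
static int assign_tiled(int quad_x, int quad_y)
{
    int tx = (quad_x * 2) / TILE_SIZE;  /* quads are 2x2 pixels */
    int ty = (quad_y * 2) / TILE_SIZE;
    return (tx + ty) % NUM_PIPES;
}

int main(void)
{
    printf("quad 5, round robin -> pipe %d\n", assign_round_robin(5));
    printf("quad (20,4), tiled  -> pipe %d\n", assign_tiled(20, 4));
    return 0;
}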
 
Xmas said:
I guess the optimal distribution of textures across channels also depends on how multiple textures of different size are used together in shaders.

I'm basing my ideas on this:

http://www.graphicshardware.org/presentations/bando-hexagonal-gh05.pdf

It refers to both texture storage and framebuffer organisation across memory channels.

But how is texturing related to having ROPs tied to MCs? And I'm not sure what you consider a "memory tile".

I am suggesting that if texturing is "multi-tile" and a large consumer of bandwidth, optimisations for ROPs tied to MCs would probably be sub-optimal, for the same kinds of reasons that single-tiled texturing wouldn't make sense (though that's impossible anyway).

A tile is a contiguous region of memory in one channel, corresponding to one or more units of burst. So a tile might equal the minimum burst-length in a channel, or a multiple of that.

I'm not sure. I don't know enough about this subject, or about the typical constraints on memory access banking, paging, bursting, etc. :cry:

Small textures? DXT1/3Dc compressed textures consist of 4x4 texel tiles that are encoded in a 64-bit block. That doesn't mean the textures are small.
:oops:

ATI claims a best-case 24:1 compression ratio for Z data (with 6xAA enabled, so this means 8 bits per pixel). This compression is block- or tile-based. With tiles of 4x4 pixels, we get 128 bits per Z-tile, which is 32 bits times a burst length of 4.
Earlier I was forgetting that these memory devices have a burst length of 8, I think.

It would be so much easier if Eric would explain it all :!:

Jawed
 