Recent Radeon X1K Memory Controller Improvements in OpenGL with AA

sireric said:
There are some slides given to the press that explain some of what we do. Our new MC has a view of all the requests for all the clients over time. The "longer" the time view, the greater the latency the clients see, but the higher the BW is (due to more efficient request re-ordering). The MC also looks at the DRAM activity and settings, and since it can "look" into the future for all clients, it can be told different algorithms and parameters to help it decide how best to make use of the available BW. As well, the MC gets direct feedback from all clients as to their "urgency" level (which refers to different things for different clients, but, simplifying, tells the MC how well they are doing and how much they need their data back), and adjusts things dynamically (following programmed algorithms) to deal with this. It also gets feedback from the DRAM interface to see how well it's doing.

We are able to download new parameters and new programs to tell the MC how to service the requests and which clients' urgency is more important -- basically how to arbitrate DRAM requests between over 50 clients. The amount of programming available is very high, and it will take us some time to tune things. In fact, we can see that per application (or groups of applications), we might want different algorithms and parameters. We can change all of these in driver updates. The idea is that we generally want to maximize BW from the DRAM and maximize shader usage. If we find an app that does not do that, we can change things.

You can imagine that AA, for example, significantly changes the pattern of access and the type of requests that the different clients make (for example, Z requests jump up drastically, and so do ROPs). We need to re-tune for different configs. In this case, OGL was just not tuning AA performance well at all. We did a simple fix (it's just a registry change) to improve this significantly. In future drivers, we will do a much more proper job.

Sireric,

This is much more detailed information than I expected, thank you! Have you considered doing any data mining to match performance against access patterns? You probably could come up with some predictive models to do on-the-fly configuration based on whatever attributes you care about (say Z requests, ROPs, etc.), even for unknown situations. Sounds like it would be a very fun research project. :p

Nite_Hawk
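To make the arbitration idea above a bit more concrete, here is a very rough toy sketch of an urgency-weighted, page-aware arbiter. This is not ATI's actual algorithm; the client names, weights, and scoring rule are all invented for illustration.

```python
# Toy model of urgency-weighted arbitration between memory clients.
# Purely illustrative: the scoring rule, weights, and client names are
# guesses, not how the X1K memory controller actually works.

def pick_next_request(pending, urgency, open_page, w_urgency=1.0, w_page_hit=0.5):
    """pending:   list of (client, address) requests waiting for DRAM service
    urgency:   dict mapping client name -> how badly it needs its data back
    open_page: DRAM page left open by the previous access
    Favor urgent clients, but also reward requests that hit the open page,
    since those avoid a precharge/activate cycle."""
    def score(req):
        client, address = req
        page = address >> 12              # pretend pages are 4 KiB for the toy
        s = w_urgency * urgency.get(client, 0.0)
        if page == open_page:
            s += w_page_hit               # page hit: cheaper to service
        return s
    return max(pending, key=score) if pending else None

# Example: the urgent Z client wins here even though the texture request
# hits the already-open page.
pending = [("texture", 0x40001200), ("z_buffer", 0x80000040)]
urgency = {"texture": 0.2, "z_buffer": 0.9}
print(pick_next_request(pending, urgency, open_page=0x40001200 >> 12))
```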
 
Although I normally refrain from using this word: KUDOS to ATi, for the performance and their Open(GL)ness on this matter.

It's really appreciated and makes the X1K series a LOT more attractive all of a sudden.

And amazing is the speed at which Rys processes everything; one refresh after another, the data (and opinion) trickles in...
 
Nite_Hawk said:
Sireric,

This is much more detailed information than I expected, thank you! Have you considered doing any data mining to match performance against access patterns? You probably could come up with some predictive models to do on-the-fly configuration based on whatever attributes you care about (say Z requests, ROPs, etc.), even for unknown situations. Sounds like it would be a very fun research project. :p

Nite_Hawk

There's a rather elaborate system in place, but the issue is that access patterns vary greatly even within one application -- one scene might be dominated by a shader, while another by a single texture and another by geometry (imagine spinning around in a room). You could optimize on a per-scene basis, but that's more than we plan at this point (a lot of work). But we do plan on improving the "average" for each application. The basis for this, btw, is us measuring the internal performance of apps (in real time), and then adjusting things based on this. Multiple levels of feedback and thinking involved :)
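A minimal sketch of the kind of per-application feedback loop hinted at above. The counter names, thresholds, and preset names are invented for illustration; the real driver logic is certainly far more involved.

```python
# Hypothetical driver-side feedback: look at measured counters for the last
# few frames and nudge the MC toward a different parameter preset.
# Counter names, thresholds, and presets are made up for illustration.

def choose_preset(counters, presets, current):
    """counters: e.g. {'dram_utilization': 0.71, 'shader_stall_pct': 0.18}
    presets:  dict of named MC parameter sets; 'current' is the active one.
    Only switch on a clear signal so the controller does not thrash."""
    if counters["shader_stall_pct"] > 0.25 and counters["dram_utilization"] < 0.6:
        # Shaders are starving while the bus has headroom: bias toward latency.
        return presets.get("latency_biased", current)
    if counters["dram_utilization"] > 0.9:
        # Bus is saturated: allow deeper re-ordering to squeeze out more BW.
        return presets.get("bandwidth_biased", current)
    return current
```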
 
Rys updated (the link upstream) with Riddick numbers, a comparison to the GTX, and a hint on Serious Sam 2. Also:

Just goes to show hardware is nothing without good software. Driver development and release schedules just got interesting again.

Wonder if they are wearing "the lemon face" over at NV today, or still feeling cocky about their 512MB part.
 
geo said:
Wonder if they are wearing "the lemon face" over at NV today, or still feeling cocky about their 512MB part.

teh N000s! another crisis meeting in Amsterdam? I should go there after my ISA2004 "class" tomorrow..
 
sireric said:
This change is for the X1K family. The X1Ks have a new programmable memory controller and gfx subsystem mapping. A simple set of new memory controller programs gave a huge boost to memory-BW-limited cases, such as AA (need to test AF). We measured 36% performance improvements on D3 @ 4xAA/high res. This has nothing to do with the rendering (which is identical to before). X800s also have partially programmable MCs, so we might be able to do better there too (basically, having discovered such a large jump, we want to revisit our previous decisions).

But it's still not optimal. The work space we have to optimize memory settings and gfx mappings is immense. It will take us some time to really get the performance closer to maximum. But that's why we designed a new programmable MC. We are only at the beginning of the tuning for the X1Ks.

As well, we are determined to focus a lot more energy on OGL tuning in the coming year; shame on us for not doing it earlier.

How does the change in MC programs affect scores in other games (D3D and OGL)?
The reason I am asking is that there seems to be a large jump in frames @ 1600x1200 4xAA for Doom 3, but a smaller jump for Riddick (relatively speaking).
In other words, will the fix work only for Doom 3 and Riddick, or will it lower scores in other games?
 
What's most interesting is that in CoR the X1800XT loses way more performance going from no-AA/no-AF to 4xAA/8xAF than the 7800GTX (at 1024 and 1280).

Since that is, specifically, the inverse of what we see in pretty much every other game, I guess that means there's an awful lot of optimisation to be done in CoR.

Jawed
 
Jawed said:
What's most interesting is that in CoR the X1800XT loses way more performance going from no-AA/no-AF to 4xAA/8xAF than the 7800GTX (at 1024 and 1280).

Since that is, specifically, the inverse of what we see in pretty much every other game, I guess that means there's an awful lot of optimisation to be done in CoR.

Jawed
I think this much we already knew ;)
 
sireric said:
There's a rather elaborate system in place, but the issue is that access patterns vary greatly even within one application -- one scene might be dominated by a shader, while another by a single texture and another by geometry (imagine spinning around in a room). You could optimize on a per-scene basis, but that's more than we plan at this point (a lot of work). But we do plan on improving the "average" for each application. The basis for this, btw, is us measuring the internal performance of apps (in real time), and then adjusting things based on this. Multiple levels of feedback and thinking involved :)

Oh, but optimizing on a per scene basis would be what makes the project fun!

Still, I'm glad to hear that you are actually measuring apps in realtime and adjusting things based on that! Here's what I was thinking though:

You could start out writing an internal program to randomly vary your input parameters, take configurations you've written (or better yet, try to generate them based on some heuristic), and then get a quantitative score (memory throughput, fps, etc). You export the data, use something like Weka (a nice data mining tool) to build model trees (basically decision trees with a statistical component at the leaf nodes, which makes them able to handle co-dependent data), and do a 10-fold cross-validation test against your input data.

Once you have the model trees built, you are golden. You incorporate them into your drivers, do your internal realtime test, run your model tree, and, given all the attributes, pick the configuration that has the best performance. It would be even more interesting to dig into configurations and not treat them as a black box, but rather find out why certain configurations behave better than others and dynamically modify them...

Oh, this sounds fun. You are lucky. :D

Nite_Hawk
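For what it's worth, here is a rough sketch of the offline half of that idea, using scikit-learn's decision trees as a stand-in for Weka's model trees. The feature names and data below are fabricated; in practice the rows would come from sweeping MC parameters and logging workload attributes along with a measured score.

```python
# Hypothetical offline training step: learn to predict a performance score
# from workload attributes + MC configuration, with 10-fold cross-validation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
# Made-up columns: z_request_rate, rop_traffic, texture_traffic, mc_reorder_window
X = rng.random((n, 4))
# Fake score: a large re-order window only helps when Z/ROP traffic is high.
y = X[:, 3] * (X[:, 0] + X[:, 1]) + 0.1 * rng.standard_normal(n)

model = DecisionTreeRegressor(max_depth=5)
scores = cross_val_score(model, X, y, cv=10, scoring="r2")   # the 10-fold test
print("10-fold R^2: %.2f +/- %.2f" % (scores.mean(), scores.std()))

# At runtime, the driver would feed measured attributes for candidate
# configurations into the fitted model and pick the best predicted score.
model.fit(X, y)
candidates = rng.random((8, 4))
best = candidates[np.argmax(model.predict(candidates))]
print("best candidate config:", best)
```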
 
acrh2 said:
How does the change in MC programs affect scores in other games (D3D and OGL)?
The reason I am asking is that there seems to be a large jump in frames @ 1600x1200 4xAA for Doom 3, but a smaller jump for Riddick (relatively speaking).
In other words, will the fix work only for Doom 3 and Riddick, or will it lower scores in other games?

Why can't I edit my posts? :)
What I meant to ask was:

In other words, will the fix work for other games, leave other games unchanged, or will it lower scores in other games?
 
Nite_Hawk said:
Oh, but optimizing on a per scene basis would be what makes the project fun!

Still, I'm glad to hear that you are actually measuring apps in realtime and adjusting things based on that! Here's what I was thinking though:

You could start out writing an internal program to randomly vary your input parameters, take configurations you've written (or better yet, try to generate them based on some heuristic), and then get a quantitative score (memory throughput, fps, etc). You export the data, use something like Weka (a nice data mining tool) to build model trees (basically decision trees with a statistical component at the leaf nodes, which makes them able to handle co-dependent data), and do a 10-fold cross-validation test against your input data.

Once you have the model trees built, you are golden. You incorporate them into your drivers, do your internal realtime test, run your model tree, and, given all the attributes, pick the configuration that has the best performance. It would be even more interesting to dig into configurations and not treat them as a black box, but rather find out why certain configurations behave better than others and dynamically modify them...

Oh, this sounds fun. You are lucky. :D

Nite_Hawk

It would be better to implement multiple configurations via game profiles, like in Forceware drivers.
 
Eric, would the 128-bit bus of RV530 theoretically stand to benefit (relatively) more or less from these bandwidth-saving memory controller tweaks?
 
If you actually READ the Hexus article, they say that the OGL improvement is across the board and ALSO noticeable in SS2.

So, so far we have one 30% increase, one 15%, and one unbenchmarked one.
I'd love to read the reviews tomorrow.
 
Just a quick note to say (and I'll update the article to say so) that the GTX was a reference board at 430/600, since that makes a difference to some folks, and to people with higher-clocked retail products.
 
neliz said:
If you actually READ the Hexus article, they say that the OGL improvement is across the board and ALSO noticeable in SS2.

So, so far we have one 30% increase, one 15%, and one unbenchmarked one.
I'd love to read the reviews tomorrow.

I actually READ it.

What about D3D?
 