Recent Radeon X1K Memory Controller Improvements in OpenGL with AA

Nite_Hawk · Oct 13, 2005

acrh2 said:
It would be better to implement multiple configurations via game profiles, like in Forceware drivers.

That would only let you choose a configuration based on the game, not based on the scene. Besides, you are then tied to using a pre-existing configuration. The idea is that you want this to be dynamic and automatic. The user has no idea which configuration will give the best speed increase without trying all of them. A model tree will let you take a bunch of input parameters (like rops) and try to predict, based on all of those attributes, what the best configuration would be. What's better is that you can re-run the model tree as often as you like to determine whether you are still using the optimal configuration given the current input parameters.
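For anyone who hasn't met model trees, here is a minimal sketch of the idea, not anything taken from a real driver. It assumes one small tree per candidate configuration, each predicting frame time from scene parameters, and picks whichever configuration predicts the lowest. The feature names (`aa_samples`, `rop_writes_m`), the configuration names and every coefficient are made up for illustration; a real model would be trained on measured data.

```python
from dataclasses import dataclass
from typing import Dict, Union

Features = Dict[str, float]


@dataclass
class Leaf:
    """Linear model at a leaf: frame time = bias + sum(weight * feature)."""
    bias: float
    weights: Dict[str, float]

    def predict(self, x: Features) -> float:
        return self.bias + sum(w * x.get(name, 0.0) for name, w in self.weights.items())


@dataclass
class Split:
    """Internal node: route on one feature, then defer to the chosen subtree."""
    feature: str
    threshold: float
    below: "Node"
    above: "Node"

    def predict(self, x: Features) -> float:
        branch = self.below if x.get(self.feature, 0.0) <= self.threshold else self.above
        return branch.predict(x)


Node = Union[Leaf, Split]

# Hypothetical per-configuration models (predicted frame time in ms).
# All numbers are invented for the example.
MODELS: Dict[str, Node] = {
    "mc_config_a": Split("aa_samples", 2.0,
                         below=Leaf(4.0, {"rop_writes_m": 0.5}),
                         above=Leaf(6.0, {"rop_writes_m": 1.5})),
    "mc_config_b": Split("aa_samples", 2.0,
                         below=Leaf(5.0, {"rop_writes_m": 0.5}),
                         above=Leaf(5.5, {"rop_writes_m": 1.0})),
}


def best_config(x: Features) -> str:
    """Return the configuration whose model predicts the lowest frame time."""
    return min(MODELS, key=lambda name: MODELS[name].predict(x))


if __name__ == "__main__":
    print(best_config({"aa_samples": 0.0, "rop_writes_m": 2.0}))
    print(best_config({"aa_samples": 4.0, "rop_writes_m": 2.0}))
```

Since `best_config` is cheap, it could be re-run every so often with fresh counters, which is the "re-run as often as you like" part.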

Please don't take this the wrong way, but game profiles à la the Forceware drivers would be a rather crude solution in comparison.

Nite_Hawk

sireric · Oct 13, 2005

kemosabe said:
Eric, would the 128-bit bus of RV530 theoretically stand to benefit (relatively) more or less from these bandwidth-saving memory controller tweaks?

It's even more sensitive to these programs/parameters. I'm not sure of the effect of this quick fix, but I would guess it would help. Once we've done a first round of tuning, I'm certain it will benefit too.

MulciberXP · Oct 13, 2005

does this fix have a negative impact on D3D performance?

sireric · Oct 13, 2005

MulciberXP said:
does this fix have a negative impact on D3D performance?

This is independent (completely) of D3D -- D3D has a different driver and a different MC settings system. Though, there are lots of D3D improvements coming up too...

kemosabe · Oct 13, 2005

sireric said:
It's even more sensitive to these programs/parameters. I'm not sure of the effect of this quick fix, but I would guess it would help. Once we've done a first round of tuning, I'm certain it will benefit too.

Thanks - best of luck to you and Terry's team in extracting all the performance juice from this discovery.

Nite_Hawk · Oct 13, 2005

Btw Sireric,

When you guys implemented the programmable memory system did you have any idea you'd be able to get this kind of speed increase out of it? That seems like a very bright decision after the fact... Is someone getting a raise?

Nite_Hawk

Skrying · Oct 13, 2005

Very awesome, makes it hard to find a fault now in the X1800XT (once it's available) performance-wise.

I'm curious, how programmable is the 7800GTX memory controller compared to the X1800's?

I'm really looking forward to more tweaks in the MC, seems there is still a lot of performance to be unlocked in the X1Ks.

jb · Oct 13, 2005

sireric said:
No luck at all. More like being dumb for not doing this earlier. It was very simple. It just required people to start looking into what is going on.

Bah!!! I have been on many product launches and this type of stuff always happens. Every time we get to go back and look at a product I can start shaving cost or increasing performance, so don't feel bad. When you're under the gun you go with what works and meets specs first, then come back and make it mo' better.

Keep up the good work!

Rys · Oct 13, 2005

Nite_Hawk said:
That seems like a very bright decision after the fact... Is someone getting a raise?

Joe doesn't need any more Ferraris

Bouncing Zabaglione Bros. · Oct 13, 2005

The new memory controller has just become ATI's (no longer secret) superweapon. The potential for strongly improving performance in all sorts of places has just become pretty significant.

sireric · Oct 13, 2005

Nite_Hawk said:
Btw Sireric,

When you guys implemented the programmable memory system did you have any idea you'd be able to get this kind of speed increase out of it? That seems like a very bright decision after the fact... Is someone getting a raise?

Nite_Hawk

We learned a lot from previous architectures on what works and what doesn't. Could I have guessed a change would push D3 up 30%? Well, I expected, off the bat, that D3 should have been much better than it was. And we still have lots of work.

trinibwoy · Oct 13, 2005

Rys said:
Just a quick note to say (and I'll update the article to say so) that the GTX was a reference board at 430/600, since that makes a difference to some folks and people with higher clocked retail products.

Are you testing on an XL as well?

Nite_Hawk · Oct 13, 2005

sireric said:
We learned a lot from previous architectures on what works and what doesn't. Could I have guessed a change would push D3 up 30%? Well, I expected, off the bat, that D3 should have been much better than it was. And we still have lots of work.

Well, good luck with it! I have to say I'm a bit envious. You don't get performance gains like this on established products very often. It makes you feel almost giddy. Maybe some day I'll be able to sell some of you guys at ATI or nVidia on my data mining performance ideas. I actually built some models and wrote a paper in college to predict framerate scores based on videocard/cpu/memory configurations. It was really neat.

Nite_Hawk

Jawed · Oct 13, 2005

I have to say I'm more intrigued to know if the ultra threaded despatch processor (or whatever it's called) is programmable and what effects we might see from that.

And what kind of relationship exists between the UTDP and the MC. There must be a fair degree of symbiosis there.

Jawed

ferro · Oct 13, 2005

sireric said:
The MC also looks at the DRAM activity and settings, and since it can "look" into the future for all clients, it can be told different algorithms and parameters to help it decide how to best make use of the available BW.

The MC can look into the future? Could you please provide its algorithms to all elevator manufacturing companies? It would be great if an elevator is already there when you press the button.

Thanks in advance.

AlphaWolf · Oct 13, 2005

ferro said:
The MC can look into the future? Could you please provide its algorithms to all elevator manufacturing companies? It would be great if an elevator is already there when you press the button.

Thanks in advance.

They probably could make elevators more efficient using prediction.

Geo · Oct 13, 2005

http://www.elevator-world.com/magazine/archive01/9606-001.htm

sireric · Oct 13, 2005

Jawed said:
I have to say I'm more intrigued to know if the ultra threaded despatch processor (or whatever it's called) is programmable and what effects we might see from that.

And what kind of relationship exists between the UTDP and the MC. There must be a fair degree of symbiosis there.

Jawed

The whole thing is one system -- all units depend on each other to work correctly. The MC requires the clients to have lots of latency tolerance so that it can establish a huge number of outstanding requests and pick and choose the best ones to maximize memory bandwidth (massive simplification).
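To make the "pick and choose" part concrete, here is a toy sketch, under heavy assumptions and not ATI's actual algorithm. It models DRAM as banks with one open row each, lets the scheduler see the `window` oldest outstanding requests, and has it prefer one that hits an already-open row (a simple first-ready-style policy swapped in for illustration). `Request`, `count_row_switches` and the traffic pattern are all hypothetical.

```python
from collections import namedtuple

# A hypothetical memory request: which client issued it and which DRAM bank/row it targets.
Request = namedtuple("Request", ["client", "bank", "row"])


def count_row_switches(requests, window):
    """Serve every request, choosing each time from the `window` oldest pending ones.

    Prefers the oldest visible request whose row is already open in its bank;
    if none hits, falls back to the oldest request. Returns how many times a
    row had to be opened (the expensive operation in this toy model).
    """
    pending = list(requests)
    open_rows = {}  # bank -> currently open row
    switches = 0
    while pending:
        visible = pending[:window]
        pick = next((r for r in visible if open_rows.get(r.bank) == r.row), visible[0])
        if open_rows.get(pick.bank) != pick.row:
            switches += 1
            open_rows[pick.bank] = pick.row
        pending.remove(pick)
    return switches


# Two clients interleaving accesses to different rows of the same bank.
traffic = []
for _ in range(4):
    traffic.append(Request("texture", bank=0, row=0))
    traffic.append(Request("colour", bank=0, row=1))

if __name__ == "__main__":
    print("in-order (window=1):", count_row_switches(traffic, window=1))
    print("windowed (window=8):", count_row_switches(traffic, window=8))
```

With a window of 1 the controller is forced to serve requests in arrival order and thrashes between rows; with a deep window it can group the hits. The catch is that the reordered clients wait longer for their data, which is why they need the latency tolerance.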

However, texture ends up being an MC client but also has the shader dependent on it. Consequently, if the MC wants high latency, the shader has to be designed to deal with that.

There are two reasonable ways to deal with that. You can either have large batch sizes of pixels, in which case you hide the latency of fetches, more or less, just by doing the same thing over and over on many pixels before going to the next thing. This would be an architecture that, say, executes the same pixel shader instruction on 1000s of pixels. This works well to hide latency, and is somewhat cheap, area-wise. However, it suffers granularity loss, since it has to work in large batches. This would make for a good SM2 type part. The new way is to make small batches, but have lots of them. So you execute one instruction on a small batch (say 16 pixels), then switch to another instruction and batch until the data for the first one returns. You need to have lots of live threads in this type of architecture, and you need lots of resources (i.e. area) for it to properly hide latency. But its advantage is that it rules from a granularity standpoint, and branching (prime feature of SM3) works perfectly. That's what we did for the R5xx. I believe that the first architecture is more popular for others.
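A rough way to see why the small-batch approach needs lots of live threads is a toy cycle count, assuming a single ALU that issues one instruction per cycle and a fixed fetch latency after every instruction. This is a sketch for illustration only, not a model of R5xx; `alu_utilization` and its parameters are hypothetical.

```python
def alu_utilization(num_batches, instructions, latency):
    """Fraction of cycles the ALU is busy in a toy one-issue-per-cycle model.

    Each batch runs `instructions` instructions. After a batch issues one,
    it waits `latency` cycles for its fetch before it is ready again.
    Every cycle the scheduler issues from the first ready batch, if any.
    """
    ready_at = [0] * num_batches          # cycle at which each batch may issue again
    remaining = [instructions] * num_batches
    cycle = 0
    busy = 0
    while any(remaining):
        for b in range(num_batches):
            if remaining[b] and ready_at[b] <= cycle:
                remaining[b] -= 1
                ready_at[b] = cycle + 1 + latency
                busy += 1
                break                      # only one issue slot per cycle
        cycle += 1
    return busy / cycle


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "batches in flight:", round(alu_utilization(n, 4, 3), 2))
```

With a latency of 3 cycles, one batch leaves the ALU idle most of the time, while four batches in flight keep it fully busy. Scale the latency up to hundreds of cycles and the number of live threads (and the register storage behind them) grows accordingly, which is the area cost mentioned above.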

At the end, the whole thing works together. To achieve high memory bandwidth, you need an efficient memory controller design with windowed requests, and clients capable of dealing with long latencies. We did all this, and made our control units very programmable on top, since we knew that tuning would be difficult and that we would need the flexibility to achieve high efficiency (we did not trust that we would get it right with the first set of settings/programs). It also allows us to experiment and try new things, so that we'll be more ready for the future.

Edit: corrected some terrible grammar.

ERK · Oct 13, 2005

This is great.
Heh, isn't B3D a great site?

ERK · Oct 13, 2005

BTW, my boss would NEVER let me talk about that kind of info.
