X1800/7800GT AA comparisons

Status
Not open for further replies.
I have to say I'm more intrigued to know if the ultra-threaded dispatch processor (or whatever it's called) is programmable and what effects we might see from that.

And what kind of relationship exists between the UTDP and the MC? There must be a fair degree of symbiosis there.

Jawed
 
sireric said:
The MC also looks at the DRAM activity and settings, and since it can "look" into the future for all clients, it can be told different algorithms and parameters to help it decide how to best make use of the available BW.
The MC can look into the future? Could you please provide its algorithms to all elevator manufacturing companies? It would be great if an elevator is already there when you press the button.

Thanks in advance.
 
ferro said:
The MC can look into the future? Could you please provide its algorithms to all elevator manufacturing companies? It would be great if an elevator is already there when you press the button.

Thanks in advance.

They probably could make elevators more efficient using prediction.
 
Jawed said:
I have to say I'm more intrigued to know if the ultra-threaded dispatch processor (or whatever it's called) is programmable and what effects we might see from that.

And what kind of relationship exists between the UTDP and the MC? There must be a fair degree of symbiosis there.

Jawed

The whole thing is one system -- all units depend on each other to work correctly. The MC requires the clients to have lots of latency tolerance so that it can establish a huge number of outstanding requests and pick and choose the best ones to maximize memory bandwidth (massive simplification).

However, texture ends up being an MC client but also has the shader dependent on it. Consequently, if the MC wants high latency, the shader has to be designed to deal with that.

There are two reasonable ways to deal with that. You can have large batches of pixels, in which case you hide the latency of fetches, more or less, just by doing the same thing over and over on many pixels before going to the next thing. This would be an architecture that, say, executes the same pixel shader instruction on thousands of pixels. This works well to hide latency, and is somewhat cheap, area-wise. However, it suffers granularity loss, since it has to work in large batches. This would make for a good SM2-type part.

The new way is to make small batches, but have lots of them. So you execute one instruction on a small batch (say 16 pixels), then switch to another instruction and batch until the data for the first one returns. You need to have lots of live threads in this type of architecture, and you need lots of resources (i.e. area) for it to properly hide latency. But its advantage is that it rules from a granularity standpoint, and branching (a prime feature of SM3) works perfectly. That's what we did for the R5xx. I believe the first architecture is more popular with others.
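The interleaving described here can be sketched with a toy scheduler model. All numbers below (thread counts, fetch latency, instructions per batch) are illustrative only, not R5xx parameters:

```python
# Toy model of latency hiding by interleaving many small batches ("threads").
# Numbers are illustrative, not actual hardware parameters.

def idle_cycles(num_threads, fetch_latency, instrs_per_thread=8):
    """Each thread alternates: issue 1 instruction, then wait on a fetch.
    One instruction issues per cycle if any thread is ready."""
    ready_at = [0] * num_threads               # cycle at which each thread can next issue
    remaining = [instrs_per_thread] * num_threads
    cycle = idle = 0
    while any(r > 0 for r in remaining):
        runnable = [t for t in range(num_threads)
                    if remaining[t] > 0 and ready_at[t] <= cycle]
        if runnable:
            t = runnable[0]                    # issue one instruction for this batch
            remaining[t] -= 1
            ready_at[t] = cycle + fetch_latency  # its next instruction waits on memory
        else:
            idle += 1                          # nothing ready: ALU bubble
        cycle += 1
    return idle

for n in (1, 4, 16, 64):
    print(n, "threads ->", idle_cycles(n, fetch_latency=16), "idle cycles")
```

With one batch in flight the ALU stalls on every fetch; once there are enough live batches to cover the fetch latency, the idle cycles drop to zero -- which is why this style needs lots of on-chip state (area) for thread contexts.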

In the end, the whole thing works together. To achieve high memory bandwidth, you need an efficient memory controller design with windowed requests, and clients capable of dealing with long latencies. We did all this, and made our control units very programmable on top -- since we knew that tuning would be difficult and that we'd need the flexibility to achieve high efficiency (we did not trust that we would get it right with the first set of settings/programs :)). It also allows us to experiment and try new things, so that we'll be more ready for the future.
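The "windowed requests" idea can be illustrated with a toy arbiter that looks over a small window of outstanding requests and prefers ones hitting the currently open DRAM row. The costs and window size are made up for illustration and are not ATI's actual policy:

```python
# Toy DRAM arbiter: given a window of outstanding requests, the controller can
# reorder to favor the currently open row. Cycle costs are illustrative only.

ROW_HIT, ROW_MISS = 1, 5   # hypothetical cost of an open-row hit vs. a row switch

def service_cost(requests, window=1):
    """requests: list of DRAM row ids. window=1 degenerates to FIFO order."""
    pending = list(requests)
    open_row = None
    cost = 0
    while pending:
        view = pending[:window]
        # prefer a request to the open row if one is visible in the window
        hit = next((i for i, r in enumerate(view) if r == open_row), None)
        i = hit if hit is not None else 0
        r = pending.pop(i)
        cost += ROW_HIT if r == open_row else ROW_MISS
        open_row = r
    return cost

reqs = [0, 1, 0, 1, 0, 1, 0, 1]   # two clients ping-ponging between rows
print("FIFO cost:", service_cost(reqs, window=1))
print("windowed cost:", service_cost(reqs, window=4))
```

In FIFO order every request pays a row switch; with a deeper window the arbiter batches same-row requests together. This is also why latency-tolerant clients matter: reordering necessarily delays some requests.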

Edit: corrected some terrible grammar.
 
trinibwoy said:
Are you testing on an XL as well?
By the time I get some spare time to do the XL, R600 will be a $5 part on the low end. I need about 256 hours in a day, in both directions. Or something.
 
ERK said:
BTW, my boss would NEVER let me talk about that kind of info. :oops:

Dunno. Nobody's fired me yet. Perhaps tomorrow.

However, our plan is to push to open up the architectures and make things a little more visible, a la CPUs. We've been working on the GPGPU aspect of things, which can be applied to physics as well; and they want open, low-level access. The idea is not to create yet another API for physics, and two for graphics, and another for something else, etc. -- we'd rather have a thin layer (required for cross-generation compatibility) through which lots of things, be they physics or DNA matching or whatever, can run -- to access the parallel computer sitting there. But to make a thin layer like that, you need to open up your architectures a lot more.

Though, some of the GPUBench stuff will tell you all that I've told you, and lots more. It can tell you batch sizes, how you deal with GPRs, ALUs, etc...
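One plausible way a microbenchmark suite can infer batch size is to measure the cost of a branch whose outcome alternates with some spatial period: once the pattern is finer than the batch, every batch pays for both paths. A toy model of that effect (not GPUBench's actual code; all numbers are illustrative):

```python
# Toy illustration of inferring SIMD batch size from branch divergence cost.
# A batch executes both branch paths if its pixels disagree (predicated SIMD).

def shading_cost(num_pixels, batch, period, path_cost=10):
    """Pixels take branch A or B depending on (x // period) % 2."""
    cost = 0
    for start in range(0, num_pixels, batch):
        paths = {(x // period) % 2
                 for x in range(start, min(start + batch, num_pixels))}
        cost += len(paths) * path_cost   # a divergent batch pays for both paths
    return cost

# Cost doubles once the branch pattern becomes finer than the batch width:
for period in (64, 32, 16, 8):
    print("period", period, "-> cost", shading_cost(1024, batch=16, period=period))
```

The period at which the measured cost jumps brackets the hardware's batch size -- the "granularity loss" of the large-batch design in concrete form.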
 
The real difference is in the AF to me. ATi's AF looks MUCH better in those screenies from the [H] review. It's not even close.
 
fallguy said:
The real difference is in the AF to me. ATi's AF looks MUCH better in those screenies from the [H] review. It's not even close.

I don't know if it's *the* real difference, but I'm happy I got one of my key guys (Tony the texture god) to redesign the AF. We listened (did not always agree) to the criticisms, and we wanted to improve that specifically. Personally, I always run with AA & AF, so I like the quality. I'm glad that the driver guys agreed with me, and pushed to expose this to users for launch.
 
sireric said:
I don't know if it's *the* real difference, but I'm happy I got one of my key guys (Tony the texture god) to redesign the AF. We listened (did not always agree) to the criticisms, and we wanted to improve that specifically. Personally, I always run with AA & AF, so I like the quality. I'm glad that the driver guys agreed with me, and pushed to expose this to users for launch.

So the "pure" AF isn't only software -- there were some hardware redesign elements to that vs. R3/4? Much of an "ouchie" transistor-wise, or a "good" that mainly flowed from other elements like the MC and threader? What a cool business card -- "Tony the Texture God". :LOL:

Edit: What I'm after is the hardware elements, if any, behind why there appears to be such a small performance hit for the "rotation independent" AF -- congrats on that, btw. I'd have been happy if y'all had provided opt-free AF with a bigger hit as an option; what came out on that score was better than I'd hoped, even when I thought I was being Pollyanna-ish.
 
sireric said:
Dunno. Nobody's fired me yet. Perhaps tomorrow.

However, our plan is to push to open up the architectures and make things a little more visible, a la CPUs. We've been working on the GPGPU aspect of things, which can be applied to physics as well; and they want open, low-level access. The idea is not to create yet another API for physics, and two for graphics, and another for something else, etc. -- we'd rather have a thin layer (required for cross-generation compatibility) through which lots of things, be they physics or DNA matching or whatever, can run -- to access the parallel computer sitting there. But to make a thin layer like that, you need to open up your architectures a lot more.

Though, some of the GPUBench stuff will tell you all that I've told you, and lots more. It can tell you batch sizes, how you deal with GPRs, ALUs, etc...

We heard that a feature called scatter for random memory access was added, what other features have been added for general purpose processing?
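For readers unfamiliar with the terms: gather reads from computed addresses (which pixel shaders already do via dependent texture fetches), while scatter writes to computed addresses -- what a histogram, sort, or sparse update needs. A minimal sketch of the distinction (illustrative only, not ATI's interface):

```python
# Gather vs. scatter in one screen: the difference is which side of the
# assignment the computed index lands on.

def gather(src, indices):
    """out[k] = src[indices[k]] -- read from computed addresses."""
    return [src[i] for i in indices]

def scatter(dst, indices, values):
    """dst[indices[k]] = values[k] -- write to computed addresses."""
    for i, v in zip(indices, values):
        dst[i] = v
    return dst

data = [10, 20, 30, 40]
print(gather(data, [3, 0, 2]))                 # [40, 10, 30]
print(scatter([0] * 4, [3, 0, 2], [1, 2, 3]))  # [2, 0, 3, 1]
```

Gather maps naturally onto texture units; scatter is the harder one for a GPU because write addresses are no longer fixed by pixel position.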
 
fallguy said:
The real difference is in the AF to me. ATi's AF looks MUCH better in those screenies from the [H] review. It's not even close.
Well, duh, because they focused on off-angle surfaces. It really upsets me that nobody focused on off-angle surfaces back when the Radeon 9700 was released and the GeForce4 Ti cards were still beating the pants off of it in anisotropic filtering quality.
 
Chalnoth said:
Well, duh, because they focused on off-angle surfaces. It really upsets me that nobody focused on off-angle surfaces back when the Radeon 9700 was released and the GeForce4 Ti cards were still beating the pants off of it in anisotropic filtering quality.

Go halvsies on therapy with an anger management specialist? I'll do the first 1/2 hour on NV PR and you can have the second 1/2 hr on the unfairness of the history of AF quality.
 
Isn't the danger of opening up low-level GPU access a reduction of abstraction, and therefore less freedom of GPU implementation technique in the future? Do we really want developers depending on low-level details of GPU implementations that should be subject to change, and will not always be relevant to rendering?


Rather than expose internal GPU workings, I think the better approach is to expose high-level APIs for physics and certain problems in the GPGPU space, and then let the driver do the translation work if it can. But exposing the GPU as a general-purpose computation device, and promoting GPGPU performance in PR, is dangerous, I think. GPGPU performance should be secondary to rendering performance, and should not come at its expense.
 
A small related question: besides the X1000 (R520-based) series of hardware, what other hardware has the programmable memory controller? (I'm sure it probably won't get the same performance boosts the X1000 series will get.)

It's fascinating how this seems to be a key thing ATI continues to work on... efficiency.
 
Deathlike2 said:
A small related question.. besides the X1000 (R520 based) series of hardware, what other hardware has the programmable memory controller? (I'm sure that it probably won't get the same amount of performance boosts the X1000 series will get)

The X800/R420 series has a programmable memory controller, but it's not as programmable or expansive as the X1000 series'.
 
ferro said:
The MC can look into the future? Could you please provide its algorithms to all elevator manufacturing companies? It would be great if an elevator is already there when you press the button.

Thanks in advance.

Sure, if you don't mind stopping your internal ticking the moment you press the button and entering "elevator time", where you "tick" based on the availability of the elevator. And having 511 frozen people, all with a finger on the button, may be a problem too, when you try to press it to enter elevator time.

Now seriously: may I press the question about Linux support asked a few pages ago? IIRC, I read many moons ago that the Linux driver code tree wasn't quite based on the Windows code -- is this right?
 