G80's "Multifunction Interpolator"; Detailed PDF.

Ailuros · Jul 9, 2006

Chalnoth said:
Oh, I'd be willing to bet that nVidia's texture units take up more total die area than ATI's. Remember that nVidia has 24 of 'em.

That wasn't my point; imagine how the transistor count would have risen in G7x if it had kept the same texture samplers as NV2x/3x. In my mind the higher angle dependency since NV4x is a clear transistor-saving design decision. Even more so since the design was set to scale beyond 4 quads.

The dynamic branching optimizations are likely the most costly. It is for this reason that I suspect that nVidia's next part isn't going to be as good as the R5xx at dynamic branching. The memory controller is also likely a big part.

But still, ATI's R580 is sitting at roughly twice the die area for a part with similar performance and featureset when compared to nVidia's G71. I don't buy that you can account for all of this just by the G71's relatively fewer features.

Don't you think that the gap between 160M on R4x0 and 384M on R580 is a tad too wide to attribute mostly to "just" SM3.0 + dynamic branching + memory controller, as an example?

KimB · Jul 9, 2006

Except that optimizing for dynamic branching takes a huge amount of extra die space because much, much more state needs to be saved at any one time. The new memory controller also takes up a significant portion, which I believe is due to much larger caches.

Now, I do think that ATI felt their design decisions were justified. If I'm correct about what's taking up the extra die space, I can think of reasons why those decisions would have been made:

1. ATI designed the R5xx architecture with the idea that they weren't going to be able to make the architecture much wider than the R4xx without running into memory bandwidth limitations, hence remaining at 16 pipelines.
2. ATI had a fairly large transistor budget with the move to 90nm, and they wanted to get as much performance as possible out of the architecture. Given the memory bandwidth limitations, they decided to spend much more die area than usual on the memory controller (this is what I believe results in the R5xx's high performance in FSAA).
3. Similar logic would have gone into the multithreaded nature of the pipelines: improving efficiency would be better than just adding more pipelines, as more pipelines would require more memory bandwidth (note that this is why dynamic branching is so quick, but it should also help with rendering smaller polygons).

In other words, I think that ATI spent a whole lot of transistors in improving efficiency.

_xxx_ · Jul 9, 2006

geo said:
Well, it seems clear that ATI's current designs are less dense (transistors per mm2) than NV's to start with. It's less clear why this is so, tho theories get tossed out from time to time. I don't recall that ATI has ever addressed the point directly, tho possibly you could make some inferences from comments on why the ring bus is around the outer part of the die.

Less density makes higher clockspeeds possible.

_xxx_ · Jul 9, 2006

4. They introduced new costly technologies which will be used in more or less similar form throughout their next 2-3 product iterations, where the cost will be lower through shrinking etc?

DeanoC · Jul 9, 2006

Chalnoth said:
Attribute interpolation is referring to the interpolation of per-vertex attributes to calculate the per-pixel values (such as color, depth, texture coordinates, normal vectors). See slide 9 of the presentation.

It's a major bottleneck on current NVIDIA hardware... which would explain why a new funky unit would be top of their list of improvements for a new architecture...

Jawed · Jul 9, 2006

DeanoC said:
It's a major bottleneck on current NVIDIA hardware... which would explain why a new funky unit would be top of their list of improvements for a new architecture...

Why major? How does that come about?

Is it connected with the setup rate of one triangle every 2 clocks?

Jawed

3dcgi · Jul 9, 2006

Chalnoth said:
Oh, I'd be willing to bet that nVidia's texture units take up more total die area than ATI's. Remember that nVidia has 24 of 'em.

I wonder how much savings Nvidia gets from having one of the ALUs share some texture processing duties.

Ailuros · Jul 9, 2006

Chalnoth said:
Except that optimizing for dynamic branching takes a huge amount of extra die space because much, much more state needs to be saved at any one time. The new memory controller also takes up a significant portion, which I believe is due to much larger caches.

Now, I do think that ATI felt their design decisions were justified. If I'm correct about what's taking up the extra die space, I can think of reasons why those decisions would have been made:

1. ATI designed the R5xx architecture with the idea that they weren't going to be able to make the architecture much wider than the R4xx without running into memory bandwidth limitations, hence remaining at 16 pipelines.
2. ATI had a fairly large transistor budget with the move to 90nm, and they wanted to get as much performance as possible out of the architecture. Given the memory bandwidth limitations, they decided to spend much more die area than usual on the memory controller (this is what I believe results in the R5xx's high performance in FSAA).
3. Similar logic would have gone into the multithreaded nature of the pipelines: improving efficiency would be better than just adding more pipelines, as more pipelines would require more memory bandwidth (note that this is why dynamic branching is so quick, but it should also help with rendering smaller polygons).

No doubt about that; still not enough to account for the roughly 90M transistor difference between R580 and G71.

In other words, I think that ATI spent a whole lot of transistors in improving efficiency.

R4x0 was far from inefficient as an SM2.x design. The transistor budget doubled between that and R520 because of a bundle of factors (SM3.0, the memory controller, less AF angle dependency, etc.), and R580 came with 3 times the ALUs of that. Between R520 and G71 the difference is only around 25M transistors, whereas R580 has more floating point power than it can ever actually use in its lifetime.

Yes, the latter ends up with ~90M more transistors than G71, but then again it also has, in theory, roughly 50% more floating point throughput. Now if you want to be fair, try adding enough ALUs to a hypothetical G71 design to reach a theoretical maximum of 374 GFLOPs, and then we'll see where the transistor count would lie. As I said above, I might consider it redundant right now, but I don't see any wasted transistors in the R5x0 GPUs.

Mintmaster · Jul 9, 2006

Chalnoth said:
Now, I do think that ATI felt their design decisions were justified.

I agree with you, and I also think ATI's decision is better for the evolution of 3D graphics. Unfortunately, the real world doesn't work that way.

ATI is a business. You can't run it on such idealistic goals, especially if the consumer won't recognize them. You need high performance in today's games and maybe also for the near future, but anything beyond that is pointless for your bottom line.

As for why I think it's good for graphics, there is plenty of math and texture power right now. R580 gives you 400 shader cycles and 130 texture lookups per pixel for 1280x1024 @ 60fps. Even accounting for inefficiency, overdraw, offscreen rendering, etc. that's gobs of power. DB will make a lot better use of it.
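Those per-pixel figures can be roughly reproduced with back-of-envelope arithmetic. The unit counts and clock below are assumptions, not numbers from the post: a 650 MHz X1900 XTX-class part with 48 pixel shader ALUs and 16 texture units. `per_pixel_budget` is just an illustrative helper name.

```python
# Back-of-envelope per-pixel budget at a given resolution and frame rate.
# Assumed (not stated in the post): 650 MHz core, 48 pixel shader ALUs,
# 16 texture units.

def per_pixel_budget(units, clock_hz, width, height, fps):
    """Unit-cycles available per screen pixel per frame."""
    pixels_per_second = width * height * fps
    return units * clock_hz / pixels_per_second

CLOCK_HZ = 650e6
shader_cycles = per_pixel_budget(48, CLOCK_HZ, 1280, 1024, 60)
texture_lookups = per_pixel_budget(16, CLOCK_HZ, 1280, 1024, 60)

print(round(shader_cycles), round(texture_lookups))
```

Under those assumptions the result lands just under 400 shader cycles and a little over 130 texture lookups per pixel, which is consistent with the rounded figures quoted above.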

But ATI can't lose sight of what they're in the business for. Fast DB may eventually be worth 30% fewer pipelines, but we certainly weren't there in mid 2005 (R5xx target release date), and we'd be lucky to get there in the middle of next year.

Arun · Jul 9, 2006

Mintmaster said:
and we'd be lucky to get there in the middle of next year.

Or rather, we won't get there at all if NVIDIA doesn't significantly (or rather, tremendously) reduce their pixel shader's branching granularity in the G8x. And I've got a feeling about that...

Uttar

silent_guy · Jul 9, 2006

_xxx_ said:
Less density makes higher clockspeeds possible.

Why???

KimB · Jul 9, 2006

Uttar said:
Or rather, we won't get there at all if NVIDIA doesn't significantly (or rather, tremendously) reduce their pixel shader's branching granularity in the G8x. And I've got a feeling about that...

Uttar

Oh, I think that it'll be reduced pretty significantly. But I'd be willing to bet that they don't want to make the granularity as good as the R5xx. Spending too many transistors on the feature may not make sense when one can just add more functional units.
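To put a rough number on why granularity matters, here is a toy model, not a description of any real chip. It assumes a batch has to run the expensive side of a branch if any pixel in it takes that side, and that each pixel takes it independently with probability `p`. Real scenes are spatially coherent, so this overstates the penalty for big batches. The batch sizes and cycle counts are made up for illustration.

```python
# Expected shader cost per pixel with batch-granular dynamic branching.
# Assumption: pixels take the branch independently with probability p.

def expected_cost(batch_size, p, cheap=10, expensive=100):
    """Average cycles per pixel: `cheap` always runs, `expensive` is the
    extra work paid by the whole batch if any pixel takes the branch."""
    p_all_skip = (1.0 - p) ** batch_size
    return p_all_skip * cheap + (1.0 - p_all_skip) * (cheap + expensive)

for batch in (1, 16, 64, 1024):  # hypothetical batch sizes
    print(batch, round(expected_cost(batch, p=0.01), 1))
```

With only 1% of pixels needing the expensive path, per-pixel granularity pays almost nothing extra, while a very large batch ends up paying for the expensive path nearly every time. That is the trade-off against simply spending the transistors on more functional units.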

KimB · Jul 9, 2006

silent_guy said:
Why???

Less density will make higher clockspeeds possible if (and only if) your primary limitation is heat dissipation.

silent_guy · Jul 9, 2006

Chalnoth said:
Less density will make higher clockspeeds possible if (and only if) your primary limitation is heat dissipation.

That sounds logical at first, but it isn't. These days, the first-order factor determining speed is your wire load (about 65% vs 35% for transistor switching speed). You want to keep this down as much as possible by keeping distances short. The biggest problems in efficient cooling are at thermal barriers, e.g. between the die and the copper etc. In comparison, intra-die thermal gradients are almost negligible.

Anyway, even if reduced density would help thermal characteristics somewhat, you would need to increase the driver strength of the output transistors of the gates driving the longer wires, which would increase power consumption more with a net increase in thermal stress as a result.

Nobody deliberately reduces densities. It's bad for speed, bad for area and bad for yield.
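A crude way to see the speed argument, using the ~65/35 split quoted above. Everything else here is an assumption for illustration: spreading the same netlist over `area_scale` times the area stretches wires by about the square root of that, and wire delay grows as length to the power 1 (ideally repeated wires) or 2 (unrepeated distributed RC). It ignores the extra driver strength and power mentioned above.

```python
import math

# Toy model: relative path delay = gate share + wire share * stretched wire delay.
def relative_path_delay(area_scale, exponent, gate_share=0.35, wire_share=0.65):
    """Delay relative to the dense layout (1.0 means unchanged)."""
    length_scale = math.sqrt(area_scale)  # assumed wire stretch
    return gate_share + wire_share * length_scale ** exponent

for exponent in (1, 2):
    delay = relative_path_delay(2.0, exponent)  # same logic on twice the area
    print(f"exponent {exponent}: delay x{delay:.2f}, clock x{1 / delay:.2f}")
```

Even in the optimistic linear case the achievable clock drops noticeably when the same logic is spread over twice the area, so lower density would have to buy a lot thermally to break even.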

DeanoC · Jul 9, 2006

Jawed said:
Is it connected with the setup rate of one triangle every 2 clocks?

No, it's not really connected; it's a different bottleneck. Much more serious in practice than being triangle setup limited.

The hardware can interpolate a certain number of attributes per clock that have to feed all its vertex shaders. So if your vertices (it's a combination of vertex format and vertex shader) need more attributes interpolating than the interpolator unit can supply, you get a bottleneck.
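As a minimal sketch of that kind of limit (the rates are hypothetical, not figures for any NVIDIA part): whichever of shading and interpolation needs more clocks sets the pace.

```python
import math

def clocks_per_pixel(shader_clocks, attrs_needed, attrs_per_clock):
    """Clocks per pixel when shading and interpolation run side by side;
    the slower of the two is the bottleneck."""
    interp_clocks = math.ceil(attrs_needed / attrs_per_clock)
    return max(shader_clocks, interp_clocks)

# Hypothetical case: a 2-clock shader consuming 8 interpolated attributes,
# with an interpolator that delivers 2 attributes per clock.
print(clocks_per_pixel(shader_clocks=2, attrs_needed=8, attrs_per_clock=2))
```

In that made-up case the shader would be done in 2 clocks but has to wait 4 for its inputs, so trimming the number of interpolated attributes helps more than trimming shader instructions.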

LeStoffer · Jul 9, 2006

Uttar said:
If that isn't interesting, I don't know what is.
And it's written by Stuart Oberman, who was architect for the FPU of AMD's Athlon microprocessor. He's currently Principal Engineer at NVIDIA.

Yeah, very cool find. Maybe the G80 turns out to be much less "boring" than I originally thought it would be.

Snip from their conclusion:

Supports high-order function evaluation on an attribute interpolation datapath for only about a 20% increase in area.
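For anyone wondering what function evaluation on such a datapath might look like, one common approach is a small coefficient table plus a quadratic, which needs the same kind of multiply-add hardware as evaluating an interpolation equation. The sketch below is my own illustration of that general technique for 1/x, not the design from the PDF: the segment count is arbitrary and the coefficients are plain Taylor terms at each segment midpoint, where real hardware would use tuned fixed-point values.

```python
# Table-driven quadratic approximation of 1/x on [1, 2).
# Assumptions: 64 segments, Taylor coefficients at segment midpoints.

SEGMENTS = 64

def build_table(segments=SEGMENTS):
    table = []
    for i in range(segments):
        m = 1.0 + (i + 0.5) / segments  # segment midpoint
        table.append((1.0 / m, -1.0 / m**2, 1.0 / m**3))  # c0, c1, c2
    return table

TABLE = build_table()

def approx_reciprocal(x, table=TABLE):
    """Approximate 1/x for 1 <= x < 2 with one lookup and a quadratic."""
    segments = len(table)
    i = int((x - 1.0) * segments)          # upper bits select the segment
    dx = x - (1.0 + (i + 0.5) / segments)  # remaining bits: offset from midpoint
    c0, c1, c2 = table[i]
    return c0 + dx * (c1 + dx * c2)        # c0 + c1*dx + c2*dx^2

# Worst-case error over a dense sweep of the input range.
max_err = max(abs(approx_reciprocal(1 + k / 4096) - 1 / (1 + k / 4096))
              for k in range(4096))
print(max_err)
```

Even this naive version stays within about one part in a million across the range, which gives a feel for why a modest area increase could buy a whole family of such functions.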

Jawed · Jul 9, 2006

DeanoC said:
The hardware can interpolate a certain number of attributes per clock that have to feed all its vertex shaders.

Isn't attribute interpolation performed on the output of the vertex shader?

Slide 9 seems to indicate this: "Plane equations generated by dedicated HW between vertex and pixel shaders".
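For reference, a minimal sketch of what a plane equation for one attribute looks like: set up once per triangle from the three vertex outputs, then evaluated per pixel. This is the textbook screen-space form, without perspective correction, and the function names are mine rather than anything from the slides.

```python
def plane_from_triangle(p0, p1, p2, a0, a1, a2):
    """Return (A, B, C) such that attr(x, y) = A*x + B*y + C passes through
    attribute values a0, a1, a2 at screen positions p0, p1, p2."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    if det == 0:
        raise ValueError("degenerate triangle")
    A = ((a1 - a0) * (y2 - y0) - (a2 - a0) * (y1 - y0)) / det
    B = ((a2 - a0) * (x1 - x0) - (a1 - a0) * (x2 - x0)) / det
    C = a0 - A * x0 - B * y0
    return A, B, C

def evaluate(plane, x, y):
    """Per-pixel work: two multiplies and two adds per attribute."""
    A, B, C = plane
    return A * x + B * y + C

# Per-triangle setup from vertex shader outputs, then per-pixel evaluation.
plane = plane_from_triangle((0, 0), (4, 0), (0, 4), 0.0, 8.0, 4.0)
print(plane, evaluate(plane, 1, 1))
```

The setup step only needs the vertex shader outputs, which fits the slide's wording about it sitting between the vertex and pixel stages.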

Jawed

Jawed · Jul 10, 2006

OK, what's intriguing me now is where does this "attribute interpolation program" run?

It seems to me that NVidia is converting attribute interpolation from a fixed function block of hardware twixt VS and PS (or GS and PS in D3D10 GPUs) into a program that's issued.

I dare say this program will run in the GS pipeline, "once per primitive". Obviously if GS amplifies (or deletes) primitives then the program is only executed for the primitives that GS will output.

The other intriguing thing is that this program has an implied loop, over the count of attributes output by the vertex shader. I suppose it's also worth noting that this means regardless of whether the programmer specifies a GS program or not, the "attribute interpolation" GS program is executed.

This does imply that attribute interpolation occurs before culling/clipping though - whereas I guess in earlier GPUs interpolation happens after culling/clipping.

Jawed

Bob · Jul 10, 2006

Jawed said:
This does imply that attribute interpolation occurs before culling/clipping though - whereas I guess in earlier GPUs interpolation happens after culling/clipping.

It would be silly to do work that you know is just going to be thrown away.

_xxx_ · Jul 10, 2006

silent_guy said:
That sounds logical at first, but it isn't. These days, the first-order factor determining speed is your wire load (about 65% vs 35% for transistor switching speed). You want to keep this down as much as possible by keeping distances short. The biggest problems in efficient cooling are at thermal barriers, e.g. between the die and the copper etc. In comparison, intra-die thermal gradients are almost negligible.

Anyway, even if reduced density would help thermal characteristics somewhat, you would need to increase the driver strength of the output transistors of the gates driving the longer wires, which would increase power consumption more with a net increase in thermal stress as a result.

Nobody deliberately reduces densities. It's bad for speed, bad for area and bad for yield.

But since your wires need to conduct rather high current nowadays, you'll settle for a compromise which will allow you to have reasonably short lines which can allow for a higher load. Higher load means more heat as well, so having a bigger die area will also allow for more effective cooling. Brute force approach.

Having higher density would have forced R580 to stay below 500MHz I guess.
