G80's "Multifunction Interpolator"; Detailed PDF.

Jawed said:
Isn't attribute interpolation performed on the output of the vertex shader?

Slide 9 seems to indicate this: "Plane equations generated by dedicated HW between vertex and pixel shaders".

Jawed
Yep, sorry, I had my head deep in SPU DMA when I wrote that, so I didn't think about it very clearly; the PDF was talking about attributes coming out of the vertex shader into the pixel shader. Pretty much everything I wrote still applies (except it's not strictly the vertex format but the vertex->pixel format, something that's not usually expressed in modern GPUs but is still there... it was much more common to think about with software TnL).

There is also another limit on attributes going from the vertex cache into the vertex shader, but it's less serious IIRC (sorry, my GPU head isn't on at the moment)
 
Jawed said:
Isn't attribute interpolation performed on the output of the vertex shader?

Jawed

I am not sure how ATI does it, but nVidia does it this way:

For each element that needs to be interpolated, the triangle setup calculates a plane equation (ax + by + cz + dw = 0).

In the next step this is reduced to a form that lets you calculate the right value for every position (I = xA + yB + C). A, B and C are stored in an on-chip memory block that the pixel shader can access.

Every quad batch the rasterizer generates gets a link to the right position in this block.

Every time the pixel shader later needs an interpolated value, it uses this link to load the right A, B and C values and interpolates the values for the 4 pixels of a quad on the fly.

The advantage of this is that you save a huge amount of on-chip memory, because you only have to store 3 values per element and triangle instead of one value per element and pixel. The disadvantage is that you can only use as many interpolated values per clock as your pipeline has interpolation units for. AFAIK the NV4x/G7x chips contain 4 interpolators per pixel. This means that if you have two colors in your vertex data and want to add them together in the pixel shader, you will need two clocks, because you cannot interpolate them both in one clock. There is an exception to this rule, because it looks like the same unit that does the FP16 normalization can do an FP16 color interpolation too.
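
(To make that concrete, here is a minimal C sketch of the scheme described above; all names are illustrative and perspective correction is ignored. Triangle setup solves for A, B and C once per attribute and triangle, and the interpolator then evaluates I = Ax + By + C on demand for the four pixels of a quad.)

Code:
/* Sketch of plane-equation based attribute interpolation.
 * Illustrative only, not actual hardware or driver code. */
typedef struct { float A, B, C; } PlaneEq;

/* Triangle setup: derive A, B, C for one attribute from its three vertex
 * values v0..v2 at screen positions (x0,y0)..(x2,y2), so that
 * v_i = A*x_i + B*y_i + C holds at each vertex. */
PlaneEq setup_plane(float x0, float y0, float v0,
                    float x1, float y1, float v1,
                    float x2, float y2, float v2)
{
    float det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0);
    PlaneEq p;
    p.A = ((v1 - v0) * (y2 - y0) - (v2 - v0) * (y1 - y0)) / det;
    p.B = ((x1 - x0) * (v2 - v0) - (x2 - x0) * (v1 - v0)) / det;
    p.C = v0 - p.A * x0 - p.B * y0;
    return p;
}

/* Per-quad interpolation: only A, B and C are stored per element and
 * triangle; the four pixel values are produced on the fly from the
 * quad's top-left screen position (qx, qy). */
void interpolate_quad(const PlaneEq *p, float qx, float qy, float out[4])
{
    out[0] = p->A *  qx        + p->B *  qy        + p->C;  /* (x,   y)   */
    out[1] = p->A * (qx + 1.f) + p->B *  qy        + p->C;  /* (x+1, y)   */
    out[2] = p->A *  qx        + p->B * (qy + 1.f) + p->C;  /* (x,   y+1) */
    out[3] = p->A * (qx + 1.f) + p->B * (qy + 1.f) + p->C;  /* (x+1, y+1) */
}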
 
Bob said:
It would be silly to do work that you know is just going to be thrown away.
After I went to bed I realised that this needs to be done in screen space, so it must be done after viewport transform, i.e. after culling/clipping.

Anyway, the presentation is about the per-fragment interpolation, so this question of where the plane equation is generated is beside the point as far as I can tell. Somewhere before rasterisation.

---

Per-fragment interpolation would seem to be executed as a "fixed function" shader preamble, per fragment. It would need to loop over the full set of attributes generated per primitive. I suppose this means that the command processor will generate an "unrolled-loop" that iterates through the set of attributes and sets the destination registers for each interpolation to be the constant registers per fragment.
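
(For illustration only, a rough C sketch of what such an unrolled preamble might amount to; attr_count, plane[], regs[] and the PlaneEq struct from the sketch earlier in the thread are invented names, not anything documented for real hardware.)

Code:
/* Hypothetical per-fragment interpolation preamble (speculative sketch,
 * reusing the PlaneEq struct from the earlier sketch). In hardware this
 * loop would be unrolled: one fixed-function interpolation op per
 * attribute, each writing the register the pixel shader will read. */
void interpolation_preamble(const PlaneEq *plane, int attr_count,
                            float x, float y, float *regs)
{
    for (int i = 0; i < attr_count; ++i)
        regs[i] = plane[i].A * x + plane[i].B * y + plane[i].C;
}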

Jawed
 
Demirug said:
Every time the pixel shader later needs an interpolated value, it uses this link to load the right A, B and C values and interpolates the values for the 4 pixels of a quad on the fly.

The advantage of this is that you save a huge amount of on-chip memory, because you only have to store 3 values per element and triangle instead of one value per element and pixel. The disadvantage is that you can only use as many interpolated values per clock as your pipeline has interpolation units for. AFAIK the NV4x/G7x chips contain 4 interpolators per pixel. This means that if you have two colors in your vertex data and want to add them together in the pixel shader, you will need two clocks, because you cannot interpolate them both in one clock. There is an exception to this rule, because it looks like the same unit that does the FP16 normalization can do an FP16 color interpolation too.
I have to say I'm pretty dubious about on-the-fly interpolation, just because of the extra clock cycles it will cost. Also, any time you have to perform an interpolation, you're losing special function co-issue.

But it's an intriguing concept...

Jawed
 
Thanks for that thread, krychek.

In Xenos:

[image: Xenos block diagram]

we can see the Shader Pipe Interpolators as a fixed function block, fed by the plane equation and VS attributes and "pixel interpolation control" from the Sequencer.

Does that give a clue about how ATI does this in other GPUs? Does the transmission of barycentrics imply on-the-fly interpolation? It does seem like it, doesn't it?

I know that doesn't answer the question about how NVidia hardware does this, but I suppose it's a useful comparison point.

Simon had reservations:

http://www.beyond3d.com/forum/showpost.php?p=35172&postcount=26

and ultimately I don't see how DeanoC's point about the interpolation bottleneck is ameliorated by the Multifunction Interpolator. It might be new and funky, saving transistors - but is it faster?

Unless the implication is that older designs were slow because, the interpolators being dedicated, NVidia decided to minimise the transistors spent on them :?:

Jawed
 
Jawed said:
and ultimately I don't see how DeanoC's point about the interpolation bottleneck is ameliorated by the Multifunction Interpolator. It might be new and funky, saving transistors - but is it faster?
Consider that instead of having one interpolator unit and one special-function unit, you'd just have two units that either system could use, and this for roughly the same transistor budget. If that doesn't improve performance, I don't know what else could...

Uttar
 
Jawed said:
and ultimately I don't see how DeanoC's point about the interpolation bottleneck is ameliorated by the Multifunction Interpolator. It might be new and funky, saving transistors - but is it faster?

I was wondering the same thing.

Uttar said:
Consider that instead of having one interpolator unit and one special-function unit, you'd just have two units that either system could use, and this for roughly the same transistor budget. If that doesn't improve performance, I don't know what else could...

And that's a very good explanation for it. Although it seems that a side (or primary?) effect of this combined logic may be a new balance between complexity and speed. Maybe they traded performance for lower complexity in earlier designs, so there could be performance gains there as well. See section 2 of the second PDF linked in the first post.
 
Uttar said:
Consider that instead of having one interpolator unit and one special-function unit, you'd just have two units that either system could use, and this for roughly the same transistor budget. If that doesn't improve performance, I don't know what else could...
That seems reasonable. I think G7x's shader units are asymmetric in terms of special functions, e.g. SU0 can do RCP and SU1 can do SIN/COS.

So, the question is, will G80 be superscalar...

---

There's an interesting comment on slide 24:

Multi-bit fractions provide large grid to be used for multi-sampling based antialiasing

Do the 5-bit coordinate offsets provide, in effect, support for a 32x32 grid (2^5 = 32 positions per axis) for sparse samples? Where's that drool smiley...

Jawed
 
_xxx_ said:
But since your wires need to conduct rather high currents nowadays, you'll settle for a compromise which allows for reasonably short lines that can take a higher load. A higher load means more heat as well, so a bigger die area will also allow for more effective cooling. Brute force approach.

(Clearly not on topic, but anyway... ;))

None of that is true.
The relevant wires (that is, interconnections between transistors) are carrying less and less current as processes shrink. This is just Ohm's Law: I = V/R. V is going down and R increases as feature size decreases. It is true that the aggregate current increases, but this is due to the quadratic increase in the number of transistors, which is a power grid issue and irrelevant to this discussion as it doesn't impact power density. (In fact, a larger die will increase the IR drop on the grid, another reason to keep your die small.)

For obvious reasons, GPU producers always want to get the maximum possible speed. You suggest that decreasing density can increase this speed. That can only be true if:
1. you have absolutely no other reasonable way to improve your cooling
and
2. you have a lot of timing margin left in your design.

1. is currently not the case, as is clearly demonstrated by tons of after-market cooling kits that are better than the default ones.
2. contradicts the fact that the GPU should run at the maximum possible speed.

If you don't have timing margin (which is how all chips are designed), then reducing density will either reduce your maximum speed or increase power density even more (which is exactly what you are trying to avoid) or both.

Your propagation timing (65% of total timing) is ~RC. R ~ Lwire and C ~ Lwire, so timing ~Lwire^2. By increasing Lwire by 10% to decrease power density, timing will deteriorate quadratically. The only way to counteract this is to upsize your transistors accordingly which will increase power density more than the decrease in density you were trying to obtain in the first place.
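
(Spelling out the quadratic claim as a rough back-of-the-envelope relation, ignoring constants and gate delay:

$$ t_{\text{prop}} \propto R\,C, \qquad R \propto L_{\text{wire}}, \qquad C \propto L_{\text{wire}} \;\Longrightarrow\; t_{\text{prop}} \propto L_{\text{wire}}^{2}, $$

so a 10% longer wire, $L_{\text{wire}} \to 1.1\,L_{\text{wire}}$, gives $t_{\text{prop}} \to 1.21\,t_{\text{prop}}$, i.e. roughly 21% more propagation delay.)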

Cooling is an increasingly annoying problem, but it's not at the point where engineers are hitting a brick wall that's unsolvable without mounting gigantic dust busters.
As long as this is the case, there's no reason to reduce maximal clock speed for power reasons.

_xxx_ said:
Having higher density would have forced R580 to stay below 500MHz I guess.

I'm pretty sure your guess is wrong.
 
You have dR/df in there, as you most probably know. And the losses increase with frequency, hence it isn't just Ohm's law there. (EDIT: duh, just saw you mentioned it already :oops:)

Though I'll accept being proven wrong about the density, I was actually pretty sure that I was right :)
 
The density factor has to do mostly with economics. The denser the design, the more dies per wafer, the higher the profit. The density of a design is generally bounded by the congestion of wires at particular points. Thus some parts of the design are less (or more) dense than others.

I imagine that the motivation for combining functions is not to increase performance, but to reduce the area.
 
silent_guy said:
(Clearly not on topic, but anyway... ;))

None of that is true.
The relevant wires (that is, interconnections between transistors) are carrying less and less current as processes shrink. This is just Ohm's Law: I = V/R. V is going down and R increases as feature size decreases. It is true that the aggregate current increases, but this is due to the quadratic increase in the number of transistors, which is a power grid issue and irrelevant to this discussion as it doesn't impact power density. (In fact, a larger die will increase the IR drop on the grid, another reason to keep your die small.)
Is IR drop in the die's power grid at all a relevant problem with flip-chip packages :?:
Your propagation timing (65% of total timing) is ~RC. R ~ Lwire and C ~ Lwire, so timing ~Lwire^2. By increasing Lwire by 10% to decrease power density, timing will deteriorate quadratically. The only way to counteract this is to upsize your transistors accordingly which will increase power density more than the decrease in density you were trying to obtain in the first place.
If you reduce circuit density, then you will end up increasing the average spacing between adjacent wires; this will decrease the amount of capacitance per length unit (which should be pretty obvious from how capacitance works in the first place). With increased spacing, you can also afford to make the wires themselves correspondingly thicker, reducing resistance per length unit as well. As such, I don't see why either wire resistance or wire capacitance would be changed with reduced density.
 
arjan de lumens said:
If you reduce circuit density, then you will end up increasing the average spacing between adjacent wires; this will decrease the amount of capacitance per length unit (which should be pretty obvious from how capacitance works in the first place)

It's obvious that you made a typo, C increases with less distance :)

C = ε₀ · ε_r · A / d
 
Chalnoth said:
One has to wonder if ATI's putting as much focus in saving die area for their next-gen architecture.

Intrinsity Press Release (February 5) said:
"We're combining ATI's pioneering leadership in consumer technologies with Intrinsity's proven chip-design technology to create innovative products with stunning levels of visualization and integration," said Bob Feldstein, Vice President of Engineering, ATI Technologies, Inc. "We selected Intrinsity after determining that Fast14 Technology can deliver up to four times the performance per silicon dollar when compared with standard design approaches."

I don't think they are thinking about it at all. :)
 
_xxx_ said:
It's obvious that you made a typo, C increases with less distance :)
I thought that I was saying just that? The "length unit" I referred to was for the length of the wire itself rather than the spacing between the wires (although I do see that I didn't write it all that clearly).
 
Chalnoth said:
Some very cool stuff. It looks like nVidia is working very hard to decrease the die area of their designs. I wouldn't be surprised if the NV40 was actually the first architecture where they really started trying hard to decrease die area (as a quick example, the NV30 and the NV43 have about the same die area, but the NV43 has four times the FP processing power and the same texture filtering power). This seems to have also contributed to the G7x's favorable showing when compared against the R5xx.

One has to wonder if ATI's putting as much focus in saving die area for their next-gen architecture.

Don't forget also that ATI has a memory controller that can do virtual memory and Nvidia has yet to put that in. This is a huge hog of transistors for the current design, but I bet Microsoft appreciates having this in hardware for DX10 testing. :cool:
 
jpm said:
The density factor has to do mostly with economics. The denser the design, the more dies per wafer, the higher the profit. The density of a design is generally bounded by the congestion of wires at particular points. Thus some parts of the design are less (or more) dense than others.

I imagine that the motivation for combining functions is not to increase performance, but to reduce the area.
But reducing the area of some components of a design allows one to increase the number of components for the same area budget. So it can result in increased performance. Either way, though, one real benefit is reduced power consumption (since even inactive transistors consume power in clocked designs, better use of transistors leads to higher performance for the same power).
 
arjan de lumens said:
Is IR drop in the die's power grid at all a relevant problem with flip-chip packages :?:

Flip chip definitely helps, but lower voltages increase the problem, so it's still something you have to take into account. I know we still verify it with VoltageStorm before tape-out, whereas we didn't do this for larger processes.

arjan de lumens said:
If you reduce circuit density, then you will end up increasing the average spacing between adjacent wires; this will decrease the amount of capacitance per length unit (which should be pretty obvious from how capacitance works in the first place).

Decreased density lowers the capacitance per unit length between 2 signal wires. However, if you increase the length, you will counteract this effect. Since the signal-wire-to-signal-wire capacitance is only one part of many stray capacitances that don't change with decreased density (wire to poly, wire to power grid, etc.), the net effect of decreasing density is still an increase in load.
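
(Roughly, as a sketch of that argument, with $c_{\text{coupling}}(s)$ the spacing-dependent coupling term and $c_{\text{fixed}}$ the spacing-independent strays; both are just illustrative labels:

$$ C_{\text{total}} \approx L_{\text{wire}} \left( c_{\text{coupling}}(s) + c_{\text{fixed}} \right) $$

Increasing the spacing $s$ shrinks only the first term, while the lower density also increases $L_{\text{wire}}$, so the total load can still grow.)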

arjan de lumens said:
With increased spacing, you can also afford to make the wires themselves correspondingly thicker, reducing resistance per length unit as well. As such, I don't see why either wire resistance or wire capacitance would be changed with reduced density.
True, but increasing the width of the wires would also make the capacitance higher again ;), so it's either one or the other.

I have never really seen this done for the regular interconnect (even on some seriously power-consuming router chips). When you lay out a chip for a particular process, the fab will provide you with a file that specifies in detail the design rules to be followed for each layer. Those design rules are fed into the P&R tool and usually strictly followed. One thing that is somewhat common is to treat specific nets (e.g. clocks or async resets) differently for signal integrity reasons by using double empty spacing around them (to prevent LC coupling).

It's probably true that an increase of x will increase your propagation time by less than x^2. But from there it's still a big step to claim that you'll end up with a faster chip overall.
 