The Official NVIDIA G80 Architecture Thread

There's a fair number of niggly errors, definitely v1.0-itis, but still very readable.

Jawed
 
Anyone noted these quite moderate occlusion results:

[Four attached screenshots of HSR benchmark results (d3drmhsr1-4); the images are no longer available.]
 
Those numbers (I'm guessing which you mean, since they aren't actually showing) on their own don't seem to mean much as far as I can tell.

I do wonder, though, if this might be why G7x is so poor at Oblivion's foliage?

Jawed
 
I think that's the case. I also believe it can rasterize four 2x2 quads coming from different primitives not sharing any edge between them; you really don't want to run at 25% efficiency if your triangles are small :)

Err, isn't it 8 quads to a batch?

Can it reorder batch elements on the fly? I believe Chalnoth threw that out, and with 2x16 it ought to be *possible* (though complex) to swap a single 16 section around. I suppose the test for that is fairly simple -- two consecutive branches corresponding to, say, even 16x2 regions and odd 16x2 regions, respectively. I wouldn't expect it to manage that, but it would be cute if it did.

Is the 32 batch size dictated by 8 texture units x 4 color components? That would seem to imply vector instructions issued wide, rather than serialized, which seems incredibly unlikely to me. Or is there something important about keeping a single texture unit across a whole quad? /me lost :(
 
HSR speed can be determined quite accurately from Humus' GL_EXT_reme benchmark. Just subtract the render times (i.e. 1/fps) between the 8x overdraw and 3x overdraw (front to back) and multiply by the core clock. Then divide 5 times the screen res by that number and you get the rejection rate per clock.

I've found it works really well since R300, maybe even earlier.
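For anyone who wants to plug numbers in, here's that arithmetic as a quick sketch (the fps figures, core clock and resolution below are made up, purely to show the computation):

```python
# Hypothetical example numbers for the rejection-rate estimate above.
core_clock_hz = 650e6          # core clock (assumed, ~650 MHz)
screen_pixels = 1280 * 1024    # screen resolution
fps_3x = 400.0                 # front-to-back, 3x overdraw (hypothetical)
fps_8x = 250.0                 # 8x overdraw (hypothetical)

# Extra render time per frame spent on the 5 additional, occluded layers.
extra_time = 1.0 / fps_8x - 1.0 / fps_3x
# Convert that to core clocks, then divide the 5 layers' worth of pixels by it.
extra_clocks = extra_time * core_clock_hz
rejection_rate = 5 * screen_pixels / extra_clocks
print(f"~{rejection_rate:.1f} pixels rejected per clock")
```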
 
I looked further into the papers related to the interpolator/SF unit by Stuart Oberman et al., and I was partially wrong, but so was psurge, from what I can see :) It's really 4 units per cluster, each capable of interpolating an attribute for one quad of pixels per clock, or of performing a three-hybrid-iteration minimax quadratic approximation for one value for one pixel per clock.

So our values were definitely right, and all the units are shared between interpolation and SF; but the hardware internally works on quads, and does some really smart tricks to reuse that for the three-hybrid-iteration approximation for SF. Clearly, there's some hardware sitting idle in either case (SF or attribute interpolation), but overall it still looks like a very efficient tradeoff.

As for HSR and HierZ, I got a few ideas on how that works internally, but I still need some testing time.


Uttar
 
Unfortunately, we may have to wait for some time. (why doesn't gpubench ps3.0 dynamic branching test work on R5xx cards? If it's an issue with nv_fragment_program support, why not use glsl?)

Because branching doesn't work like we want under GLSL with ATI... They try to unroll loops and predicate branches in all public drivers. We have DX versions of the test, they used to behave, but there was a subtle change in the DX spec that is breaking the test for *both* vendors, so we need to figure out how to fix that...
 
Isn’t this the logical result of having scalar units?
Yeah - but I'm not convinced Uttar is thinking through the rates correctly.

There are 16 attribute interpolation "pipes", but only 4 SF pipes per cluster. 4 interpolation pipes are ganged per SF pipe.

Jawed
 
(This was posted as a separate thread, but should really belong here, sorry for that..)

The application I am working on performs huge amounts of volume texture lookups.
Think millions of on-the-fly volume gradients, each requiring up to 7 volume texture lookups, each with trilinear interpolation.

If you look up so many gradients, why not precompute them into a second volume texture?
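As a rough sketch of what that precompute might look like (pure Python, central differences with edge clamping; the layout and helper names here are mine, not from the application in question):

```python
# Precompute central-difference gradients of a scalar volume into a second
# volume, so the shader does one lookup per gradient instead of ~6-7.
def precompute_gradients(vol, nx, ny, nz):
    """vol is a flat list of scalars, indexed as vol[x + nx*(y + ny*z)]."""
    def v(x, y, z):
        # Clamp to the border, like CLAMP_TO_EDGE texture addressing.
        x = min(max(x, 0), nx - 1)
        y = min(max(y, 0), ny - 1)
        z = min(max(z, 0), nz - 1)
        return vol[x + nx * (y + ny * z)]

    grads = []
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                grads.append((
                    (v(x + 1, y, z) - v(x - 1, y, z)) * 0.5,
                    (v(x, y + 1, z) - v(x, y - 1, z)) * 0.5,
                    (v(x, y, z + 1) - v(x, y, z - 1)) * 0.5,
                ))
    return grads
```

The obvious cost is memory: the gradient volume is three components wide, but it trades seven trilinear fetches for one.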
 
I wasted the last 3+ hours digging through the Stuart Oberman papers, so I'm pretty confident I got them right now, actually... :)

You have 4 units per cluster. Each of them can interpolate 1 scalar attribute per clock for one quad (->4 scalar attribute interpolations going on there, or 16 total for the cluster) OR apply a special function to one scalar value for one pixel, per clock.

Since the local scheduler's minimum granularity is 16 objects, I would assume that it sees those four "units" as a single entity, and a SF on 16 objects would take 4 cycles in its eyes, while a scalar interpolation of one attribute for 16 objects would take only one cycle.
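To put numbers on that reading (a toy model of my interpretation above, not confirmed hardware behaviour):

```python
# Assumed figures from the discussion: 4 interpolator/SF units per cluster,
# scheduler granularity of 16 objects, quads of 4 pixels.
UNITS_PER_CLUSTER = 4
BATCH = 16

# Each unit interpolates one scalar attribute for a whole quad per clock,
# so one attribute for a 16-object batch takes a single cycle...
interp_cycles = BATCH // (UNITS_PER_CLUSTER * 4)   # -> 1
# ...while SF produces one scalar result per unit per clock, so the same
# batch needs four cycles.
sf_cycles = BATCH // UNITS_PER_CLUSTER             # -> 4
print(interp_cycles, sf_cycles)
```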


Uttar
 
One question:
Pixel shaders work in quads, but vertex shaders don't. Does this mean that the "special functions" will be 4 times more expensive when used in vertex shader than in pixel shader?
 
I wasted the last 3+ hours digging through the Stuart Oberman papers, so I'm pretty confident I got them right now, actually... :)
I invested the time a few weeks back ;)

You have 4 units per cluster. Each of them can interpolate 1 scalar attribute per clock for one quad (->4 scalar attribute interpolations going on there, or 16 total for the cluster) OR apply a special function to one scalar value for one pixel, per clock.
Agreed. Actually, to be fair, it's Rys's diagram:

http://www.beyond3d.com/reviews/nvidia/g80-arch/image.php?img=images/g80-diag-full.png

which I think may be confusing things, at "10". The SF and interpolation rate is the same (i.e. the pipeline is equal in length for both), but it's the 3 auxiliary paths dedicated to interpolation that result in the 4x multiplier, rather than SFs taking 4x longer to compute.

It's clearer to think of 4 quad dedicated interpolation pipes, with each quad dependent upon a single SF pipe.

Since the local scheduler's minimum granularity is 16 objects, I would assume that it sees those four "units" as a single entity, and a SF on 16 objects would take 4 cycles in its eyes, while a scalar interpolation of one attribute for 16 objects would take only one cycle.
Yeah.

Jawed
 
Tell me about it. Just something else: if the G80 was planned for last year, what else does nV have planned?
Vista's release date has been pushed way, way back. ATI and NVidia should both have been planning on releasing something a long while back. What they've each done with the extra time, who knows?...

It's also worth pondering how recently stuff was last added into (or more likely) taken out of D3D10.

Jawed
 
True, but if ATi is 4 months behind right now, I don't think we have to look too far to see where they were along the way, if nV's problem was due to chip size. Which in all honesty doesn't really make much sense to me, since chip size would mean yields of the G71 weren't as good as nV was originally saying...

But that doesn't make much sense either, because their net profits are through the roof even with the increased recall numbers.
 
Jawed, I'm not sure where you got that impression, but it looks horribly wrong to me.
As reference, I'll take slide 33 of this presentation: http://rnc7.loria.fr/oberman_invited.pdf (which is a really nice presentation btw, fwiw, nice overall architecture/algorithms summary.)
Slide 31 is also of some relevance. And if you actually want the original paper, it's all there still: http://66.102.9.104/search?q=cache:...ction+arith17&hl=en&ct=clnk&cd=1&client=opera
And there's also the original Stuart Oberman paper on enhanced minimax quadratic approximation, if the problem is that you aren't sure how the algorithm works internally. It's basically the same for the multifunction unit, just with some smart sharing.

What you don't seem to be realizing is that what you call the "auxiliary" paths are actually (partially) used for SF. But SF needs to do 3 iterations (strictly speaking, not three plain iterations; the second one is quite specific, among other details, which is why the paper calls them "three hybrid passes" - see page 3 of the paper if you feel like wasting some time...), so it makes use of two of the "auxiliary" paths for this. Considering there is at least one other "auxiliary" path available, it's easy to see that the algorithm could be extended to 4 iterations if higher precision were ever needed, although the only use of that would be GPGPU, imo. None of the papers even allude to that possibility.
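For anyone who doesn't want to dig through the paper, here's a toy of the general table-based piecewise-quadratic idea. Note it uses a simple Taylor fit per segment rather than the paper's enhanced minimax coefficients (and none of the hardware's operand sharing), so it only illustrates the lookup-plus-quadratic-evaluation structure:

```python
# Toy piecewise-quadratic function approximation over [0, 1):
# index a coefficient table with the top bits of x, then evaluate
# c0 + c1*h + c2*h^2 on the remaining fraction h.
import math

SEG_BITS = 6                 # 2**6 = 64 segments over [0, 1)
SEGS = 1 << SEG_BITS

def build_table(f, d1, d2):
    """One (c0, c1, c2) triple per segment, fitted at the segment midpoint."""
    table = []
    for i in range(SEGS):
        m = (i + 0.5) / SEGS
        table.append((f(m), d1(m), d2(m) * 0.5))
    return table

def approx(table, x):
    """Evaluate c0 + c1*h + c2*h^2, with h the offset from the midpoint."""
    i = min(int(x * SEGS), SEGS - 1)
    h = x - (i + 0.5) / SEGS
    c0, c1, c2 = table[i]
    return c0 + h * (c1 + h * c2)

# Example: sin over [0, pi/2), remapped so the argument lies in [0, 1).
tab = build_table(lambda x: math.sin(x * math.pi / 2),
                  lambda x: (math.pi / 2) * math.cos(x * math.pi / 2),
                  lambda x: -(math.pi / 2) ** 2 * math.sin(x * math.pi / 2))
err = max(abs(approx(tab, i / 1000.0) - math.sin(i / 1000.0 * math.pi / 2))
          for i in range(1000))
print(err)   # small: the cubic remainder term dominates
```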

Rys' diagram, just like mine, assumes 16 units that need 4 clocks for SF and 1 for interpolation, but clearly it's 4 units (or one bigger unit, from the scheduler's POV, as I said above) doing one pixel quad of interpolation or one SF per clock. At least as far as I can see, of course.


Uttar
P.S.: I'm now 99% sure that the MUL is doing Special Function setup (to put the values in range). The patents also clearly hint at the MUL functionality of the multipurpose ALU being put to use for that. Finally, I think that, except for CUDA, it makes sense to only expose the MUL when you're doing SF, as the MUL would be idling 3/4 of the time when it has to set up a sincos etc., since the SF couldn't keep up. It'd be interesting if they could expose it more generally in the future though, especially in the VS - hmm.
 