The Official NVIDIA G80 Architecture Thread

I find it interesting how, in retrospect, you can see that he actually managed to sidestep the question (in that specific interview) by criticising aspects of Xenos that they were going to handle differently in G80... ;)
Since Xenos seemingly has 3 register files, I'm confused by the distinction you're alluding to.

You've still got the extra cost of the global scheduler/dispatcher/whatever though, and I can't really imagine how expensive that would be. I'd guess "a bit, but not dramatically so", but I really don't know - does anyone have an opinion on this?
Until we see evidence of the load-balancing actually dynamically operating, I don't see how we can talk about G80's scheduling expense.

I haven't read the architecture article closely enough, though, so perhaps you've written code that explicitly explores dynamic load-balancing (VS versus PS) in G80 :oops: - tomorrow I'll have a proper go...

Jawed
 
Since Xenos seemingly has 3 register files, I'm confused by the distinction you're alluding to.
He's talking about the schedulers. How many schedulers are there on Xenos?
Until we see evidence of the load-balancing actually dynamically operating, I don't see how we can talk about G80's scheduling expense.
It seems to me that some tests out there already confirm this.
 
D. Kirk: In the logical diagram of D3D10, the Vertex Shader, Geometry Shader and Pixel Shader are placed side by side. What happens if they are placed in the same box? Each shader is a different part; if they get unified, they become wasteful.

Besides, it requires more I/O (wires), because all the connections with memory are concentrated on that one box. Registers and constants get put in a single box too, because you have to keep all the vertex, pixel and geometry state together while doing load balancing. A bigger register array requires more ports.
Seems to me he's talking about the register file, the "fact" it's shared, the "fact" it's over-sized and the "fact" it requires more ports.

As far as I can tell the gist of what he's saying is that simply to slap all thread types into one register file across the entire GPU is wrong.

Since neither Xenos nor R5xx do that, I guess I could have the wrong end of the stick, but I can't figure out what he's saying if he's not saying that.

Jawed
 
The hooded guy with the axe and the water demos are incredibly sweet. They look very much like pre-rendered CG.

I suppose it'll be 3-4 years before we see games with this level of detail.

Looks to be straight from Project Offset to me, which last I heard was aimed at a 2007/2008 timeframe.

EDIT: Great article BTW, probably the most detailed I have ever seen here at B3D and no doubt will be the benchmark architecture guide throughout G80's life.

And as for the GPU... WOW! Seems just about everyone agrees this is the God of GPUs! I'm getting a GTS myself; hopefully 8xQ transparency AA and 16x HQ AF will be reachable in all games at my target res of 1280x1024 or 1360x768.
 
Uttar - IMO you're wrong, because if you actually had 16 of the full-precision units (i.e. able to output 16 fp32 interpolants per clock), then the Ax + By + C computation is pipelined. Since an SF just squares x and looks up the coefficients A, B, C from a LUT based on x and which SF is being evaluated, you would be able to output 16 SF results per clock, hence my conclusion. (This is all assuming that it works the way described in the paper, of course.)

Here's the link to the slides and paper:
Edit: links are dead :(
http://arith17.polito.it/foils/11_2.pdf
http://arith17.polito.it/final/paper-164.pdf
The architecture detailed there is a unit that can do:
  • one SF per clock
  • 4 attribute channels per clock - at the rate of one channel for four pixels in parallel
Since an interpolant consists of multiple channels (e.g. RGBA for colour), it will naturally take more than one clock to produce an attribute.
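
For anyone trying to follow the throughput argument, here's a rough C++ sketch of the two modes as I read the paper. It's purely illustrative - the table layout, coefficient values and the float-based index split are made up, and the real unit works on fixed-point mantissa slices:

    struct Coeffs { float c0, c1, c2; };

    // Special-function mode: the high bits of the argument index a coefficient
    // table, and the unit evaluates the quadratic C0 + C1*xl + C2*xl^2 on the
    // low bits - one result per "clock".
    float sf_approx(float x, const Coeffs table[], int table_bits)
    {
        int   idx = (int)(x * (1 << table_bits)) & ((1 << table_bits) - 1);
        float xl  = x - (float)idx / (1 << table_bits);
        const Coeffs& c = table[idx];
        return c.c0 + c.c1 * xl + c.c2 * xl * xl;
    }

    // Attribute mode: the same multiply-add structure evaluates the plane
    // equation A*x + B*y + C, here for one channel across four pixels at once.
    void attr_interp(float A, float B, float C,
                     const float px[4], const float py[4], float out[4])
    {
        for (int i = 0; i < 4; ++i)
            out[i] = A * px[i] + B * py[i] + C;
    }

The point of the sketch is just that both modes run on the same MAD-heavy datapath, which is why the per-clock SF rate and attribute rate trade off against each other.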

Hope I'm not diving into the wrong part of the argument...

Jawed
 
OK, so is it safe to assume that the 32-fragment batch size in G80 is a direct result of the setup/rasteriser engine?

The "16x2" pattern seems remarkably similar to the ATI patent for a rasteriser that "walks" two rows (or two columns, depending on the orientation of the triangle, if it's skinny?) at a time.

Jawed
 
My theory was correct! This was to be the flagship, and the G70/G71 were just backups. Not to mention the G70 would have been the refresh of the NV40.

That's one hell of a backup they pulled out of their collective arses :)
 
CBs (constant buffers), under D3D10, are more generalised than DX9's constant support. Is there a chance that G80 implements DX9 constants using the D3D10 CB architecture?

If so, shouldn't it be possible to test the CB architecture by creating a large set of pixel shaders, each with a full set of 32 constants (that's the DX9 limit, isn't it?) and then issuing very small batches (e.g. 32-fragment quads), with each batch being bound to a separate shader?

Will DX9's render state switching overhead totally clobber such a test, though? :???:
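
Something like this rough D3D9 sketch is what I have in mind - the device, the pre-built shaders and the draw geometry are all assumed to exist already, and the shader count is an arbitrary guess at what's needed to defeat any constant caching:

    #include <d3d9.h>

    const UINT NUM_SHADERS   = 256;  // arbitrary: enough distinct shaders to defeat caching
    const UINT CONSTS_PER_PS = 32;   // full ps_2_0 float constant register set

    extern IDirect3DDevice9*      g_Device;                // assumed created elsewhere
    extern IDirect3DPixelShader9* g_Shaders[NUM_SHADERS];  // each compiled with 32 live constants

    void RunConstantSwitchTest()
    {
        float constants[CONSTS_PER_PS * 4];                // 32 float4 registers
        for (UINT i = 0; i < CONSTS_PER_PS * 4; ++i)
            constants[i] = (float)i;                       // arbitrary payload

        g_Device->BeginScene();
        for (UINT i = 0; i < NUM_SHADERS; ++i)
        {
            g_Device->SetPixelShader(g_Shaders[i]);        // new shader...
            g_Device->SetPixelShaderConstantF(0, constants, CONSTS_PER_PS);  // ...new constants
            g_Device->DrawPrimitive(D3DPT_TRIANGLESTRIP, 0, 2);  // tiny quad, ~32 fragments
        }
        g_Device->EndScene();
    }

If the DX9 state-change overhead dominates, the timings would mostly measure the runtime/driver rather than the constant hardware, which is the worry above.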

Jawed
 
We have some more shading results here:
http://www.digit-life.com/articles2/video/g80-part2.html

Interestingly, G80 does not destroy R580+ in dynamic branching. It's only 14% faster at steep parallax mapping. The fur shader performance is expected given the texturing advantage.

In some other complex tests (3-light shading, ordinary parallax mapping) G80 is only around 50% faster than R580, but I suppose that's expected, since the stream processors are equivalent to about 66 shader pipes at 650MHz in vector-heavy code.
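
For what it's worth, I assume the 66-pipe figure comes from treating the scalar ALUs as serialised vec4 units: 128 ALUs × 1.35 GHz ÷ (4 components × 0.65 GHz) ≈ 66.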
 
Disappointed that the article doesn't tackle the subject of in-order versus out-of-order fragment shading - the stuff that the PIOR flag in the patent relates to.

Jawed
 
Are you sure of that? Our tests indicate that the inefficiency due to working on quads and not on pixels is roughly the same on G8x as it is on G7x, and the branching tests Rys did clearly seemed to indicate 16x2 as what the rasterizer tries to output, not 8x4. Of course, we could have done something horribly wrong, although I'll admit I don't see what that could be... :)
Uttar

GPUBench has just been updated in CVS for rectangular branching patterns. We currently see 16x4 as the branch pattern with the drivers used, at least to match perfect coherence performance. The results are already up on the GPUBench site, along with 7900GTX (finally). There have been several bug fixes to GPUBench and improvements on the instruction issue and branching tests. I'm going to try to push out a new binary and source release to sourceforge by tomorrow morning.
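
Roughly speaking, the core of that kind of rectangular branching test is a condition that's uniform within a WxH block of pixels - something like this C++ rendering of the idea (in the real test it lives in the fragment shader, and W and H are swept):

    // Returns the branch condition for pixel (x, y): a checkerboard of
    // block_w x block_h tiles, so every pixel inside a tile takes the same
    // side of the branch. When (block_w, block_h) matches the hardware's
    // branch granularity, performance should match the fully coherent case.
    bool branch_condition(int x, int y, int block_w, int block_h)
    {
        return (((x / block_w) + (y / block_h)) & 1) != 0;
    }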
 
I think some of you should use this thread to say how incredibly wrong you were in your predictions of the G80 architecture. It's almost embarrassing how off everyone was. Nvidia fooled you guys good.
 
Easily admitted.

I will be interested in seeing whether that MUL unit gets a greater workout in the future....

I also look forward to future articles and R600 leaks :>
 
I think the Z/Stencil and AA sample fill-rates numbers are insane.

Insane!
 
We have some more shading results here:
http://www.digit-life.com/articles2/video/g80-part2.html

Interestingly, G80 does not destroy R580+ in dynamic branching. It's only 14% faster at steep parallax mapping. The fur shader performance is expected given the texturing advantage.

In some other complex tests (3-light shading, ordinary parallax mapping) G80 is only around 50% faster than R580, but I suppose that's expected, since the stream processors are equivalent to about 66 shader pipes at 650MHz in vector-heavy code.

I think it's probably too early to judge, given no DX10 drivers and the extreme immaturity of the drivers we do have. We don't really know what effect code transformation has on the architecture. For example, the HLSL compiler tries to "auto-vectorize" some shaders by packing registers and combining instructions where possible, but it's possible compile-time vectorization works against the G80, like CSE did against the NV3x (see the sketch below). It's also possible that TCP priorities and heuristics are programmable via the driver, and they haven't figured out the optimal balance yet. If you look at the CUDA material available to the public, it's clear there's more programmability in the system than the DX10 model exposes.
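
To illustrate the vectorization point (this is a made-up illustration in C++, not actual compiler output): packing independent scalar MULs into one vec4 op is free throughput on a vec4 ALU, but a scalar-per-thread machine just re-serialises it, and any extra MOVs the compiler inserts to gather operands into one register are pure overhead there.

    struct float4 { float x, y, z, w; };

    // "Scalar" form: four independent multiplies, which a scalar ALU issues directly.
    void scalar_form(const float a[4], const float b[4], float out[4])
    {
        for (int i = 0; i < 4; ++i)
            out[i] = a[i] * b[i];
    }

    // "Auto-vectorised" form: the same work packed into one vec4 multiply.
    // On a vec4 pipe this is one instruction; on a scalar pipe it's still
    // four multiplies, plus whatever packing moves were needed to build a and b.
    float4 vector_form(const float4& a, const float4& b)
    {
        float4 r;
        r.x = a.x * b.x;
        r.y = a.y * b.y;
        r.z = a.z * b.z;
        r.w = a.w * b.w;
        return r;
    }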

The other issue is that synthetic tests don't necessarily expose the strengths of the system, just like a serial algorithm won't show the strength of Niagara, or the way only a benchmark with scatter/gather memory access will show the benefits of the R580 memory controller versus G7x. The true advantages of the R300 didn't really show up until DX9 drivers and games arrived; the DX8 benchmarks made its performance look a lot closer to the NV3x than it really was.

Unfortunately, we may have to wait for some time. (Why doesn't GPUBench's PS3.0 dynamic branching test work on R5xx cards? If it's an issue with NV_fragment_program support, why not use GLSL?) It will be interesting to see CUDA benchmarks later.
 