The Official NVIDIA G80 Architecture Thread

nAo, sorry, I was a little too deep in the car analogies I sometimes use. With "ALU injection" I mean the process of moving a thread from the big storage array to the ALU/FPU for execution. In the case of R520 we have such an "injection" IIRC every 4 clocks.
State, not stage. My mistake. We need at least two, but I believe there are more.
 
Wow this architecture is sweet :D. I third or fourth the request for a CUDA article ASAP.

Did NV disclose anything about how they've implemented context switching?

On to the speculation: Remember that paper by Stuart Oberman on interpolation (the one that talks about a unit that can do both special function evaluation and interpolation)? I think there are actually only 4 (full precision) interpolator/SF units:

The unit is basically computing Ax + By + C. When operating on quads, take the quad center (xq, yq) and the deltas from it to the individual pixel centers (dx, dy). You want to compute
A*(xq + dx) + B*(yq + dy) + C for each pixel in the quad. Rewrite this (as in the presentation):

Code:
(A*xq + B*yq + C)     +     (A*dx + B*dy)
^^^^^^^^^^^^^^^^^           ^^^^^^^^^^^^^
once per quad;              4x per quad (greatly reduced-precision multipliers)
this is the only unit
with enough precision in the multipliers to do the SF (see paper).
This means you have 4 such units in the cluster, i.e. you need 4 cycles for 16 SFs.
So I'm willing to bet that the 16 ALUs per cluster are actually still grouped into a 4-quad arrangement for purposes of gradient calculations and more efficient interpolation. That also suits the texture address HW (which can probably take advantage of texture address coherency using similar techniques, and shares LOD calculation across all pixels in a quad).
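
To make the split concrete, here's a small CPU-side sketch (just my own illustration of the scheme above, not anything NVIDIA has confirmed) of interpolating one attribute across a quad this way:

Code:
#include <stdio.h>

/* Quad-based interpolation sketch: the full-precision term
 * A*xq + B*yq + C is computed once per quad; each pixel then only
 * adds the small correction A*dx + B*dy, which is why the per-pixel
 * multipliers can use greatly reduced precision (|dx|, |dy| <= 0.5). */
int main(void)
{
    float A = 0.25f, B = -0.5f, C = 3.0f;  /* plane equation coefficients */
    float xq = 10.0f, yq = 20.0f;          /* quad center */
    /* offsets from the quad center to the four pixel centers */
    const float dx[4] = { -0.5f,  0.5f, -0.5f, 0.5f };
    const float dy[4] = { -0.5f, -0.5f,  0.5f, 0.5f };

    float per_quad = A * xq + B * yq + C;  /* full precision, once per quad */
    for (int i = 0; i < 4; i++) {
        float per_pixel = A * dx[i] + B * dy[i];  /* narrow multipliers */
        printf("pixel %d: %f\n", i, per_quad + per_pixel);
    }
    return 0;
}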
 
psurge: I can confirm the hardware still works natively on quads for pixels. If you look at the diagram, you'll see I even sneaked that in, as well as the basic interpolation equation. Hah! :) While your explanation is interesting, I think it's more likely it just needs 4 cycles because it does 4 iterations. I could be wrong on this though, and would love to have more info on the actual unit's implementation...


Uttar
 
Psurge, as far as I know the 16 ALUs will always process a 4*4 pixel block. I call it a quad of quads, or a "sedec" for short (from the Latin sedecim, 16). This also allows for future optimizations when it comes to interpolation (and texture address calculation).
 
I really hope it does not process 4x4 blocks... that would be bloody inefficient.
Since there are 4 TAs per cluster, I think it processes 4 independent quads.
 
Just a question...
G80 coverage sampling AA ~= Parhelia Fragment AA?
- doesn't work on stencil shadows
- performance ~= 4x MSAA

bye
 
Psurge, as far as I know the 16 ALUs will always process a 4*4 pixel block. I call it a quad of quads, or a "sedec" for short (from the Latin sedecim, 16). This also allows for future optimizations when it comes to interpolation (and texture address calculation).
Are you sure of that? Our tests indicate that the inefficiency due to working on quads rather than individual pixels is roughly the same on G8x as it is on G7x, and the branching tests Rys did clearly seemed to indicate that the rasterizer tries to output 16x2, not 8x4. Of course, we could have done something horribly wrong, although I'll admit I don't see what that could be... :)

Topman: Yes, you could compare some of its characteristics to Fragment AA, although the way it is implemented is completely different, and from what I can see it reacts quite differently in certain corner cases. Obviously, the biggest difference between it and FAA for the end user is that it's nearly never worse quality than 4x MSAA, while FAA would look as bad as if there was no AA at all when the algorithm "failed".

55%+ of my programming time for G80 was related to CSAA. So expect some totally awesome goodies related to it sooner rather than later, in a follow-up article (not sure yet whether we'll make it part of the IQ piece) - stay tuned.

Uttar
 
Uttar - IMO you're wrong, because if you actually had 16 of the full-precision units (i.e. able to output 16 fp32 interpolants per clock), then the Ax + By + C computation is pipelined. Since an SF evaluation just takes y = x*x and looks up the coefficients A, B, C based on x and the SF type from a LUT, you would be able to output 16 SF results per clock; hence my conclusion. (This is all assuming that it works the way described in the paper, of course.)
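
Here's a rough CPU-side sketch of the kind of table-based quadratic SF evaluation the paper describes, to show how it maps onto the same multiply-add structure as Ax + By + C. The segment count and the Taylor-derived table contents are just my placeholders; the paper derives proper minimax coefficients:

Code:
#include <stdio.h>

/* Table-based quadratic special-function sketch (after Oberman/Siu):
 * the high bits of x pick a segment and its LUT coefficients c0,c1,c2;
 * the low-bits offset xl is used to evaluate c0 + c1*xl + c2*xl*xl,
 * reusing the interpolator's A*x + B*y + C datapath with y = xl*xl.
 * Coefficients below come from a Taylor expansion of 1/x around each
 * segment midpoint, purely for illustration. */
#define SEGS 64  /* hypothetical LUT segment count */

int main(void)
{
    double c0[SEGS], c1[SEGS], c2[SEGS];

    /* build the LUT for f(x) = 1/x on [1, 2) */
    for (int i = 0; i < SEGS; i++) {
        double m = 1.0 + (i + 0.5) / SEGS;   /* segment midpoint */
        c0[i] =  1.0 / m;
        c1[i] = -1.0 / (m * m);
        c2[i] =  1.0 / (m * m * m);
    }

    double x = 1.7321;
    int seg = (int)((x - 1.0) * SEGS);            /* "high bits" of x  */
    double xl = x - (1.0 + (seg + 0.5) / SEGS);   /* "low bits" offset */
    double y = xl * xl;                           /* the x*x mentioned above */
    double approx = c0[seg] + c1[seg] * xl + c2[seg] * y;

    printf("1/%f ~= %f (exact: %f)\n", x, approx, 1.0 / x);
    return 0;
}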

Here's the link to the slides and paper:
http://arith17.polito.it/foils/11_2.pdf
http://arith17.polito.it/final/paper-164.pdf
Edit: links are dead :(
 
Are you sure of that? Our tests indicate that the inefficiency due to working on quads rather than individual pixels is roughly the same on G8x as it is on G7x, and the branching tests Rys did clearly seemed to indicate that the rasterizer tries to output 16x2, not 8x4. Of course, we could have done something horribly wrong, although I'll admit I don't see what that could be... :)

Uttar

No, I am not sure. As I have no hardware of my own yet, I can only repeat what I was told. They talk about a granularity of 16 pixels in 4*4 blocks.

I don't know which shaders you used, but I know from some past experiments that the ROPs can easily give you wrong results if you look for block sizes.
 
Demirug - interesting. That NV branching slide does talk about coherent 4x4 blocks... are you basing this on info from NV, or on tests you've done?

How's this for a test:

Send in long rectangles 4 pixels high and branch based on screen-space y: if y mod 4 > 1, run a loop doing dependent texture reads (a really huge number of them); otherwise do, say, half as many. This should give some idea of whether the rasterizer is submitting 2-pixel-high or 4-pixel-high blocks as batches to the TPCs.
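
And here's a quick CPU-side cost model (entirely my own illustration, with made-up per-branch costs) of why the two hypotheses should produce measurably different totals:

Code:
#include <stdio.h>

/* Model of the proposed test: pixels in a 4-row-high rectangle branch
 * on y mod 4; rows 2..3 take the expensive path, rows 0..1 the cheap
 * one.  If the hardware branches at block granularity, every pixel in
 * a block pays for the slowest path taken inside that block. */
#define W 256
#define H 4
#define LONG_COST  200   /* hypothetical cost of the big dependent-read loop */
#define SHORT_COST 100   /* half as many reads */

static long cost_with_blocks(int bw, int bh)
{
    long total = 0;
    for (int by = 0; by < H; by += bh)
        for (int bx = 0; bx < W; bx += bw) {
            int worst = 0;
            for (int y = by; y < by + bh; y++)       /* find the slowest  */
                for (int x = bx; x < bx + bw; x++) { /* pixel in the block */
                    int c = (y % 4 > 1) ? LONG_COST : SHORT_COST;
                    if (c > worst) worst = c;
                }
            total += (long)worst * bw * bh;  /* whole block pays the worst */
        }
    return total;
}

int main(void)
{
    printf("16x2 blocks: %ld\n", cost_with_blocks(16, 2)); /* branches stay coherent */
    printf(" 4x4 blocks: %ld\n", cost_with_blocks(4, 4));  /* every block diverges   */
    return 0;
}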
 
Nothing wrong with 4x4 blocks per se... but I'm drawing small triangles, and I don't think allocating 4x4 pixel blocks would be a good idea ;)
 
Not sure where the right place for this is, but there's an interview with Jen-Hsun here:

I took some time to interview Nvidia CEO Jen-Hsun Huang about the company's latest graphics chip, a "general purpose GPU" that looks and smells suspiciously like a CPU. But it's really not an attempt to bump Intel aside. It's Nvidia's way to find new markets for the latent potential of the processing power of its PC graphics chips. Read on to hear more.


http://blogs.mercurynews.com/aei/2006/11/nvidia_launches.html
 
Jen-Hsun: Yes, we tried to get it out last year. But it was just too big. So it’s late by our own standards. In the end, it cost us 4 years. It was important to get it right. There were 600 man years total. We started with 10 people working on it and grew to 300 eventually.

My theory was correct! This was to be the flagship, and the G70/G71 were just backups. Not to mention that the G70 would have been the refresh of the NV40.
 
The hooded guy with the axe and the water demos are incredibly sweet. They look very much like prerendered CG.

I suppose it'll be 3-4 years before we see games with this level of detail.
 
To me, at the launch event, the Smoke Box demo was the most impressive.

BTW, Jen claimed during the launch not only that it took 4 years, 1000 man years, and $600 million, but also that the project was conducted in heavy secrecy.
 
To me, at the launch event, the Smoke Box demo was the most impressive.

BTW, Jen claimed during the launch not only that it took 4 years, 1000 man years, and $600 million, but also that the project was conducted in heavy secrecy.

Ha, good old Jen-Hsun, big day for him, and they earned it, but the numbers keep creeping up. He just told Takahashi 600 man years in a blog published today! :LOL:
 
From the video, it seems the simulation grid is insanely fine-grained...


Yes, projected at high-def resolutions on a massive screen, I was unable to see any particles larger than a pixel, nor any of the usual faked volumetric smoke rendering via layers. They claimed during the demo that it is a pure particle simulation. They also claimed during the geometry shader demo that the water system is particle-based, with per-pixel collision detection done on the GPU in order to detect which surfaces to mark "wet".

There was also a Frog demo, using the GPU to perturb the frog's geometry. They slapped it around a la Black And White, and they grabbed its skin and pulled it, letting it snap back into shape. So it was kinda like a "cloth sim", if the cloth were elastic.


Finally, they claimed the Adrianne demo used real-time subsurface scattering on the skin (not precalced).
 