The Official NVIDIA G80 Architecture Thread

OK, so is it safe to assume that the 32-fragment batch size in G80 is a direct result of the setup/rasteriser engine?

That's what I got out of my first read, and it seems puzzling to me. Clearly the architecture can perform 16-sized batches, and it isn't clear to me what's being saved between the rasterizer and the SIMD-cluster (or whatever the appropriate nomenclature is), or, alternatively, what else is being optimized by choosing 32 over 16. If anyone has any insight there, I'd appreciate a pointer.
 
Coverage sample AA doesn't seem to do all that much. I'm more excited about the new true 8x MSAA mode (8xQ).

8800GTS seems to be a sweet deal for overclockers:
http://www.firingsquad.com/hardware/nvidia_geforce_8800_preview/page23.asp

I'm really impressed with G80.

The only thing I really miss is an SS transparency AA performance option with 1/2 or 1/4 the samples. I might get a GTS, but in some games having 8x SS transparency AA might be too much. 8x/16x edge AA plus 4x or 6x transparency SS AA would be awesome.
 
I'd be interested to find out more about this coverage sample AA. I haven't yet seen it adequately described.
 
Are the G80 scalar interpolator and special function ALUs the ones described in this paper:

"A High-Performance Area-Efficient Multifunction Interpolator"
http://portal.acm.org/citation.cfm?id=1078021.1078076

I know this paper was brought up at some point on this board, but was it discussed in the context of G80? Did we know they were going to use these in G80? Probably not a surprise I guess.
 
Topman: Yes, you could compare some of its characteristics to Fragment AA, although the way it is implemented is completely different and it has quite different reactions to certain corner cases that I can see. Obviously, the biggest difference between it and FAA for the end user is that it's nearly never worse quality than 4x MSAA, while FAA would look as bad as if there were no AA at all when the algorithm "failed".
The only thing I've read so far about CSAA was at Anandtech, but my impression was that it calculates coverage for 16x, but drops back to 4x under certain conditions. This drop can be seen in some of the images. In general FAA handled overflows in a different fashion, but the interesting part is stencil. On Parhelia, 8x FAA aliased a pixel if it hit a stencil test, to avoid potentially drawing the wrong colour. CSAA seems to do this as well, only it has a better fallback: 4x AA.
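The coverage-weighting idea being described can be sketched as a toy resolve. This is just an illustration, not NVIDIA's actual implementation; the 4-colour/16-coverage-sample split matches the 16x CSAA configuration discussed above, but everything else here (function names, the tagging scheme) is an assumption:

```python
# Toy model of a coverage-sample AA resolve (illustrative only, not
# NVIDIA's algorithm): 4 stored colour samples plus 16 coverage samples,
# each coverage sample tagged with the colour sample it maps to.

def csaa_resolve(colors, coverage_tags):
    """Weight each stored colour by how many of the 16 coverage
    samples reference it, then average."""
    assert len(colors) == 4 and len(coverage_tags) == 16
    counts = [coverage_tags.count(i) for i in range(len(colors))]
    total = sum(counts)
    return tuple(
        sum(c[ch] * n for c, n in zip(colors, counts)) / total
        for ch in range(3)
    )

# Example: an edge pixel where 12 of 16 coverage samples see the red
# foreground triangle (colour 0) and 4 see the blue background (colour 1).
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
tags = [0] * 12 + [1] * 4
print(csaa_resolve(colors, tags))  # (0.75, 0.0, 0.25)
```

When more than 4 distinct colours land in a pixel, a scheme like this has to drop coverage information, which is one way to picture the observed fallback to plain 4x behaviour.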
 
It certainly appears to be. I admit to not having considered the number of these that would be present. What in my mind was simply some number of MAD units has turned out to be, instead, a MAD + MUL + interpolator++ bag of functional units.

Imagine is heavily into the whole SIMD thing with VLIW constructs across asymmetric ALUs, which is different from what we're looking at.

Hah!
 
The only thing I've read so far about CSAA was at Anandtech, but my impression was that it calculates coverage for 16x, but drops back to 4x under certain conditions. This drop can be seen in some of the images. In general FAA handled overflows in a different fashion, but the interesting part is stencil. On Parhelia, 8x FAA aliased a pixel if it hit a stencil test, to avoid potentially drawing the wrong colour. CSAA seems to do this as well, only it has a better fallback: 4x AA.

It's mostly really thin geometry that suffers with CSAA. Usually these issues are not all that distracting in real time. The larger the triangle, the more effective it seems to be. I pretty much covered these issues in the investigation in my signature if you're curious.
 
It's mostly really thin geometry that suffers with CSAA. Usually these issues are not all that distracting in real time. The larger the triangle, the more effective it seems to be. I pretty much covered these issues in the investigation in my signature if you're curious.
I had already bookmarked your writeup, but will have to wait until tomorrow to check it out. :smile:
 
Sure! CSAA is pretty good. But I'm also glad NVIDIA updated their traditional multisampling algorithm for the hardcore IQ enthusiasts. 8xQ is frankly awesome quality for an 8x jittered-grid multisampling approach, and it's excellent when tied in with 16x CSAA. It's about as no-compromise as you can get. And when you consider the performance hit of 16xAA (usually about 10% down from 4xAA), you really have to be impressed with its performance and quality tradeoffs, IMO. Considering 4xAA is now single-cycle on G80, I now consider 4xAA the new 2xAA mode, and 16xAA the new 4x mode that most G80 users will likely opt to use. Even though it's a bit more complicated than that. :)
 
Thinking about it, I may have just missed a trick in my shader when testing the raster pattern. The hardware might well raster any of 16x2, 8x4, 4x8 and 2x16, depending on the triangles.
 
Thinking about it, I may have just missed a trick in my shader when testing the raster pattern. The hardware might well raster any of 16x2, 8x4, 4x8 and 2x16, depending on the triangles.
I think that's the case. I also believe that it can just rasterize 4 2x2 quads coming from different primitives not sharing any edge between them; you really don't want to run at 25% efficiency if your triangles are small :)
 
I think that's the case. I also believe that it can just rasterize 4 2x2 quads coming from different primitives not sharing any edge between them; you really don't want to run at 25% efficiency if your triangles are small :)
Yeah, that was probably a given based on the fact our "small triangles to test quad efficiency" tests were giving roughly the same results on G7x and G8x. I guess we definitely will want to run some more raster pattern tests, eventually... :)

Uttar
 
Yeah, that was probably a given based on the fact our "small triangles to test quad efficiency" tests were giving roughly the same results on G7x and G8x. I guess we definitely will want to run some more raster pattern tests, eventually... :)

Uttar

If I am allowed to make a suggestion.

Use one large triangle over the whole screen. Let the shader sample from different textures that contain the test patterns. Branch based on the sampled value.
 
Use one large triangle over the whole screen. Let the shader sample from different textures that contain the test patterns. Branch based on the sampled value.
That is actually exactly what Rys did :) We didn't have the time to do extensive geometry-based testing, on the other hand, I think. So what we really determined is the hardware's preferred rasterization pattern at a low level, when the primitive is more than wide enough. As Rys (and nAo) said above, the hardware is probably capable of less simple patterns when it has to, in order to keep the shader core used as efficiently as possible. Efficiency in real-world situations would be nothing to write home about otherwise!

And here's the relevant quote from the article:
The raster pattern is observed to be completely different when compared to any programmable shader NVIDIA has built before, at least as far as our tests are able to measure. We engineered a pixel shader that branches depending on pixel colour, sampling a full-screen textured quad. Adjusting the texture pattern for blocks of colour and measuring branching performance let us gather data on how the hardware is rasterising, and also let us test branch performance at the same time. Initially working with square tiles, it's apparent that pixel blocks of 16x2 are as performant as any other, therefore we surmise that a full G80 attempts to rasterise screen tiles 16x2 pixels in size (8 2x2 quads) before passing them down to the shader hardware.

Uttar
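The probe described in that quote can be modelled on the CPU. The sketch below is my own reconstruction of the idea, not Rys's actual shader; the resolution, the checker pattern, and the candidate tile shapes are all assumptions. It paints the screen in rectangular colour blocks and measures, for a hypothesised hardware tile shape, what fraction of tiles are branch-coherent (all one colour); on real hardware, the pattern whose blocks match the rasteriser's tile shape shows up as the fastest branching case.

```python
# CPU model of the raster-pattern probe: a screen painted in a two-colour
# checker of (block_w x block_h) blocks, walked in hypothetical hardware
# tiles of (tile_w x tile_h). A tile is branch-coherent if every pixel in
# it sees the same colour.

W, H = 64, 64  # assumed test resolution

def coherent_fraction(block_w, block_h, tile_w, tile_h):
    coherent = total = 0
    for ty in range(0, H, tile_h):
        for tx in range(0, W, tile_w):
            # Set of colours (0 or 1) seen inside this tile.
            colours = {((tx + x) // block_w + (ty + y) // block_h) % 2
                       for y in range(tile_h) for x in range(tile_w)}
            total += 1
            coherent += (len(colours) == 1)
    return coherent / total

# A 16x2 checker pattern is fully coherent for 16x2 tiles...
print(coherent_fraction(16, 2, 16, 2))  # 1.0
# ...but every 4x8 tile straddles a colour boundary of that pattern.
print(coherent_fraction(16, 2, 4, 8))   # 0.0
```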
 
Hello,

Thanks for a kick-ass review. I'm looking forward to the other parts.

I have a question regarding the tests you have run. Will you make the results and/or programs available for download?
 
Trilinear filtering perf?

(This was posted as a separate thread, but should really belong here, sorry for that..)

The application I am working on performs huge amounts of volume texture lookups.
Think millions of on-the-fly volume gradients, each requiring up to 7 volume texture lookups, each with trilinear interpolation.

So my question is what can be expected of G80 performance in this field?
Where does trilinear interpolation of L16 or L8 textures fit within the specified INT8/FP16 etc. bilinear rates?
Any ideas on how the introduced asynch sampling and per-cluster vs. global sampler arrays might affect this case?
Any hopes of increased speed due to different cache approaches?
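For reference, here is a small NumPy model of the workload described above (a CPU stand-in for the GPU path; the function names are mine): a central-difference gradient costs 6 trilinear volume lookups, plus 1 for the sample value itself, which matches the "up to 7" figure.

```python
# Model of an on-the-fly volume gradient: central differences along each
# axis, each endpoint fetched with a trilinear-filtered volume lookup.
import numpy as np

def trilinear(vol, p):
    """One trilinear-filtered lookup into a 3D volume at point p."""
    i = np.floor(p).astype(int)
    f = p - i
    i = np.clip(i, 0, np.array(vol.shape) - 2)
    acc = 0.0
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                acc += w * vol[i[0] + dx, i[1] + dy, i[2] + dz]
    return acc

def gradient(vol, p, h=0.5):
    """Central-difference gradient: 6 trilinear lookups."""
    e = np.eye(3)
    return np.array([
        (trilinear(vol, p + h * e[k]) - trilinear(vol, p - h * e[k])) / (2 * h)
        for k in range(3)
    ])

# Linear test volume f(x, y, z) = 2x + y, so the gradient is (2, 1, 0).
vol = np.fromfunction(lambda x, y, z: 2.0 * x + y, (8, 8, 8))
print(gradient(vol, np.array([3.5, 3.5, 3.5])))  # ~[2.0, 1.0, 0.0]
```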
 
Nothing wrong with 4x4 blocks per se... but if I'm drawing small triangles, I don't think allocating 4x4 pixel blocks would be a good idea ;)
I don't think 4x4 blocks necessarily means bad performance on small triangles. It depends on how you do things. For example, Xenos allows the pixels in a vector to come from different triangles. G80 could be similar, simply with an additional constraint that all pixels in a batch come from the same 8x4 block (or 16x2 block). For all we know Xenos could have the same constraint on its 8x8 tiles. The individual quads in each batch could even cover the same 4 pixels but with different coverage masks, as would often be the case when drawing small triangles. Given the insane z-rejection rate of G80 (and all recent ATI cards for that matter), pixels are divided in parallel this way anyway. Each pixel grouper would simply wait until it was either full of pixels or until a triangle came that didn't fit into any of the existing pools, so the fullest group would be dispatched.

The only time such a scheme would hurt you is if you had a strip of triangles straddling two tiles, and they were long enough that the aforementioned duplication didn't fill up the tiles, AND a bunch of triangles sent before and after the strip did not land in the same tiles. That doesn't strike me as a particularly common case, so there's no real reason to have a crossbar that pools quads from different tiles.
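The pooling scheme sketched in this post might look something like the following toy model. To be clear, this is pure speculation about G80's internals; the 8-quad (32-pixel) batch and 16x2 tile numbers are taken from the discussion above, and everything else is invented for illustration:

```python
# Toy quad pooler: quads from different small triangles are pooled per
# screen tile, and a 32-pixel batch is dispatched once a tile's pool
# holds 8 2x2 quads. (Speculative model, not G80's actual hardware.)

QUADS_PER_BATCH = 8       # 8 x (2x2 quad) = 32 pixels
TILE_W, TILE_H = 16, 2    # hypothetical screen tile a pool is bound to

class QuadPooler:
    def __init__(self):
        self.pools = {}    # tile coordinate -> list of pending quads
        self.batches = []  # dispatched 32-pixel batches

    def submit(self, quad_x, quad_y, tri_id):
        """Add one 2x2 quad (in quad coordinates) from triangle tri_id."""
        tile = (quad_x * 2 // TILE_W, quad_y * 2 // TILE_H)
        pool = self.pools.setdefault(tile, [])
        pool.append((quad_x, quad_y, tri_id))
        if len(pool) == QUADS_PER_BATCH:
            self.batches.append(self.pools.pop(tile))

    def flush(self):
        """Dispatch any partially full pools (end of frame / state change)."""
        self.batches.extend(self.pools.values())
        self.pools.clear()

# Eight one-quad triangles landing in the same 16x2 tile fill one batch,
# instead of eight batches running at 1/8 occupancy.
p = QuadPooler()
for i in range(8):
    p.submit(i, 0, tri_id=i)
p.flush()
print(len(p.batches), len(p.batches[0]))  # 1 8
```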
 
freka586, you should get double the performance on G80 compared to G71. In ShaderMark there are a couple of tests that are volume-texture bound (as witnessed by the lack of improvement between R520 and R580), and they do indeed double on G80. G80 has more cache too, so that always helps, assuming a lack of it is a problem on whatever chip you're using right now.

One question I have, though, is whether textures with fewer than 4 channels get filtered quicker (e.g. does R16G16 get free 2xAF just like RGBA8?). I see no reason why they wouldn't, but I guess you never really know. In your case, I would hope that an L16 trilinear volume lookup would be performed 32 times per clock.
 