NVIDIA Fermi: Architecture discussion

A film effect is plausible, but I'm not so sure you'd get the same pattern, especially the symmetry and sharp-edged transitions.

Well, you also have to factor in material and height differences. The die photo in question was likely taken either at a mid-metal-layer stop point or from a de-layered die. Die photos of actual finished dies aren't that interesting, because all you see is the top metal layer, which in a C4 design is basically a lot of square pads in a regular array.
 
In addition to that, isn't it also the case that the PR photos are "made pretty" with Photoshop, in addition to any special lighting or filtering that goes into taking the pictures?
 
That's because you're double counting the edge vertices. They get reused in adjacent patches. Okay, I'll agree that you can't cache the verts on every edge, so it'll be slightly lower than two, but that's it.
Ah yes, indeed, double-counting patch-edge vertices is my mistake.
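
As a quick toy check of that ratio (a patch tessellated into a regular NxN grid of quads, with purely illustrative numbers):

# Each patch is an N x N grid of quads, each quad split into two triangles.
# Compare triangles per processed vertex when patch-edge vertices are shared
# across neighbouring patches versus re-processed by every patch.
def tris_per_vertex(patches_per_side, n, share_patch_edges):
    tris = 2 * n * n * patches_per_side ** 2
    if share_patch_edges:
        verts = (patches_per_side * n + 1) ** 2       # counted once for the whole mesh
    else:
        verts = patches_per_side ** 2 * (n + 1) ** 2  # each patch redoes its boundary
    return tris / verts

for n in (4, 8, 16):
    print(n, round(tris_per_vertex(8, n, True), 2), round(tris_per_vertex(8, n, False), 2))
# With full sharing the ratio heads towards 2 triangles per vertex; not being
# able to cache the patch-edge verts pulls it below that, more so for lighter
# tessellation.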

Well it's there. SV_InsideTessFactor. There's nothing weird about it.
I thought you were trying to suggest that this was involved in making vertices on common edges align, but you were merely referring to another factor.

It has nothing to do with comfort zone. If you have tiny triangles, all samples fit in one quad most of the time. If you can only feed one triangle every three clocks, then you will only get one quad into the shading engine every three clocks.
You're assuming that the hardware can put multiple triangles into a hardware thread or even that multiple triangles can occupy a fragment-quad.
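
Spelling that assumption out in numbers (one triangle per 2x2 quad; the 3-clock figure is from the quote, the per-triangle coverage is just an assumption):

# If every tiny triangle owns its own 2x2 quad and setup feeds one triangle
# every 3 clocks, the shading engine sees at most one quad every 3 clocks,
# regardless of how many lanes it has.
tri_rate = 1.0 / 3.0     # triangles per clock out of setup (from the quote)
pixels_per_tri = 1.0     # roughly one covered pixel per tiny triangle (assumed)
lanes_per_quad = 4

quads_per_clock = tri_rate
shaded_pixels_per_clock = quads_per_clock * pixels_per_tri
lane_utilisation = pixels_per_tri / lanes_per_quad

print(shaded_pixels_per_clock)   # ~0.33 useful pixels per clock
print(lane_utilisation)          # 0.25 - three of the four lanes are helper pixels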

Going back to R300, ATI's architecture is based on small hardware threads. I'm still unclear on the actual count of fragments in R300's threads - whether it's 16, 64, 256, etc. But compare this with NV40, which we know has a monster hardware thread allocation across all pixel shading SIMDs, running into thousands for the simplest case of minimal register allocation per fragment.

Why, in this era, would ATI support multiple triangles per hardware thread, when hardware threads are small and when there are so few small triangles, ever?

R520 has a hardware thread size of 16. All the later high-end GPUs have grown this as an artefact of the ALU:TEX increases, so that we now stand at 64. For games in general and for moderate amounts of tessellation, 64 is fine, because average triangle sizes are large enough to occupy a significant portion of all pixel shading hardware threads.

But the point is, the basic architecture has been the same all along: SPI creates a thread of fragments' attributes at the rate of 1 or 2 attributes per fragment per clock, for a single triangle at a time.

Cypress deletes SPI and so some of SPI's responsibility for controlling register allocation/initiation has been dumped on the overall thread control unit. Now LDS has to be populated with barycentrics and attributes, for on-demand interpolation by fragments.
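
Roughly, that per-fragment "pull model" interpolation looks like this (names and layout are mine, purely a sketch):

# Each fragment fetches the triangle's three per-vertex attribute values plus
# its own barycentric weights from LDS, then interpolates in the ALUs instead
# of receiving pre-interpolated attributes from fixed-function SPI hardware.
def interpolate(attr_v0, attr_v1, attr_v2, i, j):
    # i, j are the fragment's barycentric coordinates towards v1 and v2
    return [a0 + i * (a1 - a0) + j * (a2 - a0)
            for a0, a1, a2 in zip(attr_v0, attr_v1, attr_v2)]

# e.g. a texture coordinate stored per vertex
print(interpolate([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], i=0.25, j=0.5))   # [0.25, 0.5]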

My theory is Cypress was due to have a multi-triangle, variable-throughput, thread allocator/LDS-populator. That, perhaps coupled with other changes to tessellation/setup/rasterisation, would have provided the tiny-triangle heft. But that was all dropped.

So you're saying 50M triangles before culling? No game does anywhere near that, so what's the point of this strawman example?
Since I believe that single-pixel triangle tessellation is a strawman, I wonder why I'm here, frankly.

Anyway, any decent adaptive tessellation routine will cull patches based on things like back-facing, viewport and occlusion querying, so your strawman of 50M triangles before culling is irrelevant. These approaches to tessellation will make multi-million rasterised triangles per frame practical.

Overdraw is always going to be a problem, even with a deferred renderer.

So? I'm talking about triangle counts per frame in Heaven. Why do you care about count per call?
The B3D graph shows that the longest draw call is ~28% of frame time. I don't know the frame rate at that time, but let's say it was 20fps, 50ms. Assuming 1.6 million triangles in 14ms (though that could have been 1M triangles in the same time, the article is very vague), that's 114M triangles per second coming out of TS.
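
Spelled out (the 20 fps is my assumption, as above):

frame_time = 1.0 / 20              # assume 20 fps -> 50 ms per frame
longest_call = 0.28 * frame_time   # ~28% of the frame -> ~14 ms
tris = 1.6e6                       # or maybe 1.0e6 - the article is vague
print(tris / longest_call / 1e6)   # ~114 M triangles per second out of TS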

Again, missing the point. What I just showed you is that the performance hit is entirely due to the 3-cycle per tri hit of the tessellator. There is no evidence that inefficiency of the "rasterisation/fragment shading/back-end" makes tessellator improvements pointless. Your argument is completely bunk.
On a close-up of the dragon the frame rate is about 45fps without tessellation. That implies a very substantial per-pixel workload - something like 260 cycles (2600 ALU cycles) per pixel assuming vertex load is negligible - and without knowing what proportion is full-screen passes. Anyway, with tessellation on, it doesn't require much of a drop in average fragments per triangle to kill pixel shading performance.

One of the factors here is we don't know what Heaven's doing per frame. Their site talks about advanced cloud rendering, for example. That'd be a workload that's unaffected by tessellation.

One of the things B3D could have done was to evaluate performance and draw call times with tessellation but no shadowing. I'm not sure how the night time portions of Heaven work, whether there's any shadowing involving tessellated geometry.

ATI can chop the performance hit of tessellation by a factor of six without major architectural changes. All they have to do is make the tessellator generate triangles faster and double the speed of the triangle setup/culling.
Ah yes, it was so easy and obvious, it's a feature of Cypress :rolleyes:

Jawed
 
Those are metal layers, not transistor layers. Wiring all those transistors to each other is not an easy thing...

I'm aware of that, but I was under the impression metal layers use some transistors for buffers and to amplify signals. Or are those in the logic layer, and the wires need to go back down to the bottom in order to use them?
It's nice to learn new things on the forum, so thank you in advance for the answer!
 
Still trying to make sense of the ROP throughput...

One thing which is quite remarkable, and which I almost overlooked in Tridam's numbers, is the ZSamples/sec throughput. From G80 onwards, Nvidia chips could always in theory do 8 ZSamples/clock, though they were never really close to their peak (for non-AA it was impossible anyway due to bandwidth limits).

But look at Tridam's numbers:
http://www.hardware.fr/articles/787-5/dossier-nvidia-geforce-gtx-480.html

The HD5870, which can do 4 ZSamples/clock, reaches almost its peak rate, but only at 8xAA. With no AA it is well below that, apparently due to bandwidth restrictions. The GTX285 is nowhere close to its theoretical peak, and for some odd reason I don't understand actually has its peak at no AA. On the upside, though, it is higher with no AA than the HD5870, suggesting that in this case (as the bandwidth is almost the same) it has a bit better z buffer compression (maybe the z buffer compression ratio simply doesn't increase with higher AA?).
But look what happens with the GTX480. Even though the numbers are still nowhere near their theoretical peaks (if we believe Nvidia and the ROPs run at 700MHz, that would be 270GSamples/s), it reaches twice the throughput of the HD5870 with no AA, and still 1.5 times more with 8xAA, with only a little more memory bandwidth than either the GTX285 or the HD5870. That suggests to me that GF100 has much improved z buffer compression compared to previous chips. It still doesn't really seem to scale with increasing AA (hard to say, though, as the numbers at least increase a little, and maybe the ROPs just can't reach their peak in any case), but with no AA it's almost twice as good as what the HD5870 has.
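
For reference, the theoretical peak I'm assuming there (48 ROPs x 8 ZSamples/clock at 700MHz, if those figures are right):

rops = 48
z_per_rop_per_clock = 8   # same per-ROP rate as earlier chips, assumed
clock = 700e6             # Hz, if the ROPs really run at core clock
print(rops * z_per_rop_per_clock * clock / 1e9)   # ~268.8 GZSamples/s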

Now the color fill results are an entirely different affair, and most of them (if there really are 48 ROPs running at 700MHz) are simply pathetic, but it's at least something...
 
I'm aware of that, but I was under the impression metal layers use some transistors for buffers and to amplify signals. Or are those in the logic layer, and the wires need to go back down to the bottom in order to use them?
It's nice to learn new things on the forum, so thank you in advance for the answer!

ALL transistors of every kind are in the logic layer.
 
In addition to that, isn't it also the case that the PR photos are "made pretty" with Photoshop, in addition to any special lighting or filtering that goes into taking the pictures?

It isn't unusual for them to apply various obfuscation filters to the image, especially if the die photo is well before release.
 
Here's a curious question: why do different areas of the die have different colors? Is it false color, or diffraction? I would assume that it's diffraction, and that different areas of the chip might have different densities of structures and/or regular vs irregular structure, but I'm curious if other factors could be involved, such as doping composition, metal layers, etc.
As said, the photos are typically taken from some of the lower metal layers. The almost regular pitch between the lines there works as a diffraction grating. That means different colors can be seen under different angles (as it is illuminated with white light for the shots). An optical diffraction grating with a sub-micron pitch (several thousand lines per mm) shows fairly similar coloring (but without the structures, of course) when one looks at it with the naked eye. Or just take a CD or DVD! ;)
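
The grating equation shows why the colors shift with viewing angle; the pitch below is just an illustrative guess:

import math

pitch = 1.0e-6   # assumed line pitch in metres (about a thousand lines per mm)
# grating equation: pitch * sin(theta) = m * wavelength; first order, m = 1
for colour, wavelength in (("blue", 450e-9), ("green", 530e-9), ("red", 650e-9)):
    theta = math.degrees(math.asin(wavelength / pitch))
    print(colour, round(theta, 1), "degrees")   # each colour exits at its own angle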
 
That was my intuition. I just think it's cool that, unlike a CD where the patterns are smeared out (presumably due to error correction, data compression, and other distribution effects which randomize things, so most CDs have a uniform diffraction fringe effect), the chip cores actually have a lot more symmetry - in a way, almost organic. Maybe I'm the only one that sees the beauty and art in this, but I find these die shots pleasing to the eye, on both the color and geometric pattern levels.
 
But look what happens with the GTX480. Even though the numbers are still nowhere near their theoretical peaks (if we believe Nvidia and the ROPs run at 700MHz, that would be 270GSamples/s), it reaches twice the throughput of the HD5870 with no AA, and still 1.5 times more with 8xAA, with only a little more memory bandwidth than either the GTX285 or the HD5870.
The Z-sample rate for a 700 MHz part should translate to 179.2 GSamples/s. They said pretty clearly they could do up to 256 zpc if the data's compressible.

http://www.pcgameshardware.com/aid,743526/Some-gory-guts-of-Geforce-GTX-470/480-explained/News/
" For example in 8xAA, the peak GPC output rate is 32*8 = 256 samples per clk, whereas the peak ROP rate is 48 samples per clk."

Now the color fill results are an entirely different affair, and most of them (if there really are 48 ROPs running at 700MHz) are simply pathetic, but it's at least something...
No, they aren't. Color fill is 30*700 = 21 GPixel/s for the GTX 480, for example, since it's no longer the ROPs that are limiting performance; the bottleneck is earlier in the pipe. That's a move away from standard routes, granted.
 
No, they aren't. Color fill is 30*700 = 21 GPixel/s for the currently announced parts, since it's no longer the ROPs that are limiting performance; the bottleneck is earlier in the pipe. That's a move away from standard routes, granted.
And why does a GTX480 only reach 10 GPixel/s with the RGB9e5 or the FP16 color format? I don't see where this could be limited earlier in the pipeline.
 
Half-rate? You cannot split individual pixels across different ROPs, and each ROP takes two cycles to process that (and, FWIW, the FP16) format.
 
Half-rate? You cannot split individual pixels across different ROPs, and each ROP takes two cycles to process that (and, FWIW, the FP16) format.
You wouldn't have to split individual pixels. A GTX480 can spit out 30 pixels per clock (at 700 MHz) to 48 ROPs. I'm quite sure that on average they get evenly distributed to the ROPs (otherwise 48 ROPs would have zero benefit compared to only 30). So half rate for those formats in the ROPs would also imply that it's the ROPs limiting it to about 10 GPixel/s. And yet half rate for 48 ROPs @ 700MHz would yield about 16.8 GPixel/s.

That reminds me of the strange rumors in January (I think Xman was talking about that) involving a core clock of only 450 MHz or something like that. And if you look at the numbers today, it appears to fit - or is there anything speaking against it besides Nvidia's own word?
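
Putting the fill-rate options side by side (30 pixels/clock and 48 ROPs as discussed above; the 450 MHz is just the rumour):

def gpixel_per_s(pixels_per_clock, clock_mhz):
    return pixels_per_clock * clock_mhz / 1000.0

print(gpixel_per_s(30, 700))       # 21.0 - raster/export limit at 700 MHz
print(gpixel_per_s(48 / 2, 700))   # 16.8 - 48 half-rate ROPs (FP16/RGB9E5) at 700 MHz
print(gpixel_per_s(48 / 2, 450))   # 10.8 - 48 half-rate ROPs at the rumoured ~450 MHz
# The measured ~10 GPixel/s sits much closer to the last line than to the middle one.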
 
You wouldn't have to split individual pixels. A GTX480 can spit out 30 pixels per clock (at 700 MHz) to 48 ROPs. I'm quite sure that on average they get evenly distributed to the ROPs (otherwise 48 ROPs would have zero benefit compared to only 30). So half rate for those formats in the ROPs would also imply that it's the ROPs limiting it to about 10 GPixel/s. And yet half rate for 48 ROPs @ 700MHz would yield about 16.8 GPixel/s.

That reminds me of the strange rumors in January (I think Xman was talking about that) involving a core clock of only 450 MHz or something like that. And if you look at the numbers today, it appears to fit - or is there anything speaking against it besides Nvidia's own word?

Right, it would fit. My mistake - I was thinking of 18 ROPs running idle after a while, not taking into account that the shader engine doesn't have to be half-rate.
 
Is that 30 number from the interview? How do you lose 2 pixels due to the disabled SM? Rasterization is supposed to be a GPC level function. I don't get it. :???:
Maybe an SM can only export two pixels per clock to the ROPs?

Btw., I still think there are no such things as GPCs ;)
 
You wouldn't have to split individual pixels. A GTX480 can spit out 30 pixels per clock (at 700 MHz) to 48 ROPs. I'm quite sure that on average they get evenly distributed to the ROPs (otherwise 48 ROPs would have zero benefit compared to only 30). So half rate for those formats in the ROPs would also imply that it's the ROPs limiting it to about 10 GPixel/s. And yet half rate for 48 ROPs @ 700MHz would yield about 16.8 GPixel/s.

That reminds me of the strange rumors in January (I think Xman was talking about that) involving a core clock of only 450 MHz or something like that. And if you look at the numbers today, it appears to fit - or is there anything speaking against it besides Nvidia's own word?

Yeah we were talking about this before, I still can't make any sense of it. There are 48 ROPs, but they either act like 48 ROPs at ~450MHz (but Nvidia claims they have a 700MHz clock, and overclocking the core also increases ROP throughput, though a fixed divider might be a possibility) or like 32 ROPs at ~700MHz (which is essentially the same). CarstenS is right, though, about the 256 zpc max, since even for 8xAA this is single-cycle and hence limited by the rasterization rate. And since Nvidia claimed that the 32-pixel throughput limit goes down with disabled SMs, that would make it only 240 zpc - hence the measured rate would indeed be close to 100% of the theoretical rate. So maybe Nvidia has killer compression for high AA levels too and it just doesn't show up in this test, but in any case compression really looks good even with no AA.
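
In numbers (again just the figures from above):

# 48 ROPs at ~450MHz vs 32 ROPs at 700MHz - essentially the same rate
print(48 * 450 / 1000.0)      # 21.6 GPixel/s worth of full-rate ROP throughput
print(32 * 700 / 1000.0)      # 22.4 GPixel/s worth of full-rate ROP throughput

# Z ceiling if the raster limit really drops to 30 pixels/clock with a disabled SM
print(30 * 8 * 700e6 / 1e9)   # 168 GZSamples/s theoretical (240 zpc at 700MHz)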

CarstenS said:
Right, it would fit. My mistake - I was thinking of 18 ROPs running idle after a while, not taking into account that the shader engine doesn't have to be half-rate.
That's what I was trying to tell you in about a dozen posts :)
 
Yeah we were talking about this before, I still can't make any sense of it. There are 48 ROPs, but they either act like 48 ROPs at ~450MHz (but Nvidia claims they have a 700MHz clock,

Wacky idea: maybe not 450MHz but 462MHz. This would be 924MHz/2 ??
 