NVIDIA Fermi: Architecture discussion

Being a bit silly for a moment

Thnx. Seeing dies at such resolutions is an eye-opening experience.
If a pixel on a monitor corresponded with a "transistor", you'd need a grid of 22x35 2560x1600 monitors to see the transistors :oops: which would take 129 Eyefinity 6 cards to run :p
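For the curious, here's a quick back-of-envelope check of those figures; the ~3 billion transistor count for GF100 is an assumption on my part, not a confirmed number:

```python
# Back-of-envelope check of the monitor joke above. The ~3 billion transistor
# figure for GF100/Fermi is an assumption here.
import math

pixels_per_monitor = 2560 * 1600
monitors = 22 * 35                       # the 22x35 grid
print(monitors * pixels_per_monitor)     # 3,153,920,000 pixels, roughly one per transistor
print(math.ceil(monitors / 6))           # 129 Eyefinity 6 cards (6 displays each)
```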

Jawed
 
In general with lower tessellation factors and anisotropy, you'll get notably less than 2 triangles per extra vertex. Go count some tessellated patches' triangles and vertices if you don't believe me. I counted one tessellated patch with 146 triangles and 87 vertices, earlier ;)
That's because you're double counting the edge vertices. They get reused in adjacent patches. Okay, I'll agree that you can't cache the verts on every edge, so it'll be slightly lower than two, but that's it.
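Here's a rough sketch of why the ratio behaves that way, assuming a uniformly tessellated quad domain (a simplification of the real per-edge and inside factors):

```python
# Triangles vs. vertices for a uniformly tessellated quad patch (a
# simplification -- real patches use separate edge and inside factors).
def quad_patch_counts(n):
    tris = 2 * n * n             # each of the n*n cells splits into two triangles
    verts = (n + 1) * (n + 1)    # grid vertices, counting each edge vertex once
    return tris, verts

for n in (2, 4, 8, 16, 64):
    tris, verts = quad_patch_counts(n)
    print(f"factor {n}: {tris} tris, {verts} verts, {tris / verts:.2f} tris/vert")
# The ratio approaches 2 from below as the factor grows; at low factors, and
# when edge vertices can't be reused across adjacent patches, it drops well
# under 2 -- in the same ballpark as the 146-triangle / 87-vertex patch above.
```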

I don't remember hearing about face factor
Well, it's there: SV_InsideTessFactor. There's nothing weird about it. You're right, that image is using the same tessellation factor for all edges on a mesh, so it's not going to give you any idea of what fancy tessellation structures look like.

AMD says that the architecture's comfort zone bottoms out at 8 fragments per clock, in effect. It's not "my scenario", it's how the hardware works.
It has nothing to do with comfort zone. If you have tiny triangles, all samples fit in one quad most of the time. If you can only feed one triangle every three clocks, then you will only get one quad into the shading engine every three clocks.

The rasteriser is the bottleneck on hardware thread generation, I presume: a new hardware thread can be started once every 4 cycles per group of 10 SIMDs - and only one SIMD can start a hardware thread in any 4 cycle window, when setup is exporting 1-pixel triangles.

TS, in this scenario, is producing triangles every 3 cycles. So the SIMDs can't go any faster. They're starved by the huge granularity of rasterisation and thread generation, not by lack of triangles.
You're confusing threads with quads or doing something else. One thread every four clocks is the max you will ever need because a rasterizer can only generate 4 quads per clock at most and a thread has 16 quads in it. If you have one pixel triangles being fed to each rasterizer once every six clocks (on average), then each rasterizer will only produce a quad once every six clocks and sit idle 83% of the time. The thread will not be full until it gets 16 quads, which means one thread every 96 clocks.

The rasterizers can handle six times the triangle throughput of the TS before they become a bottleneck. Before that, though, setup will be a bottleneck, but even that can handle 3x the speed of the TS.
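To spell out the arithmetic, here's a minimal sketch, assuming a 64-pixel thread (16 quads) and a rasterizer peak of 4 quads per clock, as described above:

```python
# The thread-fill arithmetic from the post above, made explicit. Assumes a
# 64-pixel thread (16 quads) and a rasterizer peak of 4 quads per clock.
QUADS_PER_THREAD = 16          # 64 pixels / 4 pixels per quad
PEAK_QUADS_PER_CLOCK = 4

print(QUADS_PER_THREAD / PEAK_QUADS_PER_CLOCK)   # 4.0 clocks per thread, best case

# One-pixel triangles arriving every 6 clocks: one quad per 6 clocks, so
# 16 quads take 96 clocks, and the rasterizer is idle 5 clocks out of 6.
clocks_per_triangle = 6
print(QUADS_PER_THREAD * clocks_per_triangle)    # 96 clocks per thread
print(1 - 1 / clocks_per_triangle)               # ~0.83 idle fraction
```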

You're missing the point, this architecture is designed for big triangles.
You're missing the point. This rasterizer can run at 6.25% efficiency with single-pixel triangles instead of the ~1% it does now due to limitations of the tessellator and setup. That's a big difference.
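Where those percentages come from, assuming a rasterizer peak of 16 pixels per clock (my assumption here):

```python
# Where 6.25% and ~1% come from, assuming a rasterizer peak of 16 pixels/clock.
PEAK_PIXELS_PER_CLOCK = 16

best = 1 / PEAK_PIXELS_PER_CLOCK          # one 1-pixel triangle per clock
now = (1 / 6) / PEAK_PIXELS_PER_CLOCK     # one 1-pixel triangle every ~6 clocks
print(f"{best:.2%}  {now:.2%}")           # 6.25%  1.04%
```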

I'm talking about triangles entering rasterisation.
So you're saying 50M triangles before culling? No game does anywhere near that, so what's the point of this strawman example?

It says that the maximum number of triangles in any draw call is 1.6 million, coming out of TS.
So? I'm talking about triangle counts per frame in Heaven. Why do you care about count per call?

All academic for an architecture that likes big triangles, I'm afraid. This is completely the wrong workload.
Again, missing the point. What I just showed you is that the performance hit is entirely due to the 3-cycle per tri hit of the tessellator. There is no evidence that inefficiency of the "rasterisation/fragment shading/back-end" makes tessellator improvements pointless. Your argument is completely bunk.

ATI can chop the performance hit of tessellation by a factor of six without major architectural changes. All they have to do is make the tessellator generate triangles faster and double the speed of the triangle setup/culling.
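Spelled out as a sketch, using the per-clock rates asserted earlier in the thread (none of these are official figures): TS at one triangle per 3 clocks, setup at roughly one per clock, rasterizers at roughly two per clock combined.

```python
# Sketch of the factor-of-six claim, using the rates asserted in this thread
# (none of these are official figures).
ts_rate     = 1 / 3   # tessellator: one triangle every 3 clocks
setup_rate  = 1.0     # setup/culling: ~3x the TS rate
raster_rate = 2.0     # two rasterizers combined: ~6x the TS rate

current = min(ts_rate, setup_rate, raster_rate)

# Speed up the tessellator and double setup, leaving the rasterizers alone:
improved = min(2.0, 2 * setup_rate, raster_rate)
print(improved / current)   # 6.0
```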
 
If you have one pixel triangles being fed to each rasterizer once every six clocks (on average), then each rasterizer will only produce a quad once every six clocks and sit idle 83% of the time. The thread will not be full until it gets 16 quads, which means one thread every 96 clocks.

Is this finally confirmed? That the rasterizer will always produce full threads, possibly composed of multiple triangles?
 
Is this finally confirmed? That the rasterizer will always produce full threads, possibly composed of multiple triangles?
I would be shocked if this wasn't the case, as it would be a colossal loss of efficiency to save barely any silicon. I know for sure that Xenos did.
 
If a pixel on a monitor corresponded with a "transistor", you'd need a grid of 22x35 2560x1600 monitors to see the transistors :oops: which would take 129 Eyefinity 6 cards to run :p

Jawed

I think you would need a few fewer screens, because every chip is composed of several layers containing transistors.

If Fermi is anything like a modern CPU, it will have between 9 and 11 layers.

Sorry for OT!
 
I think you would need a few fewer screens, because every chip is composed of several layers containing transistors.

If Fermi is anything like a modern CPU, it will have between 9 and 11 layers.

Sorry for OT!
Those are metal layers, not transistor layers. Wiring all those transistors to each other is not an easy thing...
 
I would be shocked if this wasn't the case, as it would be a colossal loss of efficiency
Whereas I think this is precisely the black hole that increased TS throughput would be feeding, thus pointless ;)

I suspect addressing this problem is amongst the things that got deleted from Cypress.

to save barely any silicon.
Really?

I know for sure that Xenos did.
Care to share the source for that info?

Jawed
 
Here's a curious question, why do different areas of the die have different colors? Is it false color, or diffraction? I would assume that it's diffraction, and that different areas of the chip might have different density of structures and/or regular vs irregular structure, but I'm curious if other factors could be involved, such as doping composition, metal layers, etc. The ultimate chip would be one that is simultaneously architecturally efficient, and produces beautiful die shots. :) Try optimizing for both!
 
If a pixel on a monitor corresponded with a "transistor", you'd need a grid of 22x35 2560x1600 monitors to see the transistors :oops: which would take 129 Eyefinity 6 cards to run :p

Jawed
Perfect use case for Carell's Holographic hemisphere. :smile:
 
Here's a curious question, why do different areas of the die have different colors? Is it false color, or diffraction? I would assume that it's diffraction, and that different areas of the chip might have different density of structures and/or regular vs irregular structure, but I'm curious if other factors could be involved, such as doping composition, metal layers, etc. The ultimate chip would be one that is simultaneously architecturally efficient, and produces beautiful die shots. :) Try optimizing for both!

My guess would be diffraction.
 
Here's a curious question, why do different areas of the die have different colors? Is it false color, or diffraction? I would assume that it's diffraction, and that different areas of the chip might have different density of structures and/or regular vs irregular structure, but I'm curious if other factors could be involved, such as doping composition, metal layers, etc. The ultimate chip would be one that is simultaneously architecturally efficient, and produces beautiful die shots. :) Try optimizing for both!

Take a bowl or bucket of water, put a thin coat of oil on it. Voila!
 