NVIDIA Fermi: Architecture discussion

Being a bit silly for a moment

Thnx. Seeing dies at such resolutions is an eye-opening experience.
If a pixel on a monitor corresponded with a "transistor", you'd need a grid of 22x35 2560x1600 monitors to see the transistors :oops: which would take 129 Eyefinity 6 cards to run :p
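For the curious, here's a quick back-of-envelope check of those figures; the ~3 billion transistor count for GF100 is an assumption on my part, not a confirmed number:

```python
# Back-of-envelope check of the monitor joke above. The ~3 billion transistor
# figure for GF100/Fermi is an assumption here.
import math

pixels_per_monitor = 2560 * 1600
monitors = 22 * 35                       # the 22x35 grid
print(monitors * pixels_per_monitor)     # 3,153,920,000 pixels, roughly one per transistor
print(math.ceil(monitors / 6))           # 129 Eyefinity 6 cards (6 displays each)
```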

Jawed
 
In general with lower tessellation factors and anisotropy, you'll get notably less than 2 triangles per extra vertex. Go count some tessellated patches' triangles and vertices if you don't believe me. I counted one tessellated patch with 146 triangles and 87 vertices, earlier ;)
That's because you're double counting the edge vertices. They get reused in adjacent patches. Okay, I'll agree that you can't cache the verts on every edge, so it'll be slightly lower than two, but that's it.
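Here's a rough sketch of why the ratio behaves that way, assuming a uniformly tessellated quad domain (a simplification of the real per-edge and inside factors):

```python
# Triangles vs. vertices for a uniformly tessellated quad patch (a
# simplification -- real patches use separate edge and inside factors).
def quad_patch_counts(n):
    tris = 2 * n * n             # each of the n*n cells splits into two triangles
    verts = (n + 1) * (n + 1)    # grid vertices, counting each edge vertex once
    return tris, verts

for n in (2, 4, 8, 16, 64):
    tris, verts = quad_patch_counts(n)
    print(f"factor {n}: {tris} tris, {verts} verts, {tris / verts:.2f} tris/vert")
# The ratio approaches 2 from below as the factor grows; at low factors, and
# when edge vertices can't be reused across adjacent patches, it drops well
# under 2 -- in the same ballpark as the 146-triangle / 87-vertex patch above.
```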

I don't remember hearing about face factor
Well, it's there: SV_InsideTessFactor. There's nothing weird about it. You're right, that image is using the same tessellation factor for all edges on a mesh, so it's not going to give you any idea of what fancy tessellation structures look like.

AMD says that the architecture's comfort zone bottoms out at 8 fragments per clock, in effect. It's not "my scenario", it's how the hardware works.
It has nothing to do with comfort zone. If you have tiny triangles, all samples fit in one quad most of the time. If you can only feed one triangle every three clocks, then you will only get one quad into the shading engine every three clocks.

The rasteriser is the bottleneck on hardware thread generation, I presume: a new hardware thread can be started once every 4 cycles per group of 10 SIMDs - and only one SIMD can start a hardware thread in any 4 cycle window, when setup is exporting 1-pixel triangles.

TS, in this scenario, is producing triangles every 3 cycles. So the SIMDs can't go any faster. They're starved by the huge granularity of rasterisation and thread generation, not by lack of triangles.
You're confusing threads with quads or doing something else. One thread every four clocks is the max you will ever need because a rasterizer can only generate 4 quads per clock at most and a thread has 16 quads in it. If you have one pixel triangles being fed to each rasterizer once every six clocks (on average), then each rasterizer will only produce a quad once every six clocks and sit idle 83% of the time. The thread will not be full until it gets 16 quads, which means one thread every 96 clocks.

The rasterizers can handle six times the triangle throughput of the TS before they become a bottleneck. Before that, though, setup will be a bottleneck, but even that can handle 3x the speed of the TS.
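To spell out the arithmetic, here's a minimal sketch, assuming a 64-pixel thread (16 quads) and a rasterizer peak of 4 quads per clock, as described above:

```python
# The thread-fill arithmetic from the post above, made explicit. Assumes a
# 64-pixel thread (16 quads) and a rasterizer peak of 4 quads per clock.
QUADS_PER_THREAD = 16          # 64 pixels / 4 pixels per quad
PEAK_QUADS_PER_CLOCK = 4

print(QUADS_PER_THREAD / PEAK_QUADS_PER_CLOCK)   # 4.0 clocks per thread, best case

# One-pixel triangles arriving every 6 clocks: one quad per 6 clocks, so
# 16 quads take 96 clocks, and the rasterizer is idle 5 clocks out of 6.
clocks_per_triangle = 6
print(QUADS_PER_THREAD * clocks_per_triangle)    # 96 clocks per thread
print(1 - 1 / clocks_per_triangle)               # ~0.83 idle fraction
```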

You're missing the point, this architecture is designed for big triangles.
You're missing the point. This rasterizer can run at 6.25% efficiency with single-pixel triangles instead of the ~1% it does now due to limitations of the tessellator and setup. That's a big difference.
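Where those percentages come from, assuming a rasterizer peak of 16 pixels per clock (my assumption here):

```python
# Where 6.25% and ~1% come from, assuming a rasterizer peak of 16 pixels/clock.
PEAK_PIXELS_PER_CLOCK = 16

best = 1 / PEAK_PIXELS_PER_CLOCK          # one 1-pixel triangle per clock
now = (1 / 6) / PEAK_PIXELS_PER_CLOCK     # one 1-pixel triangle every ~6 clocks
print(f"{best:.2%}  {now:.2%}")           # 6.25%  1.04%
```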

I'm talking about triangles entering rasterisation.
So you're saying 50M triangles before culling? No game does anywhere near that, so what's the point of this strawman example?

It says that the maximum number of triangles in any draw call is 1.6 million, coming out of TS.
So? I'm talking about triangle counts per frame in Heaven. Why do you care about count per call?

All academic for an architecture that likes big triangles, I'm afraid. This is completely the wrong workload.
Again, missing the point. What I just showed you is that the performance hit is entirely due to the 3-cycle per tri hit of the tessellator. There is no evidence that inefficiency of the "rasterisation/fragment shading/back-end" makes tessellator improvements pointless. Your argument is completely bunk.

ATI can chop the performance hit of tessellation by a factor of six without major architectural changes. All they have to do is make the tessellator generate triangles faster and double the speed of the triangle setup/culling.
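Spelled out as a sketch, using the per-clock rates asserted earlier in the thread (none of these are official figures): TS at one triangle per 3 clocks, setup at roughly one per clock, rasterizers at roughly two per clock combined.

```python
# Sketch of the factor-of-six claim, using the rates asserted in this thread
# (none of these are official figures).
ts_rate     = 1 / 3   # tessellator: one triangle every 3 clocks
setup_rate  = 1.0     # setup/culling: ~3x the TS rate
raster_rate = 2.0     # two rasterizers combined: ~6x the TS rate

current = min(ts_rate, setup_rate, raster_rate)

# Speed up the tessellator and double setup, leaving the rasterizers alone:
improved = min(2.0, 2 * setup_rate, raster_rate)
print(improved / current)   # 6.0
```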
 
If you have one pixel triangles being fed to each rasterizer once every six clocks (on average), then each rasterizer will only produce a quad once every six clocks and sit idle 83% of the time. The thread will not be full until it gets 16 quads, which means one thread every 96 clocks.

Is this finally confirmed? That the rasterizer will always produce full threads, possibly composed of multiple triangles?
 
Is this finally confirmed? That the rasterizer will always produce full threads, possibly composed of multiple triangles?
I would be shocked if this wasn't the case, as it would be a colossal loss of efficiency to save barely any silicon. I know for sure that Xenos did.
 
If a pixel on a monitor corresponded with a "transistor", you'd need a grid of 22x35 2560x1600 monitors to see the transistors :oops: which would take 129 Eyefinity 6 cards to run :p

Jawed

I think you would need a few fewer screens, because every chip is composed of several layers containing transistors.

If Fermi is anything like a modern CPU, it will have between 9 and 11 layers.

Sorry for OT!
 
I think you would need a few fewer screens, because every chip is composed of several layers containing transistors.

If Fermi is anything like a modern CPU, it will have between 9 and 11 layers.

Sorry for OT!
Those are metal layers, not transistor layers. Wiring all those transistors to each other is not an easy thing...
 
I would be shocked if this wasn't the case, as it would be a colossal loss of efficiency
Whereas I think this is precisely the black hole that increased TS throughput would be feeding, thus pointless ;)

I suspect addressing this problem is amongst the things that got deleted from Cypress.

to save barely any silicon.
Really?

I know for sure that Xenos did.
Care to share the source for that info?

Jawed
 
Here's a curious question, why do different areas of the die have different colors? Is it false color, or diffraction? I would assume that it's diffraction, and that different areas of the chip might have different density of structures and/or regular vs irregular structure, but I'm curious if other factors could be involved, such as doping composition, metal layers, etc. The ultimate chip would be one that is simultaneously architecturally efficient, and produces beautiful die shots. :) Try optimizing for both!
 
If a pixel on a monitor corresponded with a "transistor", you'd need a grid of 22x35 2560x1600 monitors to see the transistors :oops: which would take 129 Eyefinity 6 cards to run :p

Jawed
Perfect use case for Carell's Holographic hemisphere. :smile:
 
Here's a curious question, why do different areas of the die have different colors? Is it false color, or diffraction? I would assume that it's diffraction, and that different areas of the chip might have different density of structures and/or regular vs irregular structure, but I'm curious if other factors could be involved, such as doping composition, metal layers, etc. The ultimate chip would be one that is simultaneously architecturally efficient, and produces beautiful die shots. :) Try optimizing for both!

My guess would be diffraction.
 
Here's a curious question, why do different areas of the die have different colors? Is it false color, or diffraction? I would assume that it's diffraction, and that different areas of the chip might have different density of structures and/or regular vs irregular structure, but I'm curious if other factors could be involved, such as doping composition, metal layers, etc. The ultimate chip would be one that is simultaneously architecturally efficient, and produces beautiful die shots. :) Try optimizing for both!

Take a bowl or bucket of water, put a thin coat of oil on it. Voila!
 