NVIDIA G80: Architectural Overview


Geo
08-Nov-2006, 18:59
Four years and 400 million dollars in the making, NVIDIA G80 represents for the company their first brand new architecture with arguably no strong ties to anything they've ever built before. Almost entirely brand new as far as 3D functions are concerned, and designed as the flagship of their 8-series GeForce product line, their new architecture is squarely a D3D10 part but with serious D3D9 performance and image quality considerations.

Click here to read our architectural overview of NVIDIA's G80-based GeForce 8800GTX (http://www.beyond3d.com/reviews/nvidia/g80-arch/).

Geo
08-Nov-2006, 19:31
Well, I guess I'll pick this thread to say major kudos to NVIDIA for an awesome chip from all indications at this point. I could french David Kirk for that AF pattern, so if you see me coming David, better run!

It's got performance, it's got IQ, it's got a new AA, the price doesn't ratchet up previous highs which is not an inconsiderable good considering the memory size and chip size. It's got another 50% bus-width. Power-wise it's only a bit more than X19xx. Pinch me, I'm in love.

Okay, the board is bloody huge, and that's going to cause some heartaches. I truly hope folks will measure their own cases before hitting the order button to be sure.

And, there are some unknowns. We don't have a DX10 driver, so I think anyone who claims to know what that is going to show is blowing smoke. Having said that, one should not necessarily get bent out of shape by that at this point. R300 did not ship with a DX9 driver on launch day either.

I feel pretty certain this architecture is radically enough different that drivers are going to be a bit uneven for a while. That's the bad news. Good news is there should also be some headroom headed our way as well.

Anywho, wtg NV, I'd be happy to stick one of these bad boys in my PCIe slot.

LeStoffer
08-Nov-2006, 20:19
A most outstanding job! And thanks especially for testing the G80 for any architectural weaknesses - this is what beyond3d is for. :smile:

Andrew Lauritzen
08-Nov-2006, 20:35
Amazingly well-done article, as usual from B3D :) Can't wait for the other ones!

And I cannot thank NVIDIA enough for orthogonal filtering, etc. Good riddance to those ridiculous texture format/feature matrices!

BByte
08-Nov-2006, 20:39
Excellent article indeed, lots of info that would be impossible to find elsewhere. Going even beyond the previous strengths of the site was welcome and even a bit surprising. I'm sure the architecture article will remain a good reference on G80 in the future as well. I’ll certainly try to digest it more thoroughly later on.

Looking forward to the follow up articles as well, it seems G80 is a pretty mighty chip.

trinibwoy
08-Nov-2006, 20:39
Informative, thorough and upbeat overview !! Glad to hear that we'll be seeing cross-IHV comparisons and related thoughts from the B3D crew. Good job guys ! :smile: Can't wait for the rest to come.

Osamar
08-Nov-2006, 21:01
I had to re-read some parts two or three times, because the analysis is at too high a level for me :sad: . But this is the way to learn more :grin: .

Fabulantastic job!!!!!! :razz: :shock: Thank you all very much for your work.

[maven]
08-Nov-2006, 21:21
Thanks for the article, but there are a few bits where I would hope to offer some (constructive) criticism.
I like the introductory note about the way B3D "reviews" are changing, but I do think that the self-promotion uttered therein is definitely not necessary.
The times that I've read a variation of "most likely", "estimated guess", "assume" or "speculate" in this article makes me unhappy; it serves to show how complex that piece of hardware is, but I'd prefer less weasel-words, even if exact details aren't available.
Throwing the space-filling curves (http://www.maven.de/2005/12/hilbert-curve/) in there serves no purpose IMO, either you know what is happening or not. I'm not reading B3D for throwaway comments.

Nonetheless, thanks for an interesting article. :)

Arun
08-Nov-2006, 22:18
Maven: We do what we can with the time we have... :) Some valid points you got there definitely, although I do disagree with several of them overall. Anyway, I didn't want to discuss that in-depth too much publicly, so check your PMs!

Uttar

Mintmaster
08-Nov-2006, 22:52
Excellent article, thanks for doing as much digging as you did.

A few little things. Regarding the filtering:
"So that's 64 pixels per clock (ppc) of INT8 bilinear 2xAF, or 32ppc of FP16 bilinear, or 16ppc of FP16 2xAF, or 16ppc of FP32 bilinear per cycle"
Isn't it supposed to be 32 ppc for INT8 bilinear 2xAF? I assume you can only execute 4 texture instructions peak per cluster per clock. (Side question: the address/LOD and filtering units run at 575MHz, right?)

Also, what's up with the texture fillrate in the table? It seems like you did 16*575 = 9200 MTexels/s. Shouldn't it be at least 32 per clock and arguably 64? It is, after all, fetching and filtering 4 texels from each of 64 different memory locations per clock, even if they're really 32 pairs of related locations.
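For anyone who wants to check the arithmetic being argued over here, a quick sketch follows. The 575MHz clock and the 16, 32 and 64 per-clock counts are the figures from the post above; the helper name `fillrate_mtexels` is made up for illustration, and nothing here says which count is the "right" one for G80.

```python
# Texture fillrate arithmetic from the discussion above.
# The clock (575 MHz) and the per-clock counts (16, 32, 64) are the figures
# quoted in the thread; `fillrate_mtexels` is a hypothetical helper name.

CORE_CLOCK_MHZ = 575


def fillrate_mtexels(samples_per_clock: int, clock_mhz: int = CORE_CLOCK_MHZ) -> int:
    """Peak rate in MTexels/s for a given number of samples per clock."""
    return samples_per_clock * clock_mhz


if __name__ == "__main__":
    # 16 reproduces the table value; 32 and 64 are the alternatives proposed.
    for per_clock in (16, 32, 64):
        print(f"{per_clock:2d} per clock -> {fillrate_mtexels(per_clock)} MTexels/s")
```

The 16-per-clock case reproduces the 9200 MTexels/s in the table, so the disagreement is only about which per-clock count to plug in.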

Finally, a request: Can you do vertex fetch tests? I want to know if the texture latency is truly hidden in vertex work. A simple procedure would be to have x,y,z,s,t data, and run two tests. In one just add the texture coordinates to position before transforming, and in the other, sample the texture and add it to position before transforming. This way both tests have the same math as well as input/output sizes.

Or have you already given back the board? I know pocketmoon66 said that VTF speed is 10x faster than a 6600, but I was sort of expecting even more (around 100x).

Geo
08-Nov-2006, 22:58
The current content system has some peculiarities that caused some of the table values to be that way. We'll be addressing it in the future. Essentially, fill some values in and others are calculated, but if the architectural assumptions of the proggie are not true. . . well, let's just say that G80 is so revolutionary it broke our specs table data entry program. :lol:

Pete
08-Nov-2006, 23:04
Freakin' incredible! And great article, Rys. NV's been nothing but impressive since, well, everyone-knows-what, all the more so that they still appear to be on the ascent. I can't wait to see this beast on 65nm.

I'm impressed with the granularity of this USA. Not only is it down to scalar SPs, but apparently it can mix and match VS, GS, and PS right down to the SP, not just the clusters. And the AA and the AF! :eek: It's just incredibly fast, for all its advances. NV's got much to be proud of, and it's nice to read about it here in such detail.

Arun
08-Nov-2006, 23:07
"Isn't it supposed to be 32 ppc for INT8 bilinear 2xAF? I assume you can only execute 4 texture instructions peak per cluster per clock."
Oopsy, yup. Typos ftw. Rys is asleep now, so he'll correct that tomorrow... :)
"(Side question: the address/LOD and filtering units run at 575MHz, right?)"
Yup, you can also check the diagrams (either mine or Rys') to get a good idea of what's running at what clockrate.
"Also, what's up with the texture fillrate in the table?"
When anyone on staff refers to how much they'd kill to have the new content management system live already, it's things like that we're pointing at. That shit is semi-automated in the current framework, and in a really bad way to make things worse, so it's basically impossible to fix it without borking other stuff. Maybe it'll get fixed up eventually, but no promise, sadly. For now, if NVIDIA asks, just say that we loved their focus on FP32 so much that we decided to present FP32 filtering figures, okay? ;)
"Finally, a request: Can you do vertex fetch tests?"
Will do eventually. I just haven't had the time to get that damn VS testing framework to behave, sigh. And nope, we still got the board. Or rather, I never had one and will hopefully get one soon enough, and Rys still has his.
"I know pocketmoon66 said that VTF speed is 10x faster than a 6600, but I was sort of expecting even more (around 100x)."
I'm not sure, but I think that kind of speedup might be possible for something like RGBA FP32 unfiltered data that'd rarely miss the cache. The G7x's VS TMUs could be worse for such a thing, after all. Although even then, I'd rather have expected 20-25x.

We'll get around to testing all that stuff eventually, don't worry - just don't hope for an answer as soon as tomorrow though, hehe. And thanks for the kind words on the article - Rys did a kickass job on it imo, and I really can't insist on that enough considering how much of an annoying bastard I've been in giving him criticism for some of the earlier versions of it!


Uttar

Mintmaster
08-Nov-2006, 23:15
I see. So geo, was the 64 ppc thing a typo/mistake? Also, it seems like "pixels per clock" is a poor choice of words, but I can understand that it's a bit hard to describe it. How about texture instructions per clock?

By the way, how did you guys get the die size? Just asked nicely for it? It's freakin' huge, though. With the gobs of money flowing their way with the tiny G71 and G73, I never thought they'd go this route.

EDIT: Sorry, didn't see Uttar's post when I replied.

Geo
08-Nov-2006, 23:18
Uttar is indicating that one was a mistake. I was thinking of the texture rate, which was calculated and unable to be changed.

Mintmaster
08-Nov-2006, 23:30
Uttar and Rys, thanks for going so detailed into this stuff. VS numbers will be greatly appreciated when you get around to it.
"I'm not sure, but I think that kind of speedup might be possible for something like RGBA FP32 unfiltered data that'd rarely miss the cache. The G7x's VS TMUs could be worse for such a thing, after all. Although even then, I'd rather have expected 20-25x."
Well I think NV40 was reported to do ~20 million VTFs per second, and often slower. (Someone on these boards said he'd measured up to 200 cycles of latency.) It's not the cache that's the problem, it's the time in flight for vertices. They make a texture request, but it doesn't get back for hundreds of cycles. Pixels are in a huge FIFO (i.e. the register space) to handle this, but since so few programs use VTF and vertex processing needs a lot of data in flight, its FIFO is only big enough to cover the arithmetic logic latency.

Since G80 is unified, it could theoretically do 20 billion VTFs per second. So 100x is actually lowballing it. pocketmoon66's test could easily become limited by other things, though, once VTF gets a lot faster, so it doesn't prove that G80's VTF speed is 10x faster.
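To put rough numbers on that argument, here is a small back-of-envelope sketch. The 20 million and 20 billion rates and the 200-cycle latency are the figures quoted in the post; the one-request-per-clock issue rate and both function names are assumptions made purely for illustration, not measured hardware behaviour.

```python
# Back-of-envelope numbers for the VTF argument above.
# 20 million/s (NV40) and 20 billion/s (G80, theoretical) are the rates
# quoted in the post; 200 cycles is the latency measurement it mentions.
# Assumption: one texture request issued per clock, purely for illustration.

NV40_VTF_PER_SEC = 20e6
G80_VTF_PER_SEC = 20e9
LATENCY_CYCLES = 200


def speedup(new_rate: float, old_rate: float) -> float:
    """Ratio between two fetch rates."""
    return new_rate / old_rate


def in_flight_needed(issue_per_clock: float, latency_cycles: int) -> float:
    """Little's law: requests in flight = issue rate x latency."""
    return issue_per_clock * latency_cycles


if __name__ == "__main__":
    print("theoretical speedup:", speedup(G80_VTF_PER_SEC, NV40_VTF_PER_SEC))
    print("vertices in flight to hide latency:", in_flight_needed(1, LATENCY_CYCLES))
```

Under those assumptions the quoted rates differ by three orders of magnitude, and a vertex pipeline would need a couple of hundred vertices in flight to cover the latency, which is the point about the small vertex FIFO.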

Jawed
09-Nov-2006, 00:31
Yeah looks like a stonking article and I'm looking forward to reading it properly and the other stuff to come.

Great work guys, it actually seems like quite a geek party has been had, cross-webby and all. Bet you've had (are having) a ball!

Jawed

poopypoo
09-Nov-2006, 01:08
You'll see AMD and NVIDIA GPUs go toe-to-toe, directly in the same piece

goddamn that's going to look weird to me for a long time! :shock:

obviously I'm only on the first page, but woo, I'll hand out kudos just for that bombshell of an announcement! <3

Skrying
09-Nov-2006, 01:26
Very impressive, while some of it does fly right over my head I really like the specific testing done. Very nice!

Pete
09-Nov-2006, 02:36
I wasn't the only one to think about TMU ("data fetch and filtering" / "sampling") usage in the last speculation thread:

It's also interesting to think of G80 "losing" some of its impressive # of TMUs in DX9 games when a TCP is processing vertices. I mean, current DX9 games don't use the TMU much with vertex shaders, right?

But apparently each cluster can work on more than one object type per clock. So I guess that means the TMUs pretty much stay as used as the code calls for, rather than as the arbiter assigns (meaning, no overly-idle TMUs if an entire cluster chews thru DX9 VS code)?

BTW--and here I'm reaching even further beyond my understanding and almost certainly beyond usefulness--what's the instruction window for G80's "global thread scheduler?" I mean, does it reorder or group incoming commands to optimize SP/TMU usage, or do G80's scalar SPs and ability to work on multiple data types per cluster obviate the need for any reordering?

compres
09-Nov-2006, 02:47
I don't expect anything less from beyond3d. One of the reasons I am still around here, even though I mostly lurk lol.

Jawed
09-Nov-2006, 03:12
Does Arun Demeure post on B3D, by the way?

Jawed

Geo
09-Nov-2006, 03:34
Does Arun Demeure post on B3D, by the way?


= Uttar

CMAN
09-Nov-2006, 04:04
Really good article! I browsed over it at work, but was finally able to sit down and read it at home.

While I never felt there could be more in depth analysis at B3D, I think the new team has done it. I will admit it felt a little strange reading the article's "different" style at first. After reading it a second time just now, I like it a lot. I also like how you connect the dots a little closer in the new style. Keep up the great work in the next articles. I assume they'll be out tomorrow morning. :twisted:

Jakob
09-Nov-2006, 07:36
Great article. I know the amount of work that must have gone into it. You guys rock!

1. Texture fill rate should be 32*575 == 18400M texels/sec, not 9200 as listed.

2. For setup rate, it would be useful to measure clocks/tri for front facing, back facing, and zero area, separately.

3. Where are the vertex performance measures? There should be a HUGE performance improvement in vertex shader perf due to unification. Draw a million zero area triangles with a complex program that affects the color of the vertex. For example 100 muls feeding into the VS color out.

4. Similarly what about vertex attribute fetch from vertex buffers. Is this being done thru texture units, or a separate unit? What's the perf?

5. In general I would like to see more details about the results of your custom shader programs. (We ran this shader and got this many gpix/sec, then we added another mul instruction here, and got.... ) This goes for branching perf too...

Overall, woohoo to NV and B3D! :)

Unknown Soldier
09-Nov-2006, 09:02
Just started reading. Just something I've picked up.

"NVIDIA realising that rear-exit would definitely cause GeForce 8800 GTX to not fit in ever more cases than it's able to in the shipping configuration."

Page 3 below the pics, wording is not right. Might want to look into it. ;)

US

_xxx_
09-Nov-2006, 09:14
Wow, lots of stuff happened in the week I was AWOL. Great article, thanks guys! Now I only need some time to chew through all the new info...

nutball
09-Nov-2006, 10:41
Man, they really blew the doors off with this new architecture didn't they? The thing that strikes me is the orthogonality of the filtering & AA features. Awesome. It's great to see such a revolutionary change in architecture from NVIDIA, keeps things interesting!

I like the idea of the proposed new structure to B3D articles too. Can't wait for the IQ assessment.

PeterAce
09-Nov-2006, 11:42
Fantastically great article.

It answered lots of the floating niggles about G80 that really have buzzed around (in my head) in the last few weeks ;)

Extra big thanks to Rys and Uttar and the rest of the B3D Team.

Looking forward to the next installments.

fellix
09-Nov-2006, 12:53
As for VTF performance, 3DM'06 particle test is showing some considerable boost -- near eight times over the G71's dedicated vertex fetch units:

http://www.techreport.com/reviews/2006q4/geforce-8800/gpu-3dm-particles.gif
Although, it may not look so impressive compared to the theoretical numbers expected.

Evildeus
09-Nov-2006, 15:38
Great article!

Arun
09-Nov-2006, 18:40
"how did you guys get the die size? Just asked nicely for it? It's freakin' huge, though. With the gobs of money flowing their way with the tiny G71 and G73, I never thought they'd go this route."
Forgot to reply to that one earlier, so here we go. A bunch of calculations based on 118 dies/wafer can give you that ballpark, but obviously the easiest way to get an accurate figure is with a high-res wafer shot. We've got one somewhere I think, but it's not uploaded, so I'll just link to another site's wafer shot: http://www.hothardware.com/image_popup.cfm?image=big_g80_wafer.jpg&articleid=903&t=a

Calculating a bit on the basis of a 300mm wafer diameter gives you 21.3*22.4mm or so, which is just short of 480mm². I also did my best not to include the spacing between the dies that's visible in there, since obviously it isn't really part of the chip. But it'd become even more ridiculously huge if I included it in there. Hopefully that kind of precision is good enough for you ;)
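As a rough cross-check of those two numbers, the sketch below multiplies out the die dimensions and then runs them through a commonly used gross-dies-per-wafer approximation. The 21.3 x 22.4mm die and the 300mm wafer are the figures from the post; the formula is a generic textbook estimate and an assumption here, not necessarily the calculation that was actually done, and it ignores scribe lines and edge exclusion.

```python
import math

# Die-size arithmetic from the post above. The 21.3 x 22.4 mm die and the
# 300 mm wafer are the post's figures. The dies-per-wafer formula is a
# common approximation (an assumption here, not the poster's own method).

DIE_W_MM, DIE_H_MM = 21.3, 22.4
WAFER_DIAMETER_MM = 300.0


def die_area_mm2(width_mm: float, height_mm: float) -> float:
    """Area of a rectangular die."""
    return width_mm * height_mm


def gross_dies_per_wafer(diameter_mm: float, area_mm2: float) -> float:
    """Wafer area over die area, minus a correction for partial edge dies."""
    wafer_area = math.pi * (diameter_mm / 2) ** 2
    edge_loss = math.pi * diameter_mm / math.sqrt(2 * area_mm2)
    return wafer_area / area_mm2 - edge_loss


if __name__ == "__main__":
    area = die_area_mm2(DIE_W_MM, DIE_H_MM)
    print(f"die area: {area:.1f} mm^2")
    print(f"gross dies per wafer: {gross_dies_per_wafer(WAFER_DIAMETER_MM, area):.1f}")
```

With these inputs the estimate lands within a die or so of the 118 dies/wafer figure, so the two numbers in the post are at least consistent with each other.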


Uttar

Quitch
10-Nov-2006, 01:21
As a layman, let me just request that you always link from inside the article (at the end) to the thread.

Geo
10-Nov-2006, 01:23
As a layman, let me just request that you always link from inside the article (at the end) to the thread.

You mean, like, say:

Find our discussion thread on G80 now it's officially public knowledge (and what a complete joke that's been), here.

And that "here" being. . .well. . .here?

Quitch
10-Nov-2006, 01:25
Guess my eyes glazed over towards the end :)

Pete
10-Nov-2006, 02:07
I guess this belongs in the G90 thread, but no way I'm opening that right now. The NYT reported (http://www.nytimes.com/2006/11/09/technology/09chip.html) that NV's moving to FP64 with their next gen. No big surprise, I guess, given D3D 10.1's ">FP32," but given that FP32 is so new that it hasn't been used in games yet, I thought it was interesting.

And the next generation of the 8800, scheduled to arrive in about a year, will have “double precision” mathematical capabilities that will make it a more direct competitor to today’s supercomputers for many applications.

Dunno if it hints at future bandwidth or math:otherstuff ratios. I guess it'd be somewhat interesting to compare R300 to G80 just to see how much individual aspects of GPUs have multiplied (transistors, shader/texture/ROP power, bus width, bandwidth).

Jawed
10-Nov-2006, 02:35
Gotta feel sorry for double-precision Cell at the prospect of "G90".

It's also entertaining to compare the GFLOPs in G80 alone against PS3's Cell+RSX. And, ahem, PS3 launches in a week...

Jawed

DudeMiester
10-Nov-2006, 03:38
Excellent article apart from an odd grammatical thing here and there. I have a few questions and comments though:

A) If you could elaborate on what "granularity" means in terms of the threading/shader core. As I understand, it's the number of pixels/vertices/objects assigned per thread. However, if this is say 32 pixels, then that would more than consume the available resources of a processor unit (16 SP/16 Special). They can't be using the SPs and the Specials at the same time, as that would imply the calculation of a different instruction and thus separate threads. This is unless the G80 also supports OoO execution, which I highly doubt. Although with compiler guidance ala a VLIW architecture, this may not be impossible, and note that NV30 was a VLIW design.

Given that there are 4 addressing/sampling units in each processor, imho it would make more sense to have groupings of 8 objects each regardless of thread type, which is 2 quads for pixels. Thus, each thread has a dedicated address unit, 2 occupy the SPs and 2 occupy the Specials. Of course, this is with further threads occupying the other stages of each unit.

Then again, I may be mis-interpreting the entire point, and please explain if I am.

B) You mentioned in your article that it's unclear if the G80 has dedicated constant buffers. Imho, I think it's only natural that CBs be treated as 1D textures, as such textures are already used in this manner today. Having 1D textures be treated the same as CBs, gaining caching priority, would thus make both DX9 and DX10 games performant.

C) Finally, given that NVIO is off die and the strong focus on IQ in this architecture, I have to wonder if the chip has its own framebuffer cache. This would transparently enable triple buffering, making vsync obsolete. I have a feeling this is not the case, but it would be nice. At the very least, I imagine it will streamline the performance of it.

PS: I'm seeing a lot of marketing-speak about the shader core's supposed 100% utilisation rate. Any truth to this?

Rys
10-Nov-2006, 08:52
Excellent article apart from an odd grammatical thing here and there. I have a few questions and comments though:
Not my best work, writing wise. But that's what I get for setting myself up to write it very much last minute.

a) The simple answer is you don't always need interpolated values as input to your shader, and even if you do, sometimes it's not needed until late enough in the shader so as to hide calculation latency. But of course it's possible to be limited by pre-shading setup on the chip, depending on what you're doing.

b) Quite possibly, and CB storage just becomes a reuse of L1. I'm leaning more towards this than I was a couple of days ago, and I should find out either way quite soon.

c) Well, given NVIO appears to be in the region of 30-40M transistors, it's not outside the realms of possibility, but I think it's likely NOT the case.

PS: Currently you can get close to peak throughput in terms of base instruction issue via simple shaders, so while it's not the scientific truth, it's possible to make the chip work very hard indeed. The real test comes when we start to profile shipping shaders in games to see what happens there.

dnavas
10-Nov-2006, 23:07
I note that in their tech docs, Nvidia grouped the 16 stream procs inside the clusters into groups of 8. Any idea why they might have done that?

Arun
10-Nov-2006, 23:29
"I note that in their tech docs, Nvidia grouped the 16 stream procs inside the clusters into groups of 8. Any idea why they might have done that?"
Pretty diagrams are better diagrams. Note that this only applies in this specific instance, and does not imply that the Rysgram is better than the Uttargram. Right...? :( I really like the artistic and colourful feel of Rys' diagram though, and it beats NVIDIA's any day and in mostly every way imo. And unlike my shit, it's readable and/or understandable, at least! :o

Uttar

dnavas
11-Nov-2006, 00:04
Discretion is the better part of valor. :lol:

It's just that 8s seem to exist all over this arch -- 8 TMUs, 8 clusters, batch size of 8 quads (but only when you're dealing with quads, it's only 16 for vertices). So, when I see 16 units, which seems like a fine thing in a 4x4 grid, separated into not 4 quads, but two sets of 8s, well, it makes me wonder why, y'know? What is it about 8, other than it's a lucky Chinese number....?

As you say, it could be just a gfx artist with too much time on their hands wanting to draw extra boxes.

On the other hand, if you only have 8 TMUs, do you really want 16 procs coming at them requesting data on a single cycle, or do you run two quarter batches across the 16 procs and dedicate the TMUs to one or the other half? It probably doesn't matter much either way to final performance, just a curiosity....

-Dave

Geo
11-Nov-2006, 11:51
Discretion is the better part of valor. :lol:

It's just that 8s seem to exist all over this arch -- 8 TMUs, 8 clusters, batch size of 8 quads (but only when you're dealing with quads, it's only 16 for vertices). So, when I see 16 units, which seems like a fine thing in a 4x4 grid, separated into not 4 quads, but two sets of 8s, well, it makes me wonder why, y'know? What is it about 8, other than it's a lucky Chinese number....?

As you say, it could be just a gfx artist with too much time on their hands wanting to draw extra boxes.



It's not that we didn't notice and didn't ask --NV was just a bit bashful on some subjects in reply! So the assembled community will have to extract these details bit by bit in the days to come.

DemoCoder
13-Nov-2006, 10:07
Maybe they are grouped like that for a reason, I would not presume the diagram to be arbitrary.

Rodric
13-Nov-2006, 14:26
Page 3, last sentence just before the last paragraph :
"The board will shout at you furiously if you don't connect two, however, should you decide you think you can get away with just the one."

I would think there's an error here.