The Official NVIDIA G80 Architecture Thread

Wow, it's amazing that we call those dies "big". We've really gotten to a point where we take semiconductor tech for granted given how much it does with so little :)
 
Go ahead, grandpa -- give us a history lesson. ;) So far as I can tell, 200-300 mm^2 has not been all that historically rare... but much above 300 mm^2 is. Though I understand that ITRS has projected up to 800 mm^2 by 2008 and Intel has talked about 450 mm wafers.
 
Well, honestly, I can't remember anything bigger than G80 to date -- even the Tulsa die area is a notch under Graphzilla's monster.
 
Go ahead, grandpa -- give us a history lesson. ;) So far as I can tell, 200-300 mm^2 has not been all that historically rare... but much above 300 mm^2 is. Though I understand that ITRS has projected up to 800 mm^2 by 2008 and Intel has talked about 450 mm wafers.

That comment wasn't based on relative die sizes! :) I'm just talking in absolute terms - I doubt my grandmother could fathom that it's possible to do what these things do in 400 or even 1000mm^2.
 
Some questions I've been wanting to ask...to anyone who can answer

In G80 there are 8 clusters, and each cluster seems to have two shader arrays (eight scalar ALUs in each array).

Question 1. Can each array have a different thread type (VP, PS, GS) assigned to it?

Question 2. Will it make more sense to increase the number of clusters in a future high-end revision of G8x (keeping the current configuration of 2 shader arrays per cluster) OR to add more '8 ALU arrays' into each cluster?

Question 3. Is there a tight dependency between the number of arrays per cluster and the Texture Address/Filter units?

G80's ALU-to-Texture ratio was a bit of a surprise at 2:1.

Question 4. Do you think that future high-end revisions of G8x will increase this ratio in favor of more ALUs?
 
G80's ALU-to-Texture ratio was a bit of a surprise at 2:1.
You may want to look at FLOPs vs. bilinear samples/clock as a better metric for comparison.

128 units * 1350 MHz * 3 ops/clock / 64 bilinears / 575 MHz = ~14.1 scalars/bilinear, or a vector ratio of ~3.5:1.

If you ignore the extra MUL/SFU hardware, the ratio becomes closer to 2.5:1.
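
For anyone who wants to replay that arithmetic, here's a minimal Python sketch. It only uses the figures quoted above (128 scalar ALUs at 1350 MHz, 64 bilinear samples per clock at 575 MHz); the function name `scalars_per_bilinear` is made up for illustration, and dividing by 4 to get a "vector" ratio assumes a 4-wide vector as the unit of comparison.

```python
def scalars_per_bilinear(alus, alu_mhz, ops_per_clock, bilinears, tex_mhz):
    """Scalar ALU ops per second divided by bilinear samples per second."""
    return (alus * alu_mhz * ops_per_clock) / (bilinears * tex_mhz)


# 3 ops/clock counts the MAD (2 flops) plus the extra MUL; 2 ops/clock is MAD only.
with_mul = scalars_per_bilinear(128, 1350, 3, 64, 575)
mad_only = scalars_per_bilinear(128, 1350, 2, 64, 575)

# Assumption: a "vector" op is 4 scalar ops wide.
print(f"with MUL/SFU: {with_mul:.1f} scalars/bilinear, {with_mul / 4:.2f}:1 vector")
print(f"MAD only:     {mad_only:.1f} scalars/bilinear, {mad_only / 4:.2f}:1 vector")
```

With these inputs it comes out at about 14.1 scalars per bilinear (roughly 3.5:1) with the MUL/SFU counted, and about 9.4 (roughly 2.35:1) without.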
 
Whoops, I forgot about MUL/SFU! My rough value came to ~2.34:1.

You're right, it's much closer to a 3.5:1 ratio. Thanks Bob.

nAo said:
how do you know that?

This is from the marketing diagrams (I know, it's very bad to take them literally), plus the fact that the VS thread/batch size is 16 (a single 8-way ALU array over 2 clocks) and the pixel batch size is 32 (a single 8-way ALU array over 4 clocks). Unless of course each cluster is 16-way, which is what I originally thought, but several posts in this thread made me think otherwise... now I'm just confused :).
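
The batch-size reasoning above boils down to batch size = array width x clocks. Here's a tiny Python sketch of it; the function name is hypothetical, and the 8-wide and 16-wide figures are just the two readings of the diagrams being debated in this thread, not confirmed hardware details.

```python
def clocks_per_batch(batch_size, array_width):
    """Clocks needed to push one batch through a SIMD array of the given width."""
    if batch_size % array_width:
        raise ValueError("batch size must be a multiple of the array width")
    return batch_size // array_width


# Reading 1: each array is 8 scalar ALUs wide.
vertex_8 = clocks_per_batch(16, 8)   # vertex batch of 16
pixel_8 = clocks_per_batch(32, 8)    # pixel batch of 32

# Reading 2: each cluster behaves as a single 16-wide array.
vertex_16 = clocks_per_batch(16, 16)
pixel_16 = clocks_per_batch(32, 16)

print(f"8-wide:  vertex {vertex_8} clocks, pixel {pixel_8} clocks")
print(f"16-wide: vertex {vertex_16} clocks, pixel {pixel_16} clocks")
```

Both readings divide the quoted batch sizes evenly, so the batch sizes alone don't settle whether the hardware is 8-way or 16-way.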
 
128 units * 1350 MHz * 3 ops/clock / 64 bilinears / 575 MHz = ~14.1 scalars/bilinear, or a vector ratio of ~3.5:1.
If you ignore the extra MUL/SFU hardware, the ratio becomes closer to 2.5:1.
Arguably, you could say the following, too, but you may also claim it isn't completely fair:
R520: 16 units * 650 MHz * 3 ops/clock / 16 bilinears / 650 MHz = 3:1
R580: 48 units * 650 MHz * 3 ops/clock / 16 bilinears / 650 MHz = 9:1

Obviously, the biggest problem with these numbers is that these units aren't quite equivalent to four of G80's scalar units. To simplify the comparison, first, let us not take into consideration the MUL/SFU units, as ATI also has a dedicated MUL for perspective correction, and some dedicated hardware for SFU (I'm not sure how it compares to NVIDIA's, so let us assume it is roughly identical).

Secondly, ATI's units are Vec3+Scalar, which is obviously less efficient than four purely scalar units. In the worst possible case, it's half as efficient; on average, it's certainly not quite that bad. Furthermore, the R580 also has extra ADD units. They aren't always usable and/or exposed, but they still are far from dormant.

So, let's be really generous and say that for advanced workloads (which, you could argue, might have more scalar ops), four scalar units in NVIDIA's architecture have a 20% higher average effective throughput than one arithmetic pipeline in R5xx. ATI would say it's lower, NVIDIA would say it's higher, so let's keep that as a reasonable guesstimate, shall we? For G71, I'd say 2:1 is also a fair estimate, but that's extremely subjective; given the various inefficiencies compared to R520, I'd be tempted to say it's slightly lower than that, but then again its theoretical peak is slightly higher too...

Anyway, we have the following (quite subjective and approximate, obviously) numbers:
G80: 1.2*128/4 R5xx-equivalent units * 1350 MHz * 2 ops/clock / 64 bilinears / 575 MHz = 3:1
R520: 16 units * 650 MHz * 2 ops/clock / 16 bilinears / 650 MHz = 2:1
R580: 48 units * 650 MHz * 2 ops/clock / 16 bilinears / 650 MHz = 6:1
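
Here's a rough Python sketch of the same normalized comparison, for anyone who wants to plug in their own guesstimates. The 1.2 efficiency factor is the subjective 20% figure from above, the names are made up for illustration, and the 48-unit line is taken to be R580 (as in the earlier 3 ops/clock list).

```python
def alu_tex_ratio(units, alu_mhz, ops_per_clock, bilinears, tex_mhz):
    """ALU ops per bilinear sample, in R5xx-equivalent pipeline ops."""
    return units * alu_mhz * ops_per_clock / (bilinears * tex_mhz)


SCALAR_BONUS = 1.2                   # guesstimate: 4 G80 scalars vs. one R5xx pipe
g80_units = SCALAR_BONUS * 128 / 4   # 38.4 R5xx-equivalent units

# (units, ALU MHz, ops/clock, bilinears/clock, texture MHz); MUL/SFU ignored.
chips = {
    "G80": (g80_units, 1350, 2, 64, 575),
    "R520": (16, 650, 2, 16, 650),
    "R580": (48, 650, 2, 16, 650),
}

ratios = {name: alu_tex_ratio(*spec) for name, spec in chips.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1")
```

G80 works out to roughly 2.8:1 with these inputs, which is the "3:1" above after rounding. Changing `SCALAR_BONUS` shows how sensitive the result is to the efficiency guess.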

And feel free to say that was an exercise in futility, as it remains highly approximate, but I think it clearly illustrates the point that G80 has an ALU ratio between those of R520 and R580, and how far it is from R580's depends on the latter's efficiency for the given workload. Obviously, in the future, there will be some room to grow G8x's ALU ratio IMO, but it remains to be seen by how much and in what timeframe NVIDIA thinks this will be necessary.

For G84 and/or G86, an easy way to increase the ratio would be to get rid of the extra bilinear unit per addresser. You would expect the low-end part to be tested less frequently with heavy anisotropic filtering than the high-end ones, so such a compromise would make sense. Another idea would be to make a MADD out of the current MUL, but it's hard to say how expensive that would be. Or, they could do neither, or reserve that for future parts, heh. Who knows at this point :)


Uttar
 
Thanks Uttar, your post underlined some of the formative thoughts I had before I posted about ratios earlier.

It's the extra bilinear unit per texture address unit that was the surprise, as I expected 32 address + 32 bilinear (which would have made for a higher ALU-to-Texture ratio).

Not that I'm complaining, I'm over the moon about G80 (and its pleasant surprises :)).
 
I've got a noob question that I can't work out:

What I don't get with the NV architecture is how you can increase the clusters without increasing the bus width, as I thought each cluster was implicitly linked to a 64-bit memory bus. 8 clusters?

So how do you go about adding extra units/clusters/rops without increasing the bittage of the memory bus?
 
Uttar - another easy way to tweak the ratio would be to just adjust the clock of the ALU domain (if they have a lot of frequency headroom on those). That seems like a potentially big plus for the dual clock domain approach. With a single clock domain, the ratio is much more set in stone.
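
To put rough numbers on that, here's a small Python sketch that holds the texture clock at 575 MHz and varies only the ALU-domain clock. The unit counts are the ones quoted earlier in the thread; the helper names and the 4:1 target are purely hypothetical and say nothing about actual frequency headroom.

```python
# Figures quoted earlier in the thread; everything else is illustrative.
ALUS = 128          # scalar ALUs
OPS_PER_CLOCK = 3   # MAD (2) + MUL (1)
BILINEARS = 64      # bilinear samples per texture clock
TEX_MHZ = 575       # core/texture clock
VECTOR_WIDTH = 4    # assumption: compare against 4-wide vector units


def vector_ratio(alu_mhz):
    """Vector ALU ops per bilinear sample at a given ALU-domain clock."""
    scalars = ALUS * alu_mhz * OPS_PER_CLOCK / (BILINEARS * TEX_MHZ)
    return scalars / VECTOR_WIDTH


def alu_clock_for(target_ratio):
    """ALU-domain clock (MHz) needed to reach a target vector ratio."""
    return target_ratio * VECTOR_WIDTH * BILINEARS * TEX_MHZ / (ALUS * OPS_PER_CLOCK)


shipping = vector_ratio(1350)
needed = alu_clock_for(4.0)  # hypothetical 4:1 target

print(f"at 1350 MHz: {shipping:.2f}:1")
print(f"4:1 would need about {needed:.0f} MHz on the ALU domain")
```

The ratio scales linearly with the ALU clock, so every extra 1% of shader clock is an extra 1% of ALU:TEX ratio without touching the unit counts.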
 
I've got a noob question that I can't work out:

What I don't get with the NV architecture is how you can increase the clusters without increasing the bus width, as I thought each cluster was implicitly linked to a 64-bit memory bus. 8 clusters?

So how do you go about adding extra units/clusters/rops without increasing the bittage of the memory bus?

Which clusters do you mean? Nvidia's marketing diagrams show 8 ALU+TEX clusters (do we have a good name for these?), connected by a crossbar to 6 ROP clusters. Each ROP cluster is directly connected to a 64-bit memory channel. So the ratio of ROPs to bus width is fixed, but the ratio of ALU/TEX horsepower to bus width can vary.

This is fairly similar to how NV4x/G7x worked, except that those are apparently limited to power-of-two numbers of ROPs (at least, Nvidia never shipped any 3 ROP/192-bit parts). G80 is clearly more flexible in that respect.
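
As a quick illustration of that fixed ROP-to-bus relationship, here's a minimal Python sketch. It assumes one 64-bit channel per ROP cluster as described above; the function name and the upper limit of 8 ROP clusters are made up just to show the range of possible widths.

```python
CHANNEL_BITS = 64  # one memory channel per ROP cluster, per the diagrams


def bus_width(rop_clusters):
    """Total memory bus width in bits for a given number of ROP clusters."""
    return rop_clusters * CHANNEL_BITS


g80_bus = bus_width(6)

# Power-of-two ROP counts only, as apparently was the case for NV4x/G7x.
pow2_widths = [bus_width(n) for n in (1, 2, 4)]

# Any ROP count from 1 to 8 (the upper bound is hypothetical).
flexible_widths = [bus_width(n) for n in range(1, 9)]

print(f"G80: {g80_bus}-bit")
print(f"power-of-two only: {pow2_widths}")
print(f"any count: {flexible_widths}")
```

Note that the number of ALU+TEX clusters never appears in the calculation: it sits on the other side of the crossbar, which is why it can scale independently of the bus.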
 
Uttar said:
I think it clearly illustrates the point that G80 has an ALU ratio between those of R520 and R580, and how far it is from R580's depends on the latter's efficiency for the given workload.
R580 also has ~1/4 of G80's bilinear filtering horsepower, but about as much ALU power (theoretically, ignoring the G80 Missing MUL). I would really hope its shader:texture ratio is high!

Uttar said:
For G84 and/or G86, an easy way to increase the ratio would be to get rid of the extra bilinear unit per addresser. You would expect the low-end part to be tested less frequently with heavy anisotropic filtering than the high-end ones, so such a compromise would make sense.
True. However, lower-end parts are also tested at a lower resolution, which tends to increase the required amount of aniso.

Uttar said:
Another idea would be to make a MADD out of the current MUL, but it's hard to say how expensive that would be. Or, they could do neither, or reserve that for future parts, heh. Who knows at this point
The problem with going with a MAD (instead of a MUL) is that you need extra RF ports, wiring logic, routing complexity, buffering, scheduling, etc. The big cost is not really the 'add' itself. The ADD hardware would also be idle when not running a MAD-heavy shader (which would be most of them).

Now, you can point at G70 and say that MUL was turned into MAD there, so why shouldn't all GPUs go this route? Well, different GPUs have different design points and different architectures. What makes sense in one does not necessarily make sense in another; the benefit of having a MAD in G70 is much greater than that of turning G80's MUL into a MAD (at least in terms of performance per area). Otherwise, wouldn't it already have been done?
 
Question 2. Will it make more sense to increase the number of clusters in a future high-end revision of G8x (keeping the current configuration of 2 shader arrays per cluster) OR to add more '8 ALU arrays' into each cluster?

Both (not necessarily at the same time).

Question 3. Is there a tight dependency between the number of arrays per cluster and the Texture Address/Filter units?

We don't know. My guess is no (see below).

Question 4. Do you think that future high-end revisions of G8x will increase this ratio in favor of more ALUs?

Yes, if apps move in that direction.

It seems to me that beginning with NV4x, Nvidia has been obsessed with scalability. This was probably a reaction to the NV3x family: when your high end part has a single quad pipe, how do you scale down? You have to redesign the shader core making different perf/area tradeoffs. That's a lot more effort than just dropping one of several identical units. Even ignoring all the other faults of the architecture, I think this hurt them a lot in the mid-range and low end markets.

In NV4x/G7x, they've shipped parts with 1-8 vertex units (IIRC), 1-6 quad-pipes, and (1,2,4) ROPs. They've changed the ratios between all three of those, so they apparently can pick these numbers independently. On top of that they have multiple clock domains. All of this adds up to tremendous flexibility and a huge space of possible chips they could build in response to customer demand or what ATI put up against them. And they could do it relatively quickly.

I expect G80 to be more of the same. Personally, I think that's one of the most important aspects of the 384-bit bus: it means they're no longer tied to a power-of-two bus width. The difference between 256 bits and 512 bits is huge; having an architecture that lets you hit several intermediate points between could give you a big advantage against a competitor who can't. Obviously we don't know yet, but given their obsession with flexibility, I'd be surprised if they can't fine-tune the ALU:TEX ratio by changing the number of SIMD arrays within each cluster.
 
R580 also has ~1/4 of G80's bilinear filtering horsepower, but about as much ALU power (theoretically, ignoring the G80 Missing MUL). I would really hope its shader:texture ratio is high!
Heh, yeah :) Part of the reason it's lower is that the dies are 480 and 350 mm^2 respectively, but that's only a 27% difference, not a 75% one. I guess from another point of view, you could just say NVIDIA managed to fit in a lot of texture filtering units given the die size. Heck, the increase is even a bit higher than the die size increase between G71 and G80, which certainly is quite impressive.

True. However, lower-end parts are also tested at a lower resolution, which tends to increase the required amount of aniso.
Hmm, good point, I hadn't properly considered that. Now, even if you only activated trilinear filtering and not anisotropic filtering, you'll still get less magnification than at a higher resolution.

The ADD hardware would also be idle when not running a MAD-heavy shader (which would be most of them).
One way to handle that would be to add an ADD but only the necessary logic to use either the MUL or the ADD. Either way, that probably wouldn't add much performance, and it most likely wouldn't even be worth the transistors - although it would obviously be quite hard for me to know that for sure, heh :)


Uttar
 