Education: Why is Performance per Millimeter a worthwhile metric?

Albuquerque

I started a slightly tongue-in-cheek rant in the Kepler thread, and more than a few people wanted to respond. Some were more snarky in their response than others, so I figured this was a good time for a new thread.

Some stuff that I wanted to respond to, but didn't because it didn't belong in that thread:
It has a direct bearing on absolute performance when competitors have considerably different die-sizes.
Actually, no. Performance per millimeter IS absolute performance in the face of different die sizes. I'd like to see an example of any vendor targeting a specific performance per millimeter metric, and then allocating a specific number of millimeters to ensure they meet their performance requirement.

Here's a rhetorical hint: they don't do that. Yes, I know you know this; I'm not trying to call you stupid. But that quote doesn't answer anything in terms of WHY this mystical perf / mm^2 is truly important.

What's interesting this time around is that perf/mm2 seems to be very close between the 2 parties, after correcting for compute features of Tahiti.
Does it? Are double-precision FLOPS the same? Are integer ops the same? Is filtering capacity the same? Primitive rate? Tessellation rate? Raster rate? Total megahertz? Texture addressing rate? SQRT rate?

I know what you're going to say, and you're right -- those are functions of the individual computation units. But that's the crux of this whole problem I have with Perf / mm^2... The performance metric is abstract at absolute best; increasing clock speed will skew it. Having a "hot clock" will also skew it, along with "boost" clocks and whatever other timing tomfoolery. One architecture might wallop another in DP flops, but might suck the proverbial canal water when compared in integer ops or texture filtering rate.

This perf/mm^2 metric has NO standard unit; it has no direct bearing on game performance, it has no direct bearing on GPGPU performance, it is (in my opinion) purely a speculative number based on other speculated numbers. It's like we're abstracting the abstract to come up with something even more ethereal.

I get that you all want to know how many FLOPS you can cram into a square millimeter, but that FLOPS number has zero bearing on any reality ANYWHERE. I can't take a flops number and translate it into frames per second on a game, or GPGPU workload, or anything tangible. Which IMO makes it utterly worthless; it's an imaginary number for an imaginary metric.

Please FIRST help me understand how to properly measure perf/mm^2 in a way that makes sense, and then please show me how that applies to anything in real life.

Thanks in advance,
-ABQ
 
Please FIRST help me understand how to properly measure perf/mm^2 in a way that makes sense, and then please show me how that applies to anything in real life.
Select workloads of interest for the intended audience for a product, and then take a weighted average across these workloads.

Think SPECint/SPECfp for CPUs.

EDIT: NOT synthetic (micro)benchmarks.
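
As a rough sketch of what such a composite could look like (every workload, FPS figure, and die area below is invented purely for illustration):

import math

# Hypothetical per-workload FPS for two chips and the weight each workload
# gets in the composite score; every number here is made up.
workloads = [
    # (weight, fps_chip_a, fps_chip_b)
    (0.4, 72.0, 60.0),   # game 1 at 1080p
    (0.4, 55.0, 50.0),   # game 2 at 1080p
    (0.2, 30.0, 34.0),   # a compute workload
]
die_area_mm2 = {"A": 365.0, "B": 294.0}   # assumed die sizes, also invented

def composite_score(col):
    # Weighted geometric mean across workloads, SPEC-style, so no single
    # workload dominates the result.
    return math.exp(sum(w * math.log(fps[col]) for w, *fps in workloads))

for name, col in (("A", 0), ("B", 1)):
    score = composite_score(col)
    print(f"chip {name}: composite {score:.1f}, "
          f"composite/mm^2 {score / die_area_mm2[name]:.3f}")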
 
Select workloads of interest for the intended audience for a product, and then take a weighted average across these workloads.

Think SPECint/SPECfp for CPUs.

I get the performance aspect, similar to what you describe re: SPECint. I even understand why those "make sense" in an abstract way. But SPECint isn't factored on the physical size of the CPU, nor is it factored on the total transistor count of the CPU. It's a pure throughput number and not much else.

Similarly, die size doesn't make sense in this example. Die size is interesting for thinking about how much room you have for bus / plane interconnects or maybe total contact pitch for a heatsink. I even understand how total die size impacts cost in an abstract sense (I have other "complaints" regarding how much importance some will assign to that, but will not go into that detail unless someone asks.)

Maybe the problem I'm having is understanding why die size is an important aspect of performance. Because that's what we're assigning it.

To borrow some terrible car analogy: cubic inches does not equate to horsepower. There are a TON of factors that affect how "efficient" a GPU can be; I don't see how size is one of them.
 
Maybe the problem I'm having is understanding why die size is an important aspect of performance. Because that's what we're assigning it.
No. Neither die size, nor power, nor performance matters all by itself. You have to normalize for area or power to make a meaningful comparison across architectures, that's all.

If chip A is twice as big as chip B and performs only 20% better, then chip B is a hands down winner even though chip A has more absolute performance.
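
Spelled out with placeholder numbers chosen only to match the 2x area / 20% performance example:

# Placeholder numbers matching the "twice as big, 20% faster" example.
chip_a = {"area_mm2": 400.0, "perf": 120.0}
chip_b = {"area_mm2": 200.0, "perf": 100.0}

for name, chip in (("A", chip_a), ("B", chip_b)):
    print(f"chip {name}: perf/mm^2 = {chip['perf'] / chip['area_mm2']:.2f}")
# chip A: perf/mm^2 = 0.30
# chip B: perf/mm^2 = 0.50  -> B extracts far more performance per unit of silicon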
 
If chip A is twice as big as chip B and performs only 20% better, then chip B is a hands down winner even though chip A has more absolute performance.

From whose viewpoint?
If cards containing both chips are the same price, then chip A wins.
 
What do you mean by "efficient"?
I put it in quotes purposefully to reflect any number of alternative "efficiency" metrics. If you would like to provide helpful assistance in this thread, kindly insert your favorite "efficiency" metric, and then kindly explain why die size matters in your opinion. Or if you would prefer to continue your snark with the color of the sky in my world, feel free to get ignored :)

No. Neither die size, nor power, nor performance matters all by itself. You have to normalize for area or power to make a meaningful comparison across architectures, that's all.
But power has obvious offshoots that make sense -- more power means more heat, means more cost to operate, means more regulation circuitry to operate correctly. Die size doesn't have any intrinsic limits, except an absolute ceiling on size. Yeah, ok, so a comparatively "big" chip is going to have a higher initial cost than a "small" chip, but when we're talking about really BIG ticket prices, the difference (the example I gave was a $9 final package difference on a $599 MSRP card) is essentially rounding error in the final cost to the consumer. I understand that the bill of materials is going to see it as a higher percentage impact, but even at that price level, is the BOM going to see a $9 hike as anything larger than 10% at most? I don't know; this actually is a question that I have...
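
To put rough numbers on that question, here's a back-of-the-envelope dies-per-wafer sketch; the wafer cost, the two die areas, and the perfect-yield assumption are all placeholders:

import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    # Crude first-order estimate: usable wafer area divided by die area,
    # minus an edge-loss term. Defects and yield are ignored entirely.
    r = wafer_diameter_mm / 2.0
    return int(math.pi * r * r / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2))

wafer_cost = 5000.0   # assumed cost of a processed 300 mm wafer (pure placeholder)
for name, area in (("smaller die", 294.0), ("larger die", 365.0)):
    n = dies_per_wafer(area)
    print(f"{name} ({area:.0f} mm^2): ~{n} candidates per wafer, "
          f"~${wafer_cost / n:.0f} per die at perfect yield")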

If chip A is twice as big as chip B and performs only 20% better, then chip B is a hands down winner even though chip A has more absolute performance.
I get the basic thought that you're conveying, but that assumes "All else is equal" -- and it never is. What performance metric is 20% better? Is it floating point operations? Integer operations? Texture filtering? Raster ops? What if you're trying to tell me that floating point ops is the part that only gained 20%, but it turns out that it's purposefully limited by the vendor (ie, what we already have today?)

The problem is exacerbated by the fact that, in your example, A is still 20% faster than B. In absolute terms, A wins, regardless of die size. When you declare B the winner by virtue only of a smaller die, you have now placed some direct importance on the size of the die -- but why does this matter?

If the argument is price, then we would have to assume that B is half the cost of A -- but that isn't going to be true. What happens if a 100% larger die only costs 10% more to make? Does that make A the winner? What happens if the half-sized die needs two extra PCB layers and requires eight memory chips rather than four to fill out the required bus width? Is the half-sized die still a winner?

Die size, as part of an entire video card, still seems utterly meaningless. There are dozens if not hundreds of other things that will affect price and performance; the BOM isn't the die and an HDMI connector. Even if it were, every possible performance metric would NOT be exactly 20% different between the two.

Again, why is size specifically important in terms of performance?
 
Albuquerque said:
I know what you're going to say, and you're right -- those are functions of the individual computation units. But that's the crux of this whole problem I have with Perf / mm^2... The performance metric is abstract at absolute best; increasing clock speed will skew it. Having a "hot clock" will also skew it, along with "boost" clocks and whatever other timing tomfoolery. One architecture might wallop another in DP flops, but might suck the proverbial canal water when compared in integer ops or texture filtering rate.

This perf/mm^2 metric has NO standard unit; it has no direct bearing on game performance, it has no direct bearing on GPGPU performance, it is (in my opinion) purely a speculative number based on other speculated numbers. It's like we're abstracting the abstract to come up with something even more ethereal.

I get that you all want to know how many FLOPS you can cram into a square millimeter, but that FLOPS number has zero bearing on any reality ANYWHERE. I can't take a flops number and translate it into frames per second on a game, or GPGPU workload, or anything tangible. Which IMO makes it utterly worthless; it's an imaginary number for an imaginary metric.

Please FIRST help me understand how to properly measure perf/mm^2 in a way that makes sense, and then please show me how that applies to anything in real life.
Useful performance is what the users experience as useful. It's usually NOT about individual computation units. Not at all.

There is no single number or standard unit and there doesn't have to be: it depends what you want to do with it.

For some gamers, it's FPS on average, for others it's FPS in Metro2033, for others it's FPS in BF3 but only at 1080p and nothing else.
For a Wall Street quant it's the number of Black-Scholes calculations he can do per second.
For someone else, it's DGEMM flops for very large matrices.

In most of our discussions here, we implicitly assume that we're using average FPS for different games. But then when we talk about a Lux benchmark, it's about compute loads with little or no graphics-related functionality. In general, there is a high correlation between whichever perf metrics you choose, but sometimes there is not: see the 6970 vs. the 580, where the 6970 vastly outperforms it for Bitcoin operations.

Average FPS for multiple games per mm2 is obviously relevant to real life for GPU makers. I know you know that. Does that mean it has relevance to my real life? If I want a GPU with a particular performance, a better perf/mm2 will give me lower leakage, but since I don't game and thus will never buy a big GPU, that relevance is lost on me. But it's relevant to me because it tickles my curiosity. It doesn't need to be more than that, just like others may find disco-era Albanian postage stamps relevant to their life. ;)
 
Davros said:
From whose viewpoint?
If cards containing both chips are the same price, then chip A wins.
From the viewpoint of anyone who's interested in these things? Isn't that sufficient? Isn't that what an architecture forum is for?

Don't you wonder why a GPU with only 512 ALUs beats a GPU with 1536 in most cases? Or why a GPU with a 384-bit bus barely outperforms one with a 256-bit bus?
 
Average FPS for multiple games per mm2 is obviously relevant to real life for GPU makers. I know you know that. Does that mean it has relevance to my real life? If I want a GPU with a particular performance, a better perf/mm2 will give me lower leakage

I'll hit your second point first: keeping in mind that this may be a dumb question, but does absolute die size dictate leakage? I get why, in absolute terms, a larger die will leak "more". However, as I understand the lithography process as a whole, in terms of energy lost versus total transistor count, larger dies are not implicitly more leaky, correct? It's just that four billion transistors are going to leak more than two billion transistors, right?

Now for your first point: I acknowledge that GPU manufacturers will implicitly care about perf/mm^2 in some abstract sense, but I cannot agree that they lay down any sort of target. I have no picture in my head of a bunch of super-smart EE folks behind closed doors at NV or AMD headquarters deciding how much more performance per millimeter they're going to squeeze into the next GPU architecture.

What I DO see happening is differing teams with differing goals in the project. I see a team who is going to build the next new texture management unit that will provide EPIC BogoFLops with only three hundred thousand transistors each and under 0.8uW per transaction (or some such.) When it goes to layout and they realize the 'floorplan' for this new TMU sucks balls with the current ruleset, they'll re-engineer it.

Perhaps better said: I see perf/mm^2 as a very important resultant metric when it's all said and done, but not a discrete number on the wall behind closed doors.

The trick is this: the REAL performance metric would / should be performance per transistor. But we don't know the total transistors, and we don't know density. The fact that we're using die area seems even more unlinked from reality than total transistor count and/or density, but perhaps that's just my opinion.
 
From the viewpoint of anyone who's interested in these things? Isn't that sufficient? Isn't that what an architecture forum is for?

Don't you wonder why a GPU with only 512 ALUs beats a GPU with 1536 in most cases? Or why a GPU with a 384-bit bus barely outperforms one with a 256-bit bus?

But again, these questions are interesting - and yet, wholly unrelated to die size. That's the crux of this entire thread. You are bringing up important things, but things that are nevertheless not constrained to a specific / defined physical size in terms of silicon substrate. The truly interesting metric here would be transistors needed to accomplish these goals. IF you want to talk die size, let's talk about transistor density -- how'd they pack all that awesome into such a small footprint?

But we have no data to do that with, because the only transistor counts we have are pure marketing. The die size isn't much different, except that marketing can't lie (as much) about die size. But the die size is still meaningless, because you might've packed it with 5,000 transistors that do awesomesauce or 500,000 transistors that do fuck-all. It's still 0.08mm^2, but one is far more bad-ass than the other.
 
Albuquerque said:
Does it? Are double-precision FLOPS the same? Are integer ops the same? Is filtering capacity the same? Primitive rate? Tessellation rate? Raster rate? Total megahertz? Texture addressing rate? SQRT rate?
I elaborated on this earlier: based on the gaming perf of Pitcairn and Tahiti and the number of units, I'm simply estimating that the cost of the HPC features in Tahiti is ~20% in area. IOW, without them, it'd be close in area to a GK104 for similar performance (say, within a 10% range). Now one architecture will still be better than the other, but the margin between them is way less than in previous generations. I think we're close to a point where discussions about perf/mm2 are going to be futile due to this convergence: both parties have obviously worked very hard at squeezing in as much efficiency as possible. The low-hanging fruit is gone.
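
As a back-of-the-envelope check on that estimate, using approximate published die sizes (treat the areas and the 20% figure as assumptions):

# Rough version of the ~20% estimate above; both die areas and the overhead
# figure are assumptions, not measured numbers.
tahiti_mm2   = 365.0
gk104_mm2    = 294.0
hpc_overhead = 0.20   # the assumed area cost of Tahiti's HPC features

tahiti_without_hpc = tahiti_mm2 * (1.0 - hpc_overhead)
delta = tahiti_without_hpc / gk104_mm2 - 1.0
print(f"Tahiti minus HPC estimate: {tahiti_without_hpc:.0f} mm^2 "
      f"vs GK104 at {gk104_mm2:.0f} mm^2 ({delta:+.1%})")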

What's left then are decisions at the marketing level about how to size a chip for a particular market segment and whether or not to include HPC functionality.

Right now, it looks like this round goes to Nvidia for choosing not to include HPC features (and catching AMD by surprise?) It will be interesting to see how this plays out in the future. (It also makes things less fun for us, I'm afraid.)
 
Albuquerque said:
I put it in quotes purposefully to reflect any number of alternative "efficiency" metrics.
In other words, you refuse to answer the question so you have a moving goalpost... color me surprised. :rolleyes:
 
I elaborated on this earlier: based on the gaming perf of Pitcairn and Tahiti and the number of units, I'm simply estimating that the cost of the HPC features in Tahiti is ~20% in area. IOW, without them, it'd be close in area to a GK104 for similar performance (say, within a 10% range).
I'm not sure I follow how you got your area calculation. I followed your prior thoughts on this, and you appear to be using Pitcairn as the indicator -- but we have a lot of unanswered questions around transistor density for those "HPC units". Still, is that die area specifically important, since both are going to be priced relatively similarly? I think we're going to land in a place where Tahiti will sell for less money than NV, but potentially have a higher BOM. I'm thinking you will not argue that, right? But the Tahiti BOM also has 50% more memory chips attached to it, which means more traces and potentially stricter tolerances on RF generated by power delivery circuits et al. Is the price tag on AMD's die that's ~20% larger going to measurably affect the BOM? I suppose neither of us knows that for sure, but I'm getting a general "meh" sensation from my gut on this. "Meh" being the sound of "Meh, not really..."

Now one architecture will still be better than the other, but the margin between them is way less than in previous generations.
To be more specific and/or to ensure I'm following along correctly, one architecture will still be better than another at very specific tasks. GPGPU may be stronger on Tahiti, but raw fillrate will likely be better on Kepler. AA performance will probably be stronger on Kepler for the same reason, but pure texturing throughput may end up faster on Tahiti. "On the average", perf/mm^2 could arguably be the same - which is what I believe you are indeed saying. I just want to make sure I am clear on your thoughts...

I think we're close to a point where discussions about perf/mm2 are going to be futile due to this convergence: both parties have obviously worked very hard at squeezing in as much efficiency as possible. The low-hanging fruit is gone.
Maybe this is why I'm having trouble... At this level, we're in agreement. Perhaps back in the Riva128 -> GeForce 3 days, when pretty much all the hardware was "the same", performance per die area became an interesting metric because all the little individual processing units were still their own islands. Vertex units, texture units, shader units, all in lockstep with each other and each tied to their own ROP engine.

We've come to a point where you can add and remove these units almost trivially (almost ;) ) and we have now muddied the water to an incredible degree - to the point where perf / mm^2 now gets too abstract to be meaningful. When we end up with a dozen or more different 'dimensions' of performance that can all be individually tweaked by NV or AMD, it gets awfully hard to put a pinpoint in one and declare ubiquitous victory.

Right now, it looks like this round goes to Nvidia for choosing not to include HPC features (and catching AMD by surprise?) It will be interesting to see how this plays out in the future. (It also makes things less fun for us, I'm afraid.)
In terms of what we know right now, my wallet is agreeing with you :D
 
In other words, you refuse to answer the question so you have a moving goalpost... color me surprised. :rolleyes:
What color is the sky in your world, again? Oh I'm sorry, several other people figured out what was going on -- you just like to troll. I get it now. NO problem, I've heard you loud and clear.

Ignore +1. :)
 
Albuquerque said:
I'll hit your second point first: keeping in mind that this may be a dumb question, but does absolute die size dictate leakage? I get why, in absolute terms, a larger die will leak "more". However, as I understand the lithography process as a whole, in terms of energy lost versus total transistor count, larger dies are not implicitly more leaky, correct? It's just that four billion transistors are going to leak more than two billion transistors, right?
Leakage is dictated most by the active area of a transistor and its speed rating, if you will. The number of transistors by itself is not enough: different standard cells have different drive strengths because they have different output drivers that use larger transistors. Ultra-fast transistors have a different doping level and leak much more (a factor of 10 more is not unusual). Both area and number of transistors are very coarse proxies for what leakage will be. Just compare a GTX 480 and a GTX 580.
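
A toy model of that idea (every coefficient below is invented, just to show why the cell mix matters more than raw transistor count or area):

# Toy leakage model: leakage tracks total transistor width and the Vt flavor
# of the cells used, not simply transistor count or die area.
LEAKAGE_NA_PER_UM = {   # nA of leakage per um of transistor width (made up)
    "hvt": 1.0,    # high-Vt: slow, low leakage
    "svt": 3.0,    # standard-Vt
    "lvt": 10.0,   # low-Vt: fast, roughly 10x leakier
}

def chip_leakage_ma(cell_mix):
    # cell_mix: list of (vt_flavor, total_transistor_width_um) pairs.
    return sum(LEAKAGE_NA_PER_UM[vt] * width for vt, width in cell_mix) * 1e-6

# Two hypothetical chips with the *same* total transistor width but a
# different mix of fast and slow cells.
speed_binned = [("lvt", 2.0e8), ("svt", 6.0e8), ("hvt", 2.0e8)]
conservative = [("lvt", 0.2e8), ("svt", 4.8e8), ("hvt", 5.0e8)]
print(f"fast-cell-heavy chip: {chip_leakage_ma(speed_binned):.0f} mA leakage")
print(f"conservative chip:    {chip_leakage_ma(conservative):.0f} mA leakage")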

Now for your first point: I acknowledge that GPU manufacturers will implicitly care about perf/mm^2 in some abstract sense, but I cannot agree that they lay down any sort of target. I have no picture in my head of a bunch of super-smart EE folks behind closed doors at NV or AMD headquarters deciding how much more performance per millimeter they're going to squeeze into the next GPU architecture.
I'm willing to bet money that this is exactly what they discuss, because those are the discussions we have when comparing our competitor's product to ours. And how to improve. Most projects start with a small number of top priorities:
- Create chip that supports standard XYZ.
- Performance target of N
- Power target of P
- Cost/area target of A
CEO-level numbers that translate directly into metrics like perf/mm2.
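
As a sketch of how those top-level numbers might be written down and turned into a derived perf/mm2 target (the class and values below are hypothetical, not any real flow):

from dataclasses import dataclass

@dataclass
class ChipTargets:
    # Hypothetical top-level project targets of the kind listed above.
    standard: str           # "supports standard XYZ"
    perf_target: float      # performance target N (composite score, arbitrary units)
    power_target_w: float   # power target P in watts
    area_target_mm2: float  # cost/area target A in mm^2

    @property
    def perf_per_mm2(self) -> float:
        # The top-level numbers translate directly into this derived metric.
        return self.perf_target / self.area_target_mm2

# Entirely made-up example values.
next_gen = ChipTargets("XYZ", perf_target=100.0,
                       power_target_w=200.0, area_target_mm2=300.0)
print(f"implied perf/mm^2 target: {next_gen.perf_per_mm2:.2f}")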

What I DO see happening is differing teams with differing goals in the project. I see a team who is going to build the next new texture management unit that will provide EPIC BogoFLops with only three hundred thousand transistors each and under 0.8uW per transaction (or some such.) When it goes to layout and they realize the 'floorplan' for this new TMU sucks balls with the current ruleset, they'll re-engineer it.
Not possible: layout happens very late in the process. What happens instead is that an architecture gets modeled in some high-level language to see if it will perform as expected. E.g., an executable model of an execution unit in some high-level language, plus a number of representative pieces of code mapped to the unit's instructions. Once that performs as expected, you can make fairly accurate predictions of what the size of the layout will be one year later: you can rely on the experience of past designs.
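
A toy version of that kind of pre-layout prediction, assuming you scale a known block from a previous design; every number below is a placeholder:

# Scale a known block by its change in unit count, an assumed process shrink,
# and a fudge factor drawn from experience with past designs.
def predict_block_area_mm2(prev_area_mm2, prev_units, new_units,
                           node_scale=0.55, experience_factor=1.10):
    return prev_area_mm2 * (new_units / prev_units) * node_scale * experience_factor

# Hypothetical: a 40 mm^2 block with 16 execution units on the old node,
# grown to 24 units on a node assumed to shrink cell area to ~55%.
print(f"predicted block area: {predict_block_area_mm2(40.0, 16, 24):.1f} mm^2")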

Perhaps better said: I see perf/mm^2 as a very important resultant metric when it's all said and done, but not a discrete number on the wall behind closed doors.
No, that's a recipe for disaster. If you're not guided by a target, you'll blow it out of the water. We keep meticulous track of a ton of design parameters and email them out as trending graphs weekly to keep everybody informed: general regression status, bug statistics, various power parameters, timing status, design area (in mm2, though it's just the sum of cell sizes, not after layout), etc.

The trick is this: the REAL performance metric would / should be performance per transistor. But we don't know the total transistors, and we don't know density. The fact that we're using die area seems even more unlinked from reality than total transistor count and/or density, but perhaps that's just my opinion.
I've never seen anyone care about the number of transistors in engineering. Mm2 and, to a lesser extent, the number of cells are what count.
 
Albuquerque said:
But again, these questions are interesting - and yet, wholly unrelated to die size. That's the crux of this entire thread. You are bringing up important things, but things that are nevertheless not constrained to a specific / defined physical size in terms of silicon substrate. The truly interesting metric here would be transistors needed to accomplish these goals. IF you want to talk die size, let's talk about transistor density -- how'd they pack all that awesome into such a small footprint?
They use standard cells that are placed with a cell density of 70%. ;) Not joking.

The only people who see transistors are the cell designers, more often than not from a different company. Seriously: we really, really, don't know and don't care about individual transistors. It's all abstracted out into cells. And the cells are abstracted into higher level constructs. When you define an architecture, you talk about concepts like ALUs, crossbars, register files etc. That's sufficient to go a very long way.

But the die size is still meaningless, because you might've packed it with 5,000 transistors that do awesomesauce or 500,000 transistors that do fuck-all. It's still 0.08mm^2, but one is far more bad-ass than the other.
No: the cells are all roughly the same size. The ratio of the smallest possible cell (a drive-1 inverter) to the largest (say, a highly featured scan FF or major buffer) is maybe a factor of 15? Most cells are somewhere around 6? It all averages out nicely.
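
Which is also why cell counts plus a utilization figure map to die area fairly directly; a minimal sketch with invented cell sizes that keep roughly that smallest-to-largest ratio:

# Sum the cell areas and divide by the placement utilization to get a
# die-area estimate. Cell names and sizes are invented for illustration.
CELL_AREA_UM2 = {"inv_x1": 0.5, "nand2": 0.8, "dff": 3.0, "scan_dff": 7.5}

def estimated_area_mm2(cell_counts, utilization=0.70):
    total_um2 = sum(CELL_AREA_UM2[cell] * n for cell, n in cell_counts.items())
    return total_um2 / utilization / 1e6   # um^2 -> mm^2

block = {"inv_x1": 4_000_000, "nand2": 20_000_000,
         "dff": 5_000_000, "scan_dff": 1_000_000}
print(f"~{estimated_area_mm2(block):.1f} mm^2 for this hypothetical block")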
 
Alright, fair enough. I didn't realize that physical layout had come to the point where density is basically constant -- in relative terms. My ignorance on that factor then precipitated my inability to get my head wrapped around why die size matters in performance terms.

I get it now, or at least, more than I did four hours ago :D Thank you for all the replies and the time you spent to explain it in a way that I could understand!
 
Now for your first point: I acknowledge that GPU manufacturers will implicitly care about perf/mm^2 in some abstract sense, but I cannot agree that they lay down any sort of target. I have no picture in my head of a bunch of super-smart EE folks behind closed doors at NV or AMD headquarters deciding how much more performance per millimeter they're going to squeeze into the next GPU architecture.
Designers should always care about perf/mm^2, but there's not some hard target, just the knowledge that the higher the ratio, the better off you are. A good analogy is someone gaining weight. You make a lot of little decisions in life about skipping exercise and eating dessert, and all of a sudden you wonder where those 10 or 20 pounds came from. Then you decide to do something about it. Hence R600 -> RV770 and Fermi -> Kepler.

And I agree with silent_guy that 70% utilization is a pretty good starting point for physical designers.
 