Education: Why is Performance per Millimeter a worthwhile metric?

To borrow a terrible car analogy: cubic inches do not equate to horsepower. There are a TON of factors that affect how "efficient" a GPU can be; I don't see how size is one of them.

Price is proportional to performance, cost is proportional to die size. Performance / area is thus a measure of how profitable your product will be.

Only stockholders will care about this. We, the consumers, care primarily about performance/$ and secondarily about performance/watt (noise under load).
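
As a rough sketch of that proportionality argument, here is a toy calculation; every constant below is made up purely for illustration, not a real price or wafer cost:

```python
# Toy model: price scales with performance, manufacturing cost scales with die
# area, so at a given performance level the denser design keeps more margin.
# All constants are invented for illustration.

def margin(perf, area_mm2, price_per_perf=1.0, cost_per_mm2=0.2):
    """Gross-margin proxy: price (~ performance) minus cost (~ die area)."""
    return price_per_perf * perf - cost_per_mm2 * area_mm2

# Two hypothetical chips with equal performance but different die sizes.
print(margin(perf=100, area_mm2=350))  # -> 30.0  (0.29 perf/mm^2)
print(margin(perf=100, area_mm2=250))  # -> 50.0  (0.40 perf/mm^2)
```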

Cheers
 
But performance/area is mostly correlated with performance/watt anyway...
 
"Performance per Millimeter" is as worthless a metric to me, a standard home consumer, as "Performance per Watt". I doubt I run a GFX card at full blast long enough for it to make an appreciable difference to my 'leccy bill, so I don't really care that it uses 50w less. When we started seeing the graphs for Kepler and up, and they were on a log scale of "perf/watt" I knew that we'd get 50% more perf (instead of 100% like the good old days) but that the power envelope would stay the same. Low and behold, we get the 680GTX, which I'm still classing as not a worthwhile upgrade over my 4870x2, not a great increase in perf, but it's W is lower.

Performance is measured in one thing only: Will It Play Crysis? So far, "NO". Perf/anything-else, therefore, can take a back seat.
 
No one's arguing perf/mm^2 should factor into a consumer's purchase decision, so if that's all readers are concerned about, they can ignore any mention of it.
 
I care about performance per watt.
I don't like destroying my planet just for entertainment, thank you!
(And neither should you be allowed to BTW :p)
 
When asking for a video card recommendation, no one's ever asked how big the GPU is.

This is increasingly frustrating here. This is hardly the place where the layman is a factor in determining how/why aspects of a GPU are better/worse. If it were then every time a new GPU was released, all you would need is a thread asking if it's faster in Farmville. This is Beyond 3D.
 
But power has obvious offshoots that make sense -- more power means more heat, means more cost to operate, means more regulation circuitry to operate correctly. Die size doesn't have any intrinsic limits, except an absolute ceiling on size. Yeah, OK, so a comparatively "big" chip is going to have a higher initial cost than a "small" chip, but when we're talking about really BIG ticket prices, the example I gave (a $9 final package difference on a $599 MSRP card) is essentially rounding error in the final cost to the consumer. I understand that the bill of materials is going to see it as a higher percentage impact, but even at that price level, is the BOM going to see a $9 hike as anything larger than 10% at most? I don't know; this actually is a question that I have...
Yes, bigger dies cost more, nonlinearly more. And the bigger you get, the steeper the climb. It's even steeper for a brand-new process like 28nm.
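
A quick way to see that non-linearity is a toy yield model; the wafer cost and defect density below are assumptions chosen only to show the shape of the curve, not real 28nm figures:

```python
# Rough sketch of why cost per good die grows faster than die area,
# using a simple Poisson yield model. All inputs are illustrative guesses.
import math

WAFER_COST = 5000.0      # $ per wafer (assumed)
WAFER_DIAMETER = 300.0   # mm
DEFECT_DENSITY = 0.25    # defects per cm^2 (assumed; higher on a new process)

def cost_per_good_die(area_mm2):
    wafer_area = math.pi * (WAFER_DIAMETER / 2) ** 2
    # Crude dies-per-wafer estimate with an edge-loss correction term.
    dies = wafer_area / area_mm2 - math.pi * WAFER_DIAMETER / math.sqrt(2 * area_mm2)
    yield_fraction = math.exp(-DEFECT_DENSITY * area_mm2 / 100)  # Poisson yield
    return WAFER_COST / (dies * yield_fraction)

for area in (100, 200, 400):
    print(area, "mm^2 ->", round(cost_per_good_die(area), 2), "$/good die")
# ~10, ~27, ~95: doubling the area far more than doubles the cost per good die,
# because you get fewer candidate dies AND a lower yield on each of them.
```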
I get the basic thought that you're conveying, but that assumes "all else is equal" -- and it never is. What performance metric is 20% better? Is it floating-point operations? Integer operations? Texture filtering? Raster ops? What if you're trying to tell me that floating-point ops are the part that only gained 20%, but it turns out they're purposefully limited by the vendor (i.e., what we already have today)?
Those are synthetic microbenchmarks.

They are useful for architectural exploration, but not for evaluation by any sane measure. Like silent_guy pointed out, you have to consider entire applications.
The problem is exacerbated by the fact that, in your example, A is still 20% faster than B. In absolute terms, A wins, regardless of die size. When you declare B the winner by virtue only of a smaller die, you have now placed some direct importance on the size of the die -- but why does this matter?
It matters because some (myself included) care about architecture. Deeply. In this example, if everything else like power, features, price etc. were equal, as a consumer I might still end up buying chip B.
If the argument is price, then we would have to assume that B is half the cost of A -- but that isn't going to be true. What happens if a 100% larger die only costs 10% more to make? Does that make A the winner? What happens if the half-sized die needs two extra PCB layers and requires eight memory chips rather than four to fill out the required bus width? Is the half-sized die still a winner?
PCB costs are rounding error. As for memory chips, you've got that the other way around: smaller dies usually have narrower interfaces. Even otherwise, die cost is probably the biggest contributor to the BOM. As for a 2x bigger die being only 10% more expensive to make, that's almost impossible for realistic sizes and cutting-edge processes.

Die size, as part of an entire video card, still seems utterly meaningless. There are dozens if not hundreds of other things that will affect price and performance; the BOM isn't the die and an HDMI connector. Even if it were, every possible performance metric would NOT be exactly 20% different between the two.
Performance differences lie in the eye of the beholder.
 
Alright, fair enough. I didn't realize that physical layout had come to the point where density is basically constant -- in relative terms. My ignorance on that factor then precipitated my inability to get my head wrapped around why die size matters in performance terms.

I get it now, or at least, more than I did four hours ago :D Thank you for all the replies and the time you spent to explain it in a way that I could understand!

Physical design is not nearly that simple or uniform, but it is abstracted.

As Silent Guy alluded to earlier, different transistors can have leakage that varies by around 10-100X. The drive strength (i.e. speed) probably varies by around 2-8X. The size of the transistor is a fairly important variable as well, and can directly influence those other two factors. E.g. larger transistors leak less. So there is always a trade-off between power, area and performance.

The reason that you care about perf/mm2 or perf/W is that it captures that trade-off and normalizes for different factors.

If you have a GPU that is 1.2X faster, but is 8X larger, it's not a very efficient design. It's the best design for people who need the maximum performance...but for anyone who cares about cost it's going to be a worse choice.
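
To put hypothetical numbers on that 1.2x-faster / 8x-larger case (every figure here is invented just to show how the normalized metrics read):

```python
# Invented figures for the comparison above: A is the faster chip, B the smaller one.
designs = {
    "A": {"perf": 120, "area_mm2": 800, "watts": 300},
    "B": {"perf": 100, "area_mm2": 100, "watts": 150},
}
for name, d in designs.items():
    print(name,
          "perf/mm^2 =", round(d["perf"] / d["area_mm2"], 2),
          "perf/W =", round(d["perf"] / d["watts"], 2))
# A: perf/mm^2 = 0.15, perf/W = 0.4
# B: perf/mm^2 = 1.0,  perf/W = 0.67
# A wins on absolute performance; B is the far more efficient design, which is
# what drives cost for anyone not chasing the absolute top part.
```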

The same goes for power.


That being said, as a consumer...you don't really care directly about perf/mm2. However, you care about a number of factors (e.g. power, cost, performance, board size, etc.) that are all profoundly influenced by perf/mm2 and perf/W.

So studying perf/mm2 and perf/W is valuable, because it can give you insights about the things you do care about.

DK
 
One thing I've been mulling over concerning perf/mm2 is that top-end comparisons usually work because we compare designs on mostly comparable processes, or their direct predecessors.
With the rising costs and lengthening delays in making node transitions, and the lack of guarantees that making a transition is beneficial, perf/mm2 analysis may become more complicated.

AMD made a conscious decision prior to the 32nm cancellation that its lower-end chips would remain on 40nm, because of the cost argument.
28nm has not yet given us the volumes or perf/$ many have expected.
AMD may be pressured to lower prices or roll out some kind of OC edition, but supply limitations may push out the crossover point where any downward trend in demand crosses its supply.

Slides were shown with chip design costs more than doubling from 45nm to 22nm; I'm not certain if that includes escalating mask costs.
If 2.5 and 3D integration come to pass in wide scale, we may be comparing designs whose perf/mm2 per layer is inferior, but whose cost/mm2 and cost per design is competitive.

While it is generally not considered elegant or efficient to throw a lot more silicon at something, it doesn't look so bad if--and I am not claiming this is true yet--you can achieve the same end result while pocketing some cash savings.

When silicon scaling does hit a wall, we may have to start comparing perf/mm3 or perf/elements used in the recipe as designers and fabs work their way around any impediments.
 

I was just looking at this one the other day:

[Attached image: beckley-slide-5.jpg]
 
The top end of the design cost ranges could show a $400 million difference in design costs, so even if a next-node design is better than an oversized current-node device, is it $400 million better?
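
One way to frame "is it $400 million better?" is as a break-even count: the extra design cost has to be recovered through per-unit savings. The per-unit figure below is purely an assumption for illustration:

```python
# Hypothetical break-even: extra next-node design cost vs. per-unit silicon savings.
EXTRA_DESIGN_COST = 400e6   # $ difference in design cost taken from the slide's range
PER_UNIT_SAVING = 25.0      # assumed silicon-cost saving per GPU sold, $

print(f"{EXTRA_DESIGN_COST / PER_UNIT_SAVING:,.0f} units to break even")
# -> 16,000,000 units. If the product won't ship on that order, an oversized
# current-node design can be the more profitable choice despite worse perf/mm^2.
```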

In terms of GPUs, interposers and die stacking may be something to watch out for, even if no radical new design changes take place in the graphics architecture.

Let's posit that 2.5D integration and a wide stacked-GDDR standard became feasible in the time frame of that slide (just to have some numbers to play with). A 300-400 mm2 redesigned GPU on the next node with the usual PCB mounting might face off against a 28nm dual-GPU (say Pitcairn++, or Tahiti++ if you're into that kind of thing) with stacked DRAM.

In this manufactured scenario, the next node might not win. The interposer might make certain things possible that weren't before, such as a very high-bandwidth link between GPUs.
 
Even then, there would be an order-of-magnitude throughput and latency penalty for going off-die. Graphics architectures will have to change significantly to scale with that.
 
The latencies in question are in the same order of magnitude as a memory access. An access to the second die will require a second hit to the memory controller or associated L2 slice, then possibly on to a DRAM module. It may be 2-5x the latency, but there are optimizations that can be made to make accesses to the other GPU bypass the rather significant buffering and queueing related to high utilization of GDDR on the first die.

The latency of NUMA for CPUs is within 1.5-2x a local hit, and that is considered close enough to treat as a single pool.
GPUs can tolerate much higher latency; what they could not obtain was the raw bandwidth.
In terms of throughput, a link would have the same pitch and signalling rates as available to the on-interposer DRAM.
Various streamout scenarios on single GPUs already contend with similar round trips to memory.
 
I wasn't speaking of DRAM latency, which GPUs are good at hiding. I was speaking of the latency to go off-die to reach the second GPU die. A sort-last renderer *might* not be a good fit for that.

But if it motivates them to switch to sort-middle... :)
 
As Silent Guy alluded to earlier, different transistors can have leakage that varies by around 10-100X. The drive strength (i.e. speed) probably varies by around 2-8X. The size of the transistor is a fairly important variable as well, and can directly influence those other two factors. E.g. larger transistors leak less. So there is always a trade-off between power, area and performance.

So, all other factors being equal, making your transistors (or the most critical transistors, in this case) a bit larger on purpose could make them less leaky? Is that (one of the things) you're implying above? For example, forgoing the absolutely smallest overall die size in order to achieve more consistent results across your yielded chips?
 
I must admit that I wasn't aware of this. It's not something you get exposed to in the standard-cell world: you simply have a ton of different cells with a matrix of different speeds and different driving strengths. How this is implemented inside the cell is something you never look at.
 
There appear to be density trade-offs made at least implicitly for the sake of mitigating variation, meaning that the transistors being used could be smaller but default to larger sizes.

Intel's SRAM cell sizes are often larger than those offered by foundry processes such as TSMC's.
However, when it comes to designing something manufacturable, Intel's actual products tend to use cells that are the advertised size.
One of the reasons is to provide enough margin for a mass-produced product's memory to meet its reliability and power requirements at the desired voltages.
 
I must admit that I wasn't aware of this. It's not something you get exposed to in the standard-cell world: you simply have a ton of different cells with a matrix of different speeds and different driving strengths. How this is implemented inside the cell is something you never look at.

Yes. If you look at presentations, you'll see a number of discussions of "long Leffective" devices. Those are slightly longer transistors that have lower leakage.

DK
 