Die Stacking for Desktop GPUs?

Jawed

Legend
Die Stacking Apparatus and Method

[0004] Conventional integrated circuits are frequently implemented on a semiconductor substrate or die that consists of a small rectangular piece of semiconductor material, typically silicon, fashioned with two opposing principal sides. The active circuitry for the die is concentrated near one of the two principal sides. The side housing the active circuitry is usually termed the "active circuitry side," while the side opposite the active circuitry side is often termed the "bulk silicon side." Depending on the thermal output of the die, it may be desirable to mount a heat transfer device, such as a heat sink, on the bulk silicon side of the die. This mounting may be directly on the bulk silicon side or on a lid that is positioned over the die.

[0005] A conventional die is usually mounted on some form of substrate, such as a package substrate or a printed circuit board. Electrical connectivity between the die and the underlying substrate or board is established through a variety of conventional mechanisms. In a so-called flip-chip configuration, the active circuitry side of the die is provided with a plurality of conductor balls or bumps that are designed to establish a metallurgical bond with a corresponding plurality of conductor pads positioned on the substrate or circuit board. The die is flipped over and seated with the active circuitry side facing downwards on the underlying substrate. A subsequent thermal process is performed to establish the requisite metallurgical bond between the bumps and the pads. One of the principal advantages of a flip-chip mounting strategy is the relatively short electrical pathways between the integrated circuit and the substrate. These relatively low inductance pathways yield a high speed performance for the electronic device.

[0006] In some circumstances it may make sense from a performance standpoint to stack one semiconductor die on another semiconductor die. For example, it may be advantageous to mount a memory device on a processor device. However, electrical interconnects must be established between the stacked dice. Several conventional techniques for stacking dice have been considered.

[0007] In one conventional variant, a relatively small semiconductor die is positioned on the bulk semiconductor side of a much larger semiconductor die. Bonding wires are used to establish the electrical conductivity between the upper die and the lower die. The difficulty associated with this approach is that the bonding wires tend to be relatively long electrical pathways and thus exhibit higher than desired inductance and proportionally slower electrical performance. In addition, the bulk semiconductor side is not available for heat sink mounting.

[0008] In another conventional variant, a relatively small die is flip-chip mounted on the bulk silicon side of a larger semiconductor die. Electrical interconnects between the upper and lower die are provided by a plurality of conductor traces that are formed on the bulk silicon side of the lower die. As with the first-mentioned conventional design, the conductor traces represent relatively high inductance pathways and thus limit speed performance. Furthermore, the bulk silicon side is not available for a heat sink.

[0009] In still another conventional design, a second die is mounted on the bulk silicon side of a larger die. Electrical interconnects between the two are established through a plurality of silicon vias. This design also requires a plurality of external traces and thus represents longer than desired electrical pathways for signal transfer. In addition, the bulk silicon side is not available for a heat sink.

[0010] A fourth conventional design consists of a first die upon which a couple of additional dice are positioned. The multiple smaller dice are electrically interconnected with each other and with the larger base die by way of a metal layer that is patterned on the base die and the two top-mounted dice. The metal layer is not unlike a larger scale version of a typical metallization layer used in a semiconductor die. In this regard, a dielectric layer is typically formed over the base die and the multiple top-mounted dice. The dielectric layer is lithographically patterned with openings to selected portions of the base and top-mounted dice. The metallization layer is thereafter deposited over the dielectric layer. This conventional technique requires a very high degree of die alignment, which is not always possible and thus may result in limited yields.

I can't tell whether this document proposes an improvement for the kind of die stacking that seems to be prevalent in handheld devices or whether it's in anticipation of the higher power GPUs we know and love so much in our desktop PCs.

A key feature of this is that the two dice are mounted face-to-face, so that they are "flip-chipped" to each other.

Any ideas?

The next question, then, is what would be the point of doing this in a desktop environment?

So is this purely for handheld?

Jawed
 
I always thought DRAM stacking would, in *THEORY*, be an attractive approach to reduce costs in the IGP and ultra-low-end discrete markets. Just put a 2Gbit GDDR5 chip (32-bit bus) on the same package (MCM or stack or anything really) and tadam, you've got enough memory bandwidth without needing a single (extra?) memory bus on the PCB. In effect, a super-sideport.
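
As a rough sketch of the bandwidth such a "super-sideport" could offer: one 32-bit GDDR5 chip at early-GDDR5 per-pin rates lands in the mid-teens to ~20 GB/s. The per-pin data rates below are my assumptions, not from any datasheet.

```python
# Back-of-the-envelope bandwidth for one 32-bit GDDR5 chip used as a "super-sideport".
# The per-pin data rates are illustrative assumptions for early GDDR5 parts.
bus_width_bits = 32

for gbps_per_pin in (3.6, 4.0, 5.0):
    total_gbps = bus_width_bits * gbps_per_pin   # aggregate Gbit/s
    print(f"{gbps_per_pin} Gbps/pin -> {total_gbps / 8:.1f} GB/s")
```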

Given the mention of a heat sink, I doubt this is for handhelds. However, it could be a fairly generic patent they don't plan to apply.
 
According to this datasheet:

http://www.qimonda.com/download.jsp...sheets Graphics/IDGV51-05A1F1C_rev070_www.pdf

a GDDR5 chip is 14x12mm. If the GPU chip is ~17x17mm, it's basically a non-starter.

So stacking could only be feasible with un-packaged memory it seems, which I guess means custom and relatively small amounts of memory, not third-party off-the-shelf monstrous GDDR5.

Apart from anything else, stacking has to be downwardly scalable, e.g. to GPUs in the region of 120mm2, with smaller GPUs losing whatever performance benefit might accrue from stacked memory (and such smaller GPUs may entirely disappear anyway due to the rising tide of IGPs).

So the next question is, what would you do with a blob of memory, say 32 or 64MB that has a reasonably fast connection to the GPU. How fast does that connection have to be? Is the kind of command interface and partially-obfuscated banking of a third-party memory chip going to bring enough performance? Why not just use standard GDDR5?

So it seems to me this could only really work if the stacked chip was essentially designed by the IHV too, so that it could fit in tightly with the GPU architecture. Maybe have a variable number of these sub-chips depending on the capability of the main chip. And presumably have an architecture that's going to demand these sub-chips for years to come.

The other issue is then to do with process. What kind of process is the memory built upon? Why not just use EDRAM for the main chip?

So, overall, I have to say it doesn't look at all compelling.

Jawed
 
So stacking could only be feasible with un-packaged memory it seems, which I guess means custom and relatively small amounts of memory, not third-party off-the-shelf monstrous GDDR5.
If they used GDDR5 then physically the only thing custom about it is that it would be flip chip mounted on a die instead of a normal FC substrate. I don't know if DRAM manufacturers do testing/repair at wafer level or after FC mounting though ... Known Good Dies are pretty much a necessity.

If they already do wafer level testing/repair I doubt they would charge a large premium for bare dies over BGA chips, it saves them money after all.
 
If ever, I expect to see package stacking rather than die stacking. With Package-on-Package (POP), the bottom substrate has balls on either side. The top package is simply soldered right on top of the bottom package.

This way, you can still source from multiple vendors while still having (some of) the compactness of die stacking.
 
If ever, I expect to see package stacking rather than die stacking. With Package-on-Package (POP), the bottom substrate has balls on either side. The top package is simply soldered right on top of the bottom package.

This way, you can still source from multiple vendors while still having (some of) the compactness of die stacking.

But that is only an option for low power devices as you can't mount a heat sink in an efficient way on that package.
 
A couple of good articles on the subject in a recent IBM RD (with pretty pictures :mrgreen:)

http://www.research.ibm.com/journal/rd52-6.html

The rationale for 3d-stacking on CPUs was pretty interesting.

Cache capacity and bandwidth

It is known heuristically that the cache-miss rate is proportional to the reciprocal of some root of the cache capacity. Recently, it was shown that this is because the probability of re-references to data is driven by a power law [31]. For many workloads, the square root is a good fit. Data in a cache is stored as cache lines, which are contiguous sequences of bytes, aligned on boundaries of some power of 2, for example, 64 bytes or 128 bytes, or even more. The reason for this is that reference patterns follow two orthogonal localities, so large chunks (cache lines) are used to capture these behaviors.

Spatial locality of reference is the phenomenon that if a program references a particular datum, then it is extremely likely that the program will also reference other data that are spatially close (by address) to the referenced datum; that is, it will reference nearby data. This is the rationale for bringing in 64 or 128 bytes instead of merely words (4 bytes). Misses caused by near-future references to nearby data (within 64 or 128 bytes) are obviated. Temporal locality of reference is the phenomenon that if a program references a particular datum, then it is extremely likely to re-reference that same datum again in the near future. This locality is the rationale for caching data in the first place; if it were not inherent to programs, caches would not work. Most real workloads exhibit both behaviors simultaneously.

Choosing the right line size in a cache is then an optimization problem in which spatial and temporal localities are traded against each other within a fixed-capacity cache. If a cache is partitioned into many short lines, the fact that there are many of them will capitalize on the temporal aspects of the reference behavior. If the cache is partitioned instead into a smaller number of larger lines, then less temporal context can be captured, but the longer lines provide more spatial context.

However, note that there are other effects of the choice of line size. A cache directory must have a directory entry per line. If the size of the cache is doubled, then the choice is either to keep the size of the directory (therefore, its tractability) intact by doubling the line size or to keep the line size intact and double the size of the directory. Since directory access time is considered sacrosanct in many machines, the trend is toward larger line sizes as cache sizes grow. (In reality, cache line sizes are not generally growing in commercial systems because of contention and potential software tuning problems. The argument just describes one of the incentives to grow the line size.)
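
(Aside, not from the article: the directory arithmetic above in numbers. The cache and line sizes are illustrative.)

```python
# Directory entries = cache size / line size. Doubling the cache either doubles the
# directory (same line size) or keeps it flat (doubled line size). Sizes are illustrative.
MB = 1024 * 1024

def directory_entries(cache_bytes, line_bytes):
    return cache_bytes // line_bytes

print(directory_entries(4 * MB, 64))    # 65536 entries
print(directory_entries(8 * MB, 64))    # 131072 entries: directory doubled
print(directory_entries(8 * MB, 128))   # 65536 entries again: line size doubled instead
```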

An adverse effect of doubling the line size is that it will then take twice as long to transfer a line if the bandwidth is held constant. This can cause costly queuing delays if certain bus utilization thresholds are exceeded, since queuing delay has a severe nonlinearity as utilization increases. The time to transfer a line (measured in processor cycles) is called the “trailing edge” (TE) of a miss and is equal to the number of packets in the line (which is the line size divided by the bus width) times the transfer rate (processor clocks per bus clock). Note that the bus utilization is proportional to the TE, since from the perspective of the bus, the TE is the service time.
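
(Aside, not from the article: the trailing-edge formula worked through with assumed numbers.)

```python
# TE = (line size / bus width) * (processor clocks per bus clock), per the quoted text.
# The line sizes, bus width, and clock ratio below are illustrative assumptions.
def trailing_edge(line_size_bytes, bus_width_bytes, cpu_clocks_per_bus_clock):
    packets = line_size_bytes // bus_width_bytes
    return packets * cpu_clocks_per_bus_clock

print(trailing_edge(64, 16, 4))    # 16 processor cycles
print(trailing_edge(128, 16, 4))   # 32 cycles: doubling the line doubles the TE
```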

In fact, when measuring the performance of a system, the bus bandwidth and the cache capacity manifest as each other and are “mutually fungible” in the following obvious way. To the extent that a cache can be made larger, less bandwidth is required to sustain the cache contents, since those contents will persist longer in the cache. In addition, to the extent that the bandwidth can be increased, less cache capacity is required, since the bandwidth enables the cache to be more facile at quickly importing its contents.

However, because of the power law that governs the miss rate as a function of capacity, this “fungibility” between bandwidth and capacity is disproportionate, as we show below. The reason that this is an impending problem is that we are nearing reasonable electrical off-chip bandwidth limits (and these are area and power limits) and will likely hit them in a generation or two. Further down the road, optics has the potential to provide some relief, but it will likely not arrive in time to avoid the problem.

If we use T to represent performance (where T is the number of threads at a fixed speed), B to represent off-chip bandwidth, and C to represent cache capacity, the impending question will be, “If B cannot increase, what must we do to C to be able to double T (e.g., by adding cores, or threads, or virtual images, which is clearly the trend)?” Figure 5 shows the situation with T threads, B bandwidth, and C cache.

[Figure 5]

If we double T by merely putting two copies of this system on the same chip to get 2T (see Figure 6), then we also double C (to get 2C), and we double B (to get 2B). However, if we are at a bandwidth limit, we must hold B fixed. This means that we need to cut the original B in half. Because the miss rate varies inversely with the square root of the cache capacity, cutting B in half requires quadrupling C for each {T, B, C}, in order to get two copies of {T, B/2, 4C}.

[Figure 6]

This means that when we are bandwidth limited, doubling T requires scaling the cache capacity by a factor of 8 (i.e., by 2 × 4C). Since the apparent trends in future scaling all involve increasing the number of threads, and since we are nearing bandwidth limits, this implies that we need a technology that can integrate more storage at an extremely high rate of scaling. This is quite formidable and beyond the capacity of technology that follows Moore's Law.
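
(Aside, not from the article: the {T, B, C} argument restated numerically, assuming the square-root miss-rate law it describes.)

```python
# Off-chip demand of one {T, B, C} block, normalised so cache = 1 gives demand = 1,
# i.e. demand ~ 1/sqrt(cache). Assumes the square-root miss-rate law from the article.
def per_copy_demand(cache):
    return cache ** -0.5

print(per_copy_demand(1))        # 1.0: one block uses the whole bandwidth budget B
print(2 * per_copy_demand(4))    # 1.0 again: two blocks fit in B only if each has 4x
                                 # the cache, i.e. 8x the original capacity in total
```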
 
So the next question is, what would you do with a blob of memory, say 32 or 64MB that has a reasonably fast connection to the GPU. How fast does that connection have to be? Is the kind of command interface and partially-obfuscated banking of a third-party memory chip going to bring enough performance? Why not just use standard GDDR5?
If you're stacking DRAM then you're better off with a custom design with much higher bandwidth than an external solution considering that you can basically make the bus as wide as you want. Once you've got it there you can use it in the driver as a pool of faster memory for storing pretty much everything; for example you could stick all the render targets that you can fit in there. That would also scale with lower-end solutions with less stacked DRAM where you would see larger performance drops as the resolution goes up and one or more of the RTs have to be moved to the external memory.
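
A toy sketch of that driver policy, with made-up pool and render-target sizes: keep what fits in the stacked pool, spill the rest to external memory, and the split worsens as the pool shrinks or resolution grows.

```python
# Toy sketch of the driver policy described above: place render targets in the stacked
# ("fast") pool while they fit, spill the rest to external memory.
# Pool sizes, RT names, and sizes are illustrative assumptions.
def place_render_targets(render_targets_mb, fast_pool_mb):
    fast, external, used = [], [], 0.0
    # Plausible heuristic: give the biggest bandwidth consumers first claim on the pool.
    for name, size in sorted(render_targets_mb.items(), key=lambda kv: -kv[1]):
        if used + size <= fast_pool_mb:
            fast.append(name)
            used += size
        else:
            external.append(name)
    return fast, external

# ~8.8 MB per 1920x1200 RGBA8 / 32-bit-Z surface; 4x MSAA colour is ~35 MB.
rts = {"colour": 8.8, "depth": 8.8, "msaa_colour": 35.2, "shadow_map": 16.0}
print(place_render_targets(rts, fast_pool_mb=64))   # everything but one RT fits
print(place_render_targets(rts, fast_pool_mb=32))   # smaller stack: more spills
```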
 
But that is only an option for low power devices as you can't mount a heat sink in an efficient way on that package.
Sure, but it may just work for an IGP (or not... It's probably not too hard to find some thermal numbers for such things.)
 
If you're stacking DRAM then you're better off with a custom design with much higher bandwidth than an external solution considering that you can basically make the bus as wide as you want. Once you've got it there you can use it in the driver as a pool of faster memory for storing pretty much everything; for example you could stick all the render targets that you can fit in there. That would also scale with lower-end solutions with less stacked DRAM where you would see larger performance drops as the resolution goes up and one or more of the RTs have to be moved to the external memory.
But why use die-stacking (which is effectively "de-integration") to achieve this instead of simply making the die larger?

OK, so you've built GT300 and it's 625mm2, TSMC can't make it any bigger (I'm guessing about the maximum die size), so you want to stack some memory. Fair enough.

But is a die-stack-based GPU design scalable? As soon as you want to scale-down GT300 or optically-shrink it, you want to re-integrate that blob of memory.

One partial solution to the area problem and thence the scalability problem is that memory, being low-power, is amenable to self-stacking.

So, instead of 128MB of "GDDR5" costing ~100mm2, it could be made 10mm2, say as a stack of 10 layers. This macro-stack of memory can then be stacked upon a GPU and since it's considerably smaller than the GPU, the GPU-scaling problem disappears.

Of course this still presumes that the macro-stack is able to confer considerably higher performance for the GPU than traditionally configured GDDR5.

Jawed
 
But why use die-stacking (which is effectively "de-integration") to achieve this instead of simply making the die larger?

OK, so you've built GT300 and it's 625mm2, TSMC can't make it any bigger (I'm guessing about the maximum die size), so you want to stack some memory. Fair enough.
600-700 mm2 is the neighborhood that most large monolithic chips top out at.

Intel's Merced just about maxed the optical reticle at ~600 mm2, but a later Itanium edged higher, indicating there might have been some slack depending on the linear dimensions of the chip, or that later lithography tools had expanded reticle limits.

But is a die-stack-based GPU design scalable? As soon as you want to scale-down GT300 or optically-shrink it, you want to re-integrate that blob of memory.
It might be undesirable to do anything to scale the memory all that frequently. Multilayer die stacks are going to multiply defect rates, so the incremental maturation of a die's manufacturing steps becomes all the more important.
Perhaps it would be better to have a few standard footprint memory stacks, and then cut or bloat a GPU to fit.

One possible scheme to increase tolerance would be to structure each memory die as a grid of stitched-together memory sectors linked by a communications grid with redundant pathways. That way, the memory die can be scaled in size by lopping off sectors, and faults can be routed around.
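
A minimal sketch of that scheme, under assumed grid dimensions and fault positions: model the die as a mesh of sectors and count only the good sectors still reachable over the redundant grid.

```python
# Minimal sketch of the sector-grid idea above: disable faulty sectors and count only
# the good sectors still reachable over the mesh from the controller-facing edge.
# Grid size, fault positions, and per-sector capacity are illustrative assumptions.
from collections import deque

def usable_sectors(rows, cols, faulty):
    frontier = deque((0, c) for c in range(cols) if (0, c) not in faulty)
    reachable = set(frontier)
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in faulty \
                    and (nr, nc) not in reachable:
                reachable.add((nr, nc))
                frontier.append((nr, nc))
    return reachable

good = usable_sectors(4, 4, faulty={(1, 1), (2, 3)})
print(f"{len(good)} of 16 sectors usable, {len(good) * 4} MB at an assumed 4 MB/sector")
```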

This memory stack would either be topped with a specialized fault-tolerant switch network, or the GPU's memory controller would have some extra area devoted to selectively disabling certain portions of the memory stack and network.

One partial solution to the area problem and thence the scalability problem is that memory, being low-power, is amenable to self-stacking.
To a point.
High performance DRAMs on GPU boards have become somewhat borderline with regards to power, and those are not stacked with each other and then stacked under a 100+ watt GPU.

On the other hand, they wouldn't need the same I/O drivers and physical interfaces that board-mounted DRAM would require.

Stacking could potentially allow for more complex memory schemes or more simultaneously open DRAM banks, but these would also increase power dissipation concerns.

So, instead of 128MB of "GDDR5" costing ~100mm2, it could be made 10mm2, say as a stack of 10 layers. This macro-stack of memory can then be stacked upon a GPU and since it's considerably smaller than the GPU, the GPU-scaling problem disappears.

Of course this still presumes that the macro-stack is able to confer considerably higher performance for the GPU than traditionally configured GDDR5.
The relative cheapness of a 10mm2 GDDR5 chip could be counteracted by the reduced yield in stacking, additional manufacturing steps, the lower volumes of a rather niche product, and possibly freakish power and connection requirements.

A 10mm2 GDDR5 chip of equivalent performance to a full-sized variant is going to need a significant fraction of the power and I/O pinout in 1/10th the space.
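
Rough power-density arithmetic behind that concern: squeezing the same per-chip power into a tenth of the footprint is roughly a tenfold jump in W/mm². The ~5 W figure and both footprints below are assumptions.

```python
# Rough power-density comparison: same per-chip power in ~1/10th of the footprint.
# The ~5 W figure and both footprints are illustrative assumptions.
chip_power_w = 5.0
full_size_mm2 = 14 * 12    # ~168 mm^2, the 14x12 mm package footprint mentioned earlier
stacked_mm2 = 10           # hypothetical stacked-die footprint

print(f"{chip_power_w / full_size_mm2:.3f} W/mm^2 board-mounted")
print(f"{chip_power_w / stacked_mm2:.3f} W/mm^2 stacked, with a 100+ W GPU underneath")
```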

If stacked with a GPU, we find greater problems.
As Nvidia has shown, GPUs are not in a comfortable place when it comes to getting a full share of current and voltage through their pads.
What happens when a GPU must suck its power through the same pads as its memory stack, then run it through the longer traces passing through the memory as well?

Mechanical concerns with thermal cycling and problems with layers delaminating, shearing, and solder balls fracturing are likely to get worse as well.
Differing thermal expansion over a stack is something this patent doesn't concern itself with.
 
High performance DRAMs on GPU boards have become somewhat borderline with regards to power[...]
They have? GDDR on recent graphics cards isn't heatsinked as far as I can tell. They dissipate in the region of 5W for a die that's at least as big as RV740 which itself seems to be in the 50W+ range...

Stacking could potentially allow for more complex memory schemes or more simultaneously open DRAM banks, but these would also increase power dissipation concerns.
It seems to me that this kind of configuration is a prerequisite for stacking - there's just no point otherwise. And the radically reduced spacing of GPU and memory in this configuration would produce a significant power-saving for this kind of memory as compared with it being located in a standard DDR-style package across the PCB.

The relative cheapness of a 10mm2 GDDR5 chip could be counteracted by the reduced yield in stacking, additional manufacturing steps, the lower volumes of a rather niche product, and possibly freakish power and connection requirements.
Niche? GDDR5 is currently only in use by AMD. That's niche already! Surely it can stand being "niched" further :LOL:

My understanding of GDDR is that it's DDR memory technology with controller and I/O features specific to the latency-tolerant GPU view of the world.

Internally is a GDDR chip a single die? Or is it a memory die and a controller+I/O die?

A 10mm2 GDDR5 chip of equivalent performance to a full-sized variant is going to need a significant fraction of the power and I/O pinout in 1/10th the space.
A stack of these dies might use through silicon vias, which would be extremely dense (100,000 TSVs per square centimetre can be done). Packaged and then mounted on a GPU, sure, there's a ball density question, I agree.
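
Back-of-the-envelope on that density figure: 100,000 TSVs/cm² is 1,000/mm², so even a small stacked footprint has vias to spare inside the stack; the footprint and the signal/power split below are assumptions, and the ball pitch to the GPU underneath remains the tighter constraint.

```python
# 100,000 TSVs per cm^2 = 1,000 per mm^2. Footprint and signal/power split are
# illustrative assumptions; package-level ball density is the separate question.
tsv_per_mm2 = 100_000 / 100
footprint_mm2 = 10

total = tsv_per_mm2 * footprint_mm2
print(int(total), "TSVs available on a 10 mm^2 footprint")
print(int(total * 0.7), "for signals,", int(total * 0.3), "for power/ground (assumed split)")
```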

If stacked with a GPU, we find greater problems.
As Nvidia has shown, GPUs are not in a comfortable place when it comes to getting a full share of current and voltage through their pads.
What happens when a GPU must suck its power through the same pads as its memory stack, then run it through the longer traces passing through the memory as well?
Stacking another die upon a GPU should lower the GPU's power consumption. The alternative is either a larger die for the GPU or putting that functionality on the other side of an interface that consumes significantly more power, as well as needing a more intricate (and deeply buffered) controller interface.

So yeah, a GPU has to supply power to its "parasitic memory" but total power should be lower than a configuration without that stacked die.

Mechanical concerns with thermal cycling and problems with layers delaminating, shearing, and solder balls fracturing are likely to get worse as well.
Differing thermal expansion over a stack is something this patent doesn't concern itself with.
While I was skim-reading the patent document I was surprised to see that subject glossed over. But, of course, that's what patents do...

Jawed
 
They have? GDDR on recent graphics cards isn't heatsinked as far as I can tell. They dissipate in the region of 5W for a die that's at least as big as RV740 which itself seems to be in the 50W+ range...
Per-chip or in aggregate?
I'd say 5 W is close to the line.

Some of the original Pentium chips were ~11 W and needed a heatsink.
I've googled references to devices with 5 W dissipation needing heatsinks, typically in situations that are more constrained than what graphics boards experience, such as little air flow and wider temperature ranges.

*edit: Of course, stacking means 0 airflow and 100+ watts millimeters away...

Niche? GDDR5 is currently only in use by AMD. That's niche already! Surely it can stand being "niched" further :LOL:
Let's ask Qimonda how that went...

Internally is a GDDR chip a single die? Or is it a memory die and a controller+I/O die?
I don't have any die shots of GDDR5, but I think it's all one piece of silicon.

A stack of these dies might use through silicon vias, which would be extremely dense (100,000 TSVs per square centimetre can be done). Packaged and then mounted on a GPU, sure, there's a ball density question, I agree.
Another consideration is the current carrying capacity and mechanical strength of a TSV. The ones that feed the GPU with current could be a bit stockier than the signalling vias.

Stacking another die upon a GPU should lower the GPU's power consumption.
It reduces I/O power, but it also means there's an insulated multi-watt heater sitting under the GPU die.
Minimum die temperature throughout the GPU would be higher by default, and temperature also has an effect on power consumption and leakage.
Also, chips do pass a fraction of their heat down to the PCB through their pins, and the stacked chips are in the way.

So yes, the I/O power goes down, but how much of a GPU's power budget does that account for, versus the incremental increase in leakage across the entire chip?

So yeah, a GPU has to supply power to its "parasitic memory" but total power should be lower than a configuration without that stacked die.
It's a fair amount of power that is more concentrated.
Given how active GPU silicon is, and how much it leaks, incremental increases to a GPU's temperature can lead to a decent amount of leakage increase.
The GPU is the major contributor, so even small increases there can match savings on the minority portion of the power budget.
Of course, if the stacked memory also helps increase efficiency in execution per watt, it would go back to being a net win in that regard.

While I was skim-reading the patent document I was surprised to see that subject glossed over. But, of course, that's what patents do...
Sun researched a different kind of chip-stacking, but I think mechanical factors due to thermal cycling played a role in halting it.

I think a more in-depth characterization of the mechanical stresses and material factors would be needed.
Would TSVs weaken the silicon they pass through?
Would stacked chips lessen or increase the mechanical issues with substrate and GPU thermal stresses?
 
I think a more in-depth characterization of the mechanical stresses and material factors would be needed.
Would TSVs weaken the silicon they pass through?
Would stacked chips lessen or increase the mechanical issues with substrate and GPU thermal stresses?
After reading that IBM page that was linked earlier I'm left feeling that stacking for large, high-power, chips such as desktop CPUs and GPUs is not going to happen any time soon. I can well imagine that optical interconnects and perhaps even optical intra-die will come first :p

Jawed
 
GDDR5 was a higher-margin product line, but in the end insufficient to support the DRAM maker's full operations in a period of oversupply of standard DRAM components.

The niche was not enough to save Qimonda from insolvency.

In the context of this thread, I wouldn't expect a very complicated niche of a niche product to do it many favors.

Qimonda was hoping buried wordline technology would make it cost-competitive, and it would have been a technology that benefited the full device portfolio. If we could magically change history, I wonder if it would have been better off if buried wordline had come out when GDDR5 did, and GDDR5 were still a technology pending deployment.
 
If DRAM (or anything, for that matter) was stacked on top of a GPU, wouldn't thermals become a huge problem? It seems to me that would mean a lot of power per mm^3 to be dissipated.
 