Answer to future memory bandwidth problems

iwod

I have always wondered how we will get more graphics performance when we will soon be hitting the "memory wall".

SoCs and iGPUs are even more bandwidth limited, but it looks like something may come to the rescue.



Micron demonstrated a prototype of a possible future DRAM technology at the Hot Chips conference.

Called Hybrid Memory Cube (HMC), the technology combines a logic layer with a stack of memory chips that are vertically connected with through-silicon vias (TSVs). According to Micron, the large number of contacts as well as the short distances enable dramatically higher data transfer rates than today's memory architectures. The prototype shown at Hot Chips was rated at 128 GBps.

In comparison, current DDR3-1600 devices deliver 12.8 GBps. Micron claims that a single HMC could deliver about 20 times the bandwidth of a DDR3 module, while it consumes substantially less energy - only 10 percent of the energy per bit that DDR3 uses. According to the manufacturer, the architecture also requires about 90 percent less space than current RDIMMs.
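For what it's worth, the DDR3 figure follows directly from bus width and transfer rate. A quick back-of-the-envelope sketch (the 64-bit bus width and my reading of the 20x claim are assumptions on my part, not from the article):

```python
# Back-of-the-envelope check of the figures quoted above.
# DDR3-1600: 64-bit bus, 1600 MT/s (million transfers per second).
ddr3_bus_bits = 64
ddr3_transfers_per_s = 1600e6
ddr3_gb_per_s = ddr3_bus_bits / 8 * ddr3_transfers_per_s / 1e9
print(f"DDR3-1600 peak bandwidth: {ddr3_gb_per_s:.1f} GB/s")  # ~12.8 GB/s

hmc_prototype_gb_per_s = 128.0  # figure quoted for the Hot Chips prototype
print(f"Prototype vs one DDR3-1600 device: {hmc_prototype_gb_per_s / ddr3_gb_per_s:.0f}x")  # ~10x
# Micron's "about 20x a DDR3 module" claim presumably refers to a production
# part and/or a different baseline than the single-device figure above.
```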

Micron does not provide any information on when HMCs will be available for purchase, but it pitches the technology as a way to break through the "memory wall", a term that commonly refers to the relatively small gains in memory performance and efficiency compared to the rest of the system. The memory is designed to be used either in close proximity to the CPU in performance-oriented systems or as far memory in systems built for better power efficiency.

http://www.tomshardware.com/news/hybrid-memory-cube-micron-stacked-memory,13277.html
 
Stacking ICs isn't exactly new - nor are through-silicon vias. The thing with this tech is that you can't (easily) use it for general computing because it's not expandable, and stacking ICs on top of each other complicates cooling a lot.

You can't really put RAM on top of a 300W GPU, it'd cook itself, and you can't really put it underneath either because it has so many power and ground connections that there'd be precious little room for any RAM. Any chip sitting underneath such a processor would look like Swiss cheese with hundreds of vias drilled through it, and I can imagine fitting actual logic structures in between such constraints could be a bit frustrating for designers. Also, I'd imagine you'd need to custom-craft RAM for each processor as not every CPU or GPU would have the same number of connections or have them located in the same places...

It's a neat idea, but until someone figures out how to use it in anything but mobile applications it's going to remain fairly niche tech.
 
So this will be 5 years away for how long?

Similar stuff is already in use in cellphones and the like. The problems are mostly related to dissipating heat from a high-end chip through a stack of ram. In principle, this could be picked up by any gpu or cpu manufacturer today if they felt the drawbacks were worth it.

I think that it might have a future in the next generation of consoles -- as it provides much of the same benefits as eDRAM, but is a *lot* cheaper.
 
Similar stuff is already in use in cellphones and the like.
Hmm? I've never heard of any *shipping* cellphone/tablet/whatever with TSV including logic chips (unless you define SiP or PoP as TSV which is just confusing for no good reason). I think there's some TSV for memory-only stuff as well as some stuff with silicon interposers although I might still have trouble naming any shipping commercial end-product with it.

Sadly the reality is that the first generation of TSV will be very expensive and won't actually save as much power as some of the hype may imply. Look at the graph on Slide 9 of this presentation: http://www.cadence.com/cdnlive/library/Documents/2011/EMEA/MEM07_FreundC_STEricsson_PUBLIC.pdf - yay, we just went from 1200mW to 600mW for the same amount of bandwidth if we include the memory controller. That's slightly better than the usual generation jump but it's hardly mind blowing.
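In energy-per-bit terms (the sustained bandwidth used below is purely an illustrative assumption on my part; the slide doesn't state it):

```python
# Convert the quoted interface power into energy per bit.
# ASSUMPTION: a nominal 6.4 GB/s of sustained bandwidth, picked only to
# illustrate the calculation; the slide does not state the actual figure.
def energy_pj_per_bit(power_mw: float, bandwidth_gbytes_s: float) -> float:
    bits_per_s = bandwidth_gbytes_s * 8e9
    return power_mw * 1e-3 / bits_per_s * 1e12  # picojoules per bit

assumed_bw = 6.4  # GB/s, hypothetical
print(energy_pj_per_bit(1200, assumed_bw))  # LPDDR2 + controller: ~23 pJ/bit
print(energy_pj_per_bit(600, assumed_bw))   # Wide I/O + controller: ~12 pJ/bit
```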

I had a very good conversation with an engineer from SPMT (a hybrid serial memory standard competing with WideIO for handhelds) back in February at MWC11. He was honest in admitting that he believed WideIO was definitely going to happen and did have momentum, but if SPMT can get some big names on board besides Marvell, then based on everything I know I'm confident they can be a very successful cost-centric competitor to WideIO by sticking to PoP instead of silicon interposers or TSV.

---

I could see this as making a lot of sense for next-gen consoles if it wasn't for the fact logic+memory TSV in that power range definitely won't be mature enough by then. Memory-only TSV and/or silicon interposers could be interesting and commercially viable but not quite as revolutionary.
 
Even if the bw/W doesn't decrease dramatically, it should make a reasonable dent in latency, right? If yes, then I'd argue that this change is well worth it since mem latency is the biggest bottleneck anyway.
 
Even if the bw/W doesn't decrease dramatically, it should make a reasonable dent in latency, right? If yes, then I'd argue that this change is well worth it since mem latency is the biggest bottleneck anyway.
TSV in itself can definitely help a lot for latency, but Wide I/O specifically won't really help much. Look at this article: http://chipdesignmag.com/sld/blog/2011/08/25/will-wide-io-reduce-cache/

The first two guys don't seem very specific, but the last one (Steve Hamilton) makes three very good points it seems to me:
1) To minimise latency you want a custom memory chip to match the floorplan of the SoC. This obviously won't be the case in a standards-based product.
2) Wide I/O might use a 512-bit bus but the initial standard is still only 200MHz SDR (see the quick sketch after this list).
3) The DRAM latency itself doesn't change.
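
A quick sketch of what point 2 implies for peak bandwidth (the 4x128-bit channel organisation is my reading of the draft spec, so treat it as an assumption):

```python
# Peak bandwidth of a 512-bit bus at 200 MHz SDR (point 2 above).
bus_bits = 512        # commonly described as 4 x 128-bit channels (my assumption)
clock_hz = 200e6      # SDR: one transfer per pin per clock
peak_gb_per_s = bus_bits / 8 * clock_hz / 1e9
print(f"Wide I/O peak bandwidth: {peak_gb_per_s:.1f} GB/s")  # 12.8 GB/s
# The huge pin count buys back the bandwidth lost to the slow clock,
# but the slow clock itself does nothing for access latency.
```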

Now I can think of a few counter arguments, such as:
1) Couldn't you position the CPU (which is the most latency-sensitive part) so as to minimise latency for the TSV memory? Or does SoC position actually not matter given how the standard works? Is it even ever possible to really position such a small block as the CPU to have fast access over the entire DRAMs given how much larger the DRAMs are?
2) LPDDR1 is 400MHz effective, LPDDR2 is 800MHz effective, LPDDR3 is 1600MHz effective - it seems to me all of those numbers are really just 200MHz raw. So in terms of latency this shouldn't really matter unless you're bandwidth constrained (which isn't the point).
3) I'm not sure why he's implying that an eDRAM array itself is inherently lower latency than a DRAM array, although I suppose it's true memory manufacturers generally optimise for cost (i.e. density) while meeting spec whereas eDRAM tends to replace large SRAM arrays where latency does matter to some degree.

So errr, yeah, my conclusion is that I'm not completely sure (and it's telling that I can't find any solid information); I'd expect an improvement but nothing mind blowing. Also, on the specific question of cache that this article addresses, mobile caches are as much if not more about saving power than improving performance, and they'll still be as useful for that.
 
As long as you have enough parallelism and transistors you can keep exchanging bandwidth for latency (moar threads). Number of transistors is still growing, we have plenty of parallelism left ... bandwidth is holding us back.
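
That trade is essentially Little's law: the concurrency needed to keep the memory system busy is bandwidth times latency. A minimal sketch with illustrative numbers (the 100 ns latency and 64-byte request size are assumptions, not measurements):

```python
# Little's law applied to memory: outstanding_bytes = bandwidth * latency.
# All figures below are illustrative assumptions, not measurements.
def outstanding_requests(bandwidth_gbytes_s: float, latency_ns: float,
                         request_bytes: int = 64) -> float:
    """How many 64-byte requests must be in flight to sustain the bandwidth."""
    bytes_in_flight = bandwidth_gbytes_s * 1e9 * latency_ns * 1e-9
    return bytes_in_flight / request_bytes

print(outstanding_requests(12.8, 100))   # DDR3-class bandwidth: ~20 requests in flight
print(outstanding_requests(128.0, 100))  # HMC-class bandwidth: ~200 requests in flight
# More bandwidth at the same latency just means you need more parallelism
# (threads / outstanding misses) to hide it, which GPUs have in abundance.
```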
 
What about the cost proposition of using eDRAM over TSV? That should improve latency and bw a lot. I guess someone like Apple could try it, they have the lowest BoM anyway.
 
As long as you have enough parallelism and transistors you can keep exchanging bandwidth for latency (moar threads). Number of transistors is still growing, we have plenty of parallelism left ... bandwidth is holding us back.
Oh yes, I was obviously only considering the CPU part of the SoC. And even then with the Cortex-A15's OoOE becoming quite advanced and ARM having confirmed that future generations will eventually add SMT, by the time anything fancy like this is a commercially viable option it probably won't make as much of a difference anymore...
rpg.314 said:
What about the cost proposition of using eDRAM over TSV? That should improve latency and bw a lot. I guess someone like Apple could try it, they have the lowest BoM anyway.
If you're using eDRAM without doing something fancy à la Xenos with on-chip ROPs, there's obviously no point. eDRAM is clearly much more expensive per bit than regular DRAM on a DRAM-only process. But you're obviously completely right that Apple is in an extremely good position to switch to proprietary memory chips with TSV instead of standard Wide I/O in the future, in order to be able to match the floorplans to each other. I'm not sure how much of an advantage it would give them, but they're the only ones in a position to do it in handhelds and it'd certainly be very interesting.
 
WideI/O leaves a lot to be desired, for technical and practical reasons, and Micron's approach is a substantial improvement.

The problem with WideIO is that it relies on using a wide, DRAM-optimized interface directly between logic and DRAM. This is good because it means that the approach is fundamentally compatible with existing DRAM process technology, and will not substantially add cost (which is critical to achieving good economics).

However, the problem is that a DRAM-optimized interface is not good for logic. DRAM transistors are designed to be very low leakage and do not operate efficiently at high switching speeds. Logic processes are designed for efficient, high performance transistors with reasonable leakage. So using a wide and slow interface is very inefficient because you end up spending far too many pins and too much power. For logic, interfaces like PCI-E, QPI, HT and FB-DIMMs (IBM uses a variant of FBD as well) are much more optimal - notice how these are all 5-8GT/s interfaces compared to ~1.6GT/s (for similar trace length DDR3; I'm ignoring GDDRx because GDDRx is optimized for much shorter trace lengths).
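
A rough way to see that pin-count trade-off: the signal pins needed for a given bandwidth at different per-pin rates (a simplification that ignores command, clock and ECC pins; the target bandwidth is just an illustrative assumption):

```python
# Rough signal-pin count for a target bandwidth at a given per-pin transfer rate.
# Simplified: ignores command/address/clock/ECC pins and protocol overhead.
def pins_for_bandwidth(target_gb_per_s: float, gt_per_s: float) -> float:
    bits_per_s = target_gb_per_s * 8e9
    return bits_per_s / (gt_per_s * 1e9)

target = 12.8  # GB/s, illustrative
print(pins_for_bandwidth(target, 6.4))  # ~16 lanes at 6.4 GT/s (serial links use
                                        #  differential pairs, so double the pins)
print(pins_for_bandwidth(target, 1.6))  # ~64 pins at DDR3-like 1.6 GT/s
print(pins_for_bandwidth(target, 0.2))  # ~512 pins at Wide I/O-like 200 MT/s SDR
```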

The elegance of Micron's approach is that it does not impose any extra costs on either the DRAM (in the form of requiring fast transistors) or the logic (in the form of requiring too many pins). In fact, it is conceptually akin to a package-level version of FBD. It uses high speed, narrow links between the logic and a bridge chip. Then a wide interface runs between the bridge chip and the DRAM.

The bridge chip itself is implemented in an older logic process, so that it can run at high speed. It is also a much cheaper piece of silicon and can afford to spend area on pins for a wide DRAM interface. This acts as a very logical decoupling mechanism and avoids imposing costs on either the logic or the DRAM. In contrast, WideIO requires spending far too many pins to achieve a reasonable bandwidth.

In terms of actual status, WideIO is not remotely ready for prime time. There was a Samsung paper at ISSCC that indicated ~57% yields for a WideIO test chip. So both proposals are still in the R&D phase and quite a way from commercialization. Also, WideIO isn't really 3D - it's merely stacking a single DRAM layer on top of logic - call it 2.5D : ) Micron's proposal is about stacking MANY DRAMs on top of logic, so you get to fully exploit the 3rd dimension.

I am pretty confident that none of the high performance logic manufacturers will consider WideIO - including AMD, IBM, intel and Oracle. That's problematic, as they collectively represent the biggest market for DRAM.

OTOH, they are far more likely to consider Micron's approach. So I see a lot more promise there.

David
 
Would there be any feasibility in manufacturing something like Micron's bridge chip directly on the same wafer as the one that will host the DRAM? I.e., running the same wafer twice through a fab, first etching one structure, say the interface, using the older process you mentioned, then the DRAM, and connecting the two in the wiring stage of manufacturing...

Nobody's done stuff like that in the past?
 
Thanks David, that's a very nice explanation of Micron's approach! It does sound like a good idea. I'm surprised they'd get such massive gains from it though - 90% is quite a lot! Then again they're probably comparing themselves to 1.5v DDR3, whereas in mobile they'd really be competing with 1.2v LPDDR2. It's interesting that voltage scaling seems to be slowing down even more for DRAM by the way, as LPDDR3 and Wide I/O are both stuck at 1.2v while DDR4 is also at 1.2v and *might* eventually get to 1.1v or 1.05v in ultra-low-voltage versions. Then again it might mostly be LPDDR2 which was reducing it unusually fast with 1.2v...
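For context on those voltage numbers, I/O switching power scales roughly with V², so the supply drops alone only buy so much (a first-order estimate that ignores termination schemes and leakage):

```python
# First-order switching-power scaling with supply voltage (P ~ C * V^2 * f).
# Ignores termination and leakage, so treat it as a rough bound.
def relative_switching_power(v_new: float, v_old: float) -> float:
    return (v_new / v_old) ** 2

print(relative_switching_power(1.2, 1.5))   # ~0.64: the 1.5v -> 1.2v step
print(relative_switching_power(1.05, 1.2))  # ~0.77: a possible future ULV step
```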

In terms of actual status, WideIO is not remotely ready for prime time. There was a Samsung paper at ISSCC that indicated ~57% yields for a WideIO test chip.
Is that for the chip or the full package? I can't see what could be the problem for the chip itself. Also, is that silicon interposers or TSV? I've heard the first implementations might be interposer-based. I'm not sure how it compares in terms of power; supposedly interposers are more expensive than TSV in theory, but of course cheaper in reality until TSV is mature enough.

I am pretty confident that none of the high performance logic manufacturers will consider WideIO - including AMD, IBM, intel and Oracle. That's problematic, as they collectively represent the biggest market for DRAM.
Oh, absolutely. Wide I/O has always been targeted primarily as a LPDDR2 replacement though, so I doubt they care. Maybe they had ambitions to expand beyond the initial version but I wouldn't know. There's no need for a stack when a single 8Gbit chip is enough. Of course, is it enough, and is it cost effective? ST-Ericsson seems to want to add an extra LPDDR2 channel to that for memory capacity reasons, which seems insane to me, but whatever. Frankly given all the trade-offs with Wide I/O I'd much rather see the mobile industry move more towards SPMT, but then again Wide I/O is a good proving ground for TSV.

OTOH, they are far more likely to consider Micron's approach. So I see a lot more promise there.
It does seem like a good approach. In the past bridge chips required Yet Another Package, but with TSV it starts making a lot of sense. Let's see if JEDEC or someone else picks up the idea.

Grall: Even if that was possible, it would be more expensive for the DRAM itself and make it impossible (or much harder) to stack multiple DRAM dies.
 
So using a wide and slow interface is very inefficient because you end up spending far too many pins and too much power.
Without numbers to prove this I seriously doubt it. The number of pins for wide-I/O might be large, but compared to the actual number of transistors it's still tiny. So needing a bit more transistors to deal with the slower data per pin seems to me unlikely to matter.
For logic, interfaces like PCI-E, QPI, HT and FB-DIMMs (IBM uses a variant of FBD as well) are much more optimal - notice how these are all 5-8GT/s interfaces compared to ~1.6GT/s (for similar trace length DDR3; I'm ignoring GDDRx because GDDRx is optimized for much shorter trace lengths).
They are all not designed for mobile applications either, while Wide I/O is.
I am pretty confident that none of the high performance logic manufacturers will consider WideIO - including AMD, IBM, intel and Oracle. That's problematic, as they collectively represent the biggest market for DRAM.
This is a complete non sequitur; Wide I/O wasn't developed for the markets these guys are in.

All this is completely off topic for the forum; neither Wide I/O nor Micron's memory offers any opportunity to boost GPU memory speed ... they both use technology which could be useful for GPU memory, but the standards as such are useless for it.
 
It seems Intel just threw its lot in with Micron.

http://www.anandtech.com/show/4819/...lop-hybrid-memory-cube-stacked-dram-is-coming

The question of heat dissipation is still unresolved, but it could rewrite mobile systems.
I read these two reports (in French, but you can still read the slides :) ); to me it's unclear where the tech is headed. The form factor is not definitive for sure, but the thing seems to trade capacity for bandwidth. I wonder about the capacity of those "cubes"; it's a gen1 prototype, but 512MB is low.
Some data:
DDR3-1333 / 4 GB / 4.6 W / 10.7 GB/s
cube gen1 / 512 MB / 8 W / 128 GB/s

I wonder about how GDDR5 would compare for example.
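
Taking the quoted figures as bandwidth per watt, and adding a rough GDDR5 data point for comparison (the GDDR5 numbers are my own ballpark assumption, not from the reports):

```python
# Bandwidth per watt from the figures quoted above, plus a rough GDDR5 ballpark.
configs = {
    "DDR3-1333 module (4GB)":  (10.7, 4.6),   # GB/s, watts (from the reports)
    "HMC gen1 cube (512MB)":   (128.0, 8.0),  # GB/s, watts (from the reports)
    "GDDR5 device (ballpark)": (28.0, 2.5),   # GB/s, watts -- my rough assumption
}
for name, (gbps, watts) in configs.items():
    print(f"{name}: {gbps / watts:.1f} GB/s per watt")
# DDR3 ~2.3, GDDR5 ~11 (assumed), HMC gen1 ~16 GB/s per watt.
```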
 
3d chip stacking is undoubtedly one of the next major advances in the semiconductor industry. I think it will become a major player in chip packaging if not THE major player within the next decade. There are of course hurdles to overcome but the benefits in bandwidth and performance are substantial and the problems are gradually being solved like everything else in the industry. Cost is an initial concern, but this is the case for every new technology. As it becomes mainstream, continual engineering and competition will make it cost effective. See the following article for an example:

http://www.engadget.com/2011/09/07/ibm-and-3m-join-forces-to-fab-3d-microchips-create-mini-silicon/
 
How do you dissipate heat on those things?
One way is to build miniature heatpipes or other heat-transfer structures inside those chips. You can't rely on only transferring heat through the external contact area, as with current chips.
 