I never really understood why HBM 1.0 should be limited to 1GB per stack. It seems like something that should be purely a function of density, and independent of the revision.
For a card launching at this point in 2015, the more immediate concern may be the density listed for the parts announced as just entering production.
HBM has a 2KB page size and can address 2^16 rows, and at least for other DRAM types, row and page are equivalent. That capacity per channel happens to give 1GB per stack when accounting for two channels per die and a 4-high stack.
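As a sanity check, here is that arithmetic spelled out (a minimal sketch; the 2^16 row addresses and two channels per die are the figures above, not anything beyond them):

```python
# Back-of-the-envelope HBM1 stack capacity from the figures above.
PAGE_BYTES = 2 * 1024          # 2 KB page (row) size
ROWS_PER_CHANNEL = 2 ** 16     # 16 bits of row address
CHANNELS_PER_DIE = 2
DIES_PER_STACK = 4             # 4-high stack

per_channel = PAGE_BYTES * ROWS_PER_CHANNEL                  # 128 MB
per_stack = per_channel * CHANNELS_PER_DIE * DIES_PER_STACK  # 1 GB

print(per_channel // 2**20, "MB per channel")  # 128 MB per channel
print(per_stack // 2**30, "GB per stack")      # 1 GB per stack
```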
Changing page size within the spec could be doable, as page size has gone up for the densest GDDR5 chips, and there seems to be room to grow for HBM in terms of how many columns it can address per row and how many banks it can address.
Larger pages may have some undesirable effects. There was a proposed HBM gen2 mode that would attempt to reduce the impact of page size, and that size was also 2KB.
(Source with address sizes and pseudo-channel mode: http://www.memcon.com/pdfs/proceedings2014/NET104.pdf)
The time frame for HBM2 seems like it could cut into the first gen's longevity, but with time there may be a bump in density.
I initially thought there was more steam behind the effort to get to what appeared to be a more compelling HBM2, but on further review I do not see anything I can interpret as ruling out the first generation.
HBM2 is 4 or 8GB. Can someone enlighten me: do you need an 8192-bit bus for HBM2.0, or is it just a question of the density of the stacked dies (2Gb instead of 1)?
I have not seen HBM2 spelled out to the extent HBM has been, but it appears to introduce more capability for managing more internal banks, a longer burst length, and a taller stack height. The bus width doesn't appear to have changed, since projections give it double the bandwidth at the same time it doubles the data rate per IO.
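To illustrate why the bus width wouldn't need to change, a minimal sketch (the per-pin rates are my assumption, chosen to match the usual 128 GB/s-per-stack figure quoted for gen1):

```python
# Per-stack bandwidth: same 1024-bit bus, doubled per-pin data rate.
BUS_BITS = 1024  # 8 channels x 128 bits per stack, unchanged in HBM2

def stack_bandwidth_gb_s(gbps_per_pin):
    """Total stack bandwidth in GB/s for a given per-pin data rate."""
    return BUS_BITS * gbps_per_pin / 8  # bits -> bytes

print(stack_bandwidth_gb_s(1.0))  # HBM1 at 1 Gbps/pin -> 128.0 GB/s
print(stack_bandwidth_gb_s(2.0))  # HBM2 at 2 Gbps/pin -> 256.0 GB/s
```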
Page 8 of AMD's PDF shows 8 stacks of HBM on a GPU/APU interposer.
To be fair, that diagram has a Jaguar APU in the middle, so perhaps the rule of looking cool is in play.
HBM at ~512GB/s and Tonga-style bandwidth efficiency should be in the region of 100% faster than Hawaii, if the overall design is balanced properly, e.g. with something like 96 CUs and 128 ROPs.
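Roughing out those ratios against Hawaii's shipping numbers (290X figures; the 96 CU / 128 ROP configuration is the speculative one from above):

```python
# Hypothetical balanced HBM part vs. Hawaii (R9 290X: 44 CUs, 64 ROPs, 320 GB/s).
hawaii = {"CUs": 44, "ROPs": 64, "BW_GB_s": 320}
hbm_part = {"CUs": 96, "ROPs": 128, "BW_GB_s": 512}  # speculative config

for key in hawaii:
    print(f"{key}: {hbm_part[key] / hawaii[key]:.2f}x Hawaii")
# CUs ~2.18x, ROPs 2.00x, raw bandwidth only 1.60x -- Tonga-style
# bandwidth efficiency would have to stretch that 1.60x toward 2x.
```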
Perhaps we have to conclude that AMD decided to go with HBM despite not having access to a node that's better than 28nm.
HBM itself wouldn't free up the power budget to make the doubled hardware fit without other alterations.
It may also be that if there was a time to monetize the work put into Gen1 HBM it would be this gen of GPUs. AMD could be obligated to use the memory in some capacity since it roped Hynix in, and at this point AMD has spun off a fair amount of high-speed IO expertise, which may mean alternatives could be limited.
Can GCN scale up to 96 CUs, or is 64 the practical limit? Can GCN's cache architecture scale up to HBM?
Is AMD planning to put L3/ROPs/buffer caches in logic at the base of each stack, something that would presumably happen on the next process node?
I would say that at least theoretically, the hardware would be able to distribute itself well enough in terms of shader engines and L2 slices per controller. HBM's biggest differentiator is a larger number of channels, but at least with 4 GB it's 32 channels versus Hawaii's 16, which may not be insurmountable.
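Spelling out the channel counts (the 32-bit GDDR5 channel width is the standard figure; the HBM numbers follow from 8 channels per stack):

```python
# Memory channel counts: four 1GB HBM stacks vs. Hawaii's 512-bit GDDR5 bus.
hbm_channels = 4 * 8            # 4 stacks x 8 channels per stack
hawaii_channels = 512 // 32     # 512-bit bus / 32-bit GDDR5 channels

print(hbm_channels, "HBM channels vs", hawaii_channels, "GDDR5 channels")
# 32 vs 16: twice as many channels for the L2 slices and controllers to feed.
```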
The next process node, if it is one of the 16nm class, provides the hardware and power budget to do something with the spike in bandwidth. Doing something with the stacks themselves seems more tentative and long-term.
Maybe for AMD, HBM is cheaper overall than a traditional GDDR5 setup at this performance level. It seems unlikely, though: die area saved versus the cost of the interposer. Are AIBs going to eat the cost of HBM memory? Is 8GB of HBM going to be cheaper than 12GB of GDDR5?
I do not think the AIBs would want to absorb the cost directly, like they would for on-PCB DRAM. They wouldn't be the ones mounting the HBM on the interposer, so that package is someone else's problem.
Interposer + HBM are two big, risky changes that can't arrive independently in discrete graphics. Being stuck on 28nm is a nightmare, but it's not the first time discrete graphics has been stuck on a process, so AMD knew it was likely.
Anyway, dropping in HBM to compete against a non-HBM architecture seems like solving the wrong problem.
HBM effectively assumes interposer without something more exotic taking its place.
The magnitude of the slip seems larger. 28nm took its time, and the likely next process represents two nodes' worth of a wait.
Which looks worse given that NVIDIA's reported TDPs aren't actually what any of their cards run at during peak.
I was under the impression that the reference 980s did keep to the marketed TDPs, but that the third-party cards were permitted to alter the limits and there's not much interest in discussing it.
But assuming that HBM is way more expensive than GDDR5 is unwarranted. The R&D costs were certainly large, but there's no indication that the fab costs are grievously higher. It's still silicon, the same material, presumably with roughly the same patterning and so on, just with clever engineering to make it stack.
The shift to stacked memory is a very significant change, as is the interposer and an MCM-style integrated chip.
The silicon fabrication at a die level is well-understood.
The thinning needed for stacked DRAM, the TSVs, the cost of the interposer, and a significantly different set of thermal and mechanical concerns are a big change.
The manufacturing process is longer, there are additional manufacturing costs, and getting something wrong can compromise a lot more silicon than it used to.
GDDR5 is an established standard with very mundane requirements relative to HBM, and the cost adder of the new memory may keep HBM even more niche than GDDR5, whose volumes are already comparatively small versus the really cheap standard memory types.