AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

As the separate stacks would sit in very close proximity on the same logic die, I wouldn't expect much of a timing difference.
So this is standard stacks with standard base layers, that are themselves stacked on a die, that is set on the interposer rather than some kind of custom larger base die.

And the signals from the GPU have to run through the PHYs on the base die anyway. Routing the data and address lines to the TSV contacts of the second stack just a few millimeters (the size of a 2 GBit die) away probably costs not much more time than the signals would spend traveling inside a larger 4 GBit die when you address another bank there.
It is physically a different chip, in a standard that does not promise that independent channels are necessarily in sync. The link die could impose some additional synchronization between stacks that otherwise operate as if they were alone, such as handling a possible corner case where refresh timings shift between the two halves of the same channel.
The standard shouldn't care, but the memory controller would stand to benefit from being aware of possible differences in things like bank activation limits.
 
Perhaps "dual link interposing" means that there's a small interposer between the GPU interposer and a pair of HBM stacks, which is only present for 8GB HBM. With a 4GB HBM configuration, the secondary interposer is not present and each HBM stack interfaces directly to the GPU interposer.

The secondary interposer is dumb, but each base die in the pair of stacks is now connected to its mate and there's a protocol for these two dies to share the 1024-bit bus back to the GPU.
 
But what's stopping GCN 1.1/1.2 from being 12_0?
Lack of proper conservative rasterization support?
Oops, that's 12_1 already.
So this is standard stacks with standard base layers, that are themselves stacked on a die, that is set on the interposer rather than some kind of custom larger base die.
No, I meant 4 memory dies stacked on a shared custom base die (stacked on the Si interposer). You basically distribute half of the banks of a channel to a different die. The clocking would of course be synchronized.
If that is not good enough, it should be possible with little effort on the base die to group the two channels (with 8 banks each) from each 2 GBit die into a virtual 128-bit channel with 16 banks. In that case you stay on the same die, so your timing concerns should mostly vanish. Channels 0-3 as seen from the GPU would be on one stack, channels 4-7 on the other. It would basically mimic an 8-Hi stack (or two 4-channel 4-Hi stacks), just implemented as two 4-Hi stacks next to each other with a custom base die.
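A toy model of that grouping might look like the sketch below (all names and parameters are illustrative, not from any spec): each 2 GBit die contributes one virtual 128-bit, 16-bank channel, and the eight GPU-visible channels split cleanly between the two stacks.

```python
# Toy model of the proposed grouping: one virtual channel per die,
# 128 bits wide with 16 banks (two 8-bank physical channels fused),
# GPU channels 0-3 on stack 0 and 4-7 on stack 1. Illustrative only.

DIES_PER_STACK = 4  # 4-Hi stacks

def virtual_channel(stack, die):
    return {
        "gpu_channel": stack * DIES_PER_STACK + die,
        "stack": stack,
        "die": die,
        "width_bits": 128,  # as described in the post
        "banks": 16,        # 2 physical channels x 8 banks each
    }

mapping = [virtual_channel(s, d) for s in (0, 1) for d in range((DIES_PER_STACK))]

# Every virtual channel stays on a single die, so no access crosses dies,
# and the stack boundary falls exactly between GPU channels 3 and 4.
assert [c["stack"] for c in mapping] == [0, 0, 0, 0, 1, 1, 1, 1]
assert [c["gpu_channel"] for c in mapping] == list(range(8))
```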

An active interposer handling this with two standard stacks (including a standard base die) would probably not be cost efficient, considering the size will probably sit right at the reticle limit (26x32mm). The same goes for an additional layer underneath two standard stacks, although this remains a possibility. Another idea would be to daisy-chain two HBM stacks on a single interface (with a similar splitting of the channels between the stacks: the first base die routes requests going to the upper half of the channels to the second base die, and the connections between the two stacks run through the interposer). But this would also necessitate a custom base die and would complicate the timing issues.
 
It needs to be done in a fashion where the GPU only sees 4x 1024-bit buses. IMO the most likely theory is that two stacks of DRAM share the same logic die underneath them: instead of 8x 1-stack, it's 4x 2-stack.
 
Whatever the dual-link solution is (we should discover it soon), they did not develop it in one month, as some articles want to suggest.
 
No, I meant 4 memory dies stacked on a shared custom base die (stacked on the Si interposer). You basically distribute half of the banks of a channel to a different die. The clocking would of course be synchronized.
Just to make sure I'm interpreting this correctly, this would be 2 4-hi stacks without a base layer stacked onto a custom shared base layer. The layers/dies/chips/stacks terms are flying around thick in this discussion.

I'm curious with the rise in channels and the potential for a compromised mounting on the interposer whether a salvage SKU could exist where there is a non-integral amount of memory (edit: non-multiple of 8 channels), rather than tossing a large amount of finished silicon at the end of the process.
We already have partially disabled GPUs without a full complement of memory channels. The oddity here would be that the chips would be physically there.
 
A stack might be better called a module, since the base die is integral to the memory dies above it. Each of the 4 memory dies in a stack has two channels, each of which has a 128-bit data bus, so the combination of 4 dies, 2 channels and 128-bits is 1024 bits.
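The arithmetic in that post, spelled out as a quick sanity check:

```python
# Bus width of a 4-Hi HBM stack as described above:
# 4 dies x 2 channels per die x 128 bits per channel.
DIES = 4
CHANNELS_PER_DIE = 2
BITS_PER_CHANNEL = 128

bus_width = DIES * CHANNELS_PER_DIE * BITS_PER_CHANNEL
assert bus_width == 1024
```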
 
Just to make sure I'm interpreting this correctly, this would be 2 4-hi stacks without a base layer stacked onto a custom shared base layer.
Yes.
The layers/dies/chips/stacks terms are flying around thick in this discussion.
That's why I tried to restrict myself to a somewhat consistent use of just dies and stacks ;).
I'm curious with the rise in channels and the potential for a compromised mounting on the interposer whether a salvage SKU could exist where there is a non-integral amount of memory (edit: non-multiple of 8 channels), rather than tossing a large amount of finished silicon at the end of the process.
We already have partially disabled GPUs without a full complement of memory channels. The oddity here would be that the chips would be physically there.
In principle, HBM allows for fewer than 8 channels in a stack. Whether that would make sense for a salvage model compared to just using 3 stacks, I don't know. But maybe someone will offer HBM stacks with, let's say, 6 channels and one die less at a lower price point than a fully featured 8-channel stack in the future, who knows?
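For what it's worth, the capacity math for such a hypothetical cut-down stack (assuming the 2 GBit dies discussed earlier in the thread) would come out like this:

```python
# Capacity of a 4-Hi stack built from 2 Gbit dies, versus a hypothetical
# salvage stack with one die (two channels) removed. Illustrative math only.
GBIT_PER_DIE = 2

def stack_capacity_mib(dies):
    # Gbit -> MiB: multiply by 1024 Mbit per Gbit, divide by 8 bits per byte
    return dies * GBIT_PER_DIE * 1024 // 8

assert stack_capacity_mib(4) == 1024  # full 8-channel stack: 1 GiB
assert stack_capacity_mib(3) == 768   # 6-channel salvage stack: 768 MiB
```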
 
Whatever the dual-link solution is (we should discover it soon), they did not develop it in one month, as some articles want to suggest.

Given the lead times in manufacturing, even if development time was near-zero, manufacturing and integration into a product probably couldn't be done on such short notice either.
This seems like it could have been initiated earlier, and might represent a preliminary version of the expanded functionality that is part of HBM2 proper.
 
Why would you assemble a module from one or more known-bad memory dies? Every memory die is tested and only the good ones are picked for assembly into a stack. The idea that you find out after you've assembled the stack that it doesn't work is absurd.
 
Why would you assemble a module from one or more known-bad memory dies? Every memory die is tested and only the good ones are picked for assembly into a stack. The idea that you find out after you've assembled the stack that it doesn't work is absurd.
My scenario is a compromised mounting on the interposer. The stacks should be provided on a known-good basis (unless there's a value RAM option), but the mounting process itself is an additional step with some 0.xxxx% error rate.
At that point, why toss the interposer and anything else already mounted if it turns out one channel has a few bad bumps?
The same might happen if one memory channel out of 32 is dodgy on the GPU.
 
But maybe someone will offer HBM stacks with let's say 6 channels and one die less at a lower price point than a fully featured 8 channel stack in the future, who knows?

I was thinking more about salvage for defects in assembly, but that would be one way to salvage DRAM stacks, assuming the incremental cost for the sorting is outweighed by the revenue brought in.
 
The design of the HBM could be based on the PIM work they (AMD Research) did in 2014. We discussed this a while ago, I just can't remember/find which thread it was in :)

"Throughput-Oriented Programmable Processing in Memory" - AMD Research 2014


[attached images]
 
Why would you assemble a module from one or more known-bad memory dies? Every memory die is tested and only the good ones are picked for assembly into a stack. The idea that you find out after you've assembled the stack that it doesn't work is absurd.
There is such a thing as packaging yield (losses are usually in the very low single digits). I don't know if there's an official term for the stacking step, but let's call it micro-bonding yield. I expect the latter to be worse than the former.
 
The design of the HBM could be based one the PIM work they (AMD Research) did in 2014.. We discussed this a while ago, just can't remember/find which thread it was in :)
The stacked memory may have come first. The paper references die stacking, HMC, and HBM as items in its recent past. HBM has slides going back as far as 2010.
 
Isn't the current information suggesting that all GCN GPUs support Tier 3 Resource Binding (17% of Steam DX12 hardware* supporting it), and that they just miss some of the other DX12 features which Fiji would have?

*DX12 hardware = hardware that supports DX12, even if it's limited to 11.x feature levels

IIRC, there are multiple features that are classified from Tier 1-3; Resource Binding is only one of them. While Hawaii (should be Hawai'i :p) has Tier 3 for resource binding, it likely lacks it for some others.

Regards,
SB
 