AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

On top of that, they are including an AIO cooler as their "reference" design, which is somewhat alarming (if the rumors are true, of course). The leaked benchmarks indicate that the power consumption numbers are better than or about the same as an R9 290X's, yet I can't see the need for such a cooling solution when it adds more cost and risk.

Because they don't want another fiasco on their hands like the R9 290 reference air cooler being as loud as a jet. They haven't had any good reference coolers in ages. They'll likely spin this as giving their board partners room to differentiate their offerings.
 
I love your optimism, but there's no such thing as high-end filler chips, and quick cost reductions. Not even for a company that's flush with money like AMD.

High end on 28nm, mid range on 16nm, perhaps. Perhaps they just took a 16nm design and put it on 28nm; I don't know. The timing of the chip seems really odd to me. I would think they'd want something new and powerful to drop alongside the VR headsets, not months before, yet they can't really afford to let NVidia have more time at the high end.
 
I never really understood why HBM 1.0 should be limited to 1GB per stack. It seems like something that should be purely a function of density, and independent of the revision.

For a card in this part of 2015, the concern may be more with the density listed for the parts announced as just being in production.
HBM has a 2KB page size and can address 2^16 row addresses, and at least for other DRAM types row and page are equivalent. That capacity per channel happens to give 1GB when accounting for two channels and a 4-high stack.
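A quick back-of-the-envelope check of that arithmetic (just restating the figures above, nothing new):

    page_size   = 2 * 1024              # 2KB page, treated as equivalent to a row
    rows        = 2**16                 # row addresses per channel
    per_channel = page_size * rows      # = 128MB per channel
    per_stack   = per_channel * 2 * 4   # two channels per die, 4-high stack -> 1GB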
Changing page size within the spec could be doable, as page size has gone up for the densest GDDR5 chips and there seems to be room to grow for HBM in terms of how many columns it can address per row and how many banks it can address.
Larger pages may have some undesirable effects. There was a proposed HBM gen2 mode that would attempt to reduce the impact of page size, and that size was also 2KB.

(source with address sizes and pseudo-channel mode http://www.memcon.com/pdfs/proceedings2014/NET104.pdf )

The time frame for HBM2 seems like it could cut into the first gen's longevity, but with time there may be a bump in density.
I thought initially there was more steam in the effort to get to what appeared to be a more compelling HBM2, but on further review I do not see anything I can interpret as a firm timeframe.

HBM2 is 4 or 8GB.. can someone enlighten me: do you need an 8192-bit bus for HBM2.0, or is it just a question of the density of the stacked dies (2Gb instead of 1)?
I have not seen HBM2 spelled out to the extent HBM has, but it appears to introduce more capability for managing more internal banks, a longer burst length, and a higher stack height. The bus width doesn't appear to have changed, since projections give it double the bandwidth while also doubling the data rate per IO.
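As a rough illustration, assuming the commonly quoted per-pin rates (about 1 Gbps for first-gen HBM, about 2 Gbps for HBM2) and the same 1024-bit interface per stack:

    bus_width = 1024                    # bits per stack, apparently unchanged
    hbm1_bw   = bus_width * 1e9 / 8     # ~128 GB/s per stack at ~1 Gbps per pin
    hbm2_bw   = bus_width * 2e9 / 8     # ~256 GB/s per stack at ~2 Gbps per pin

So the doubling falls out of the data rate alone; no 8192-bit bus is needed for the bandwidth, and the capacity increase comes from denser or taller stacks.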

Page 8 from AMD's PDF shows 8 stacks of HBM on GPU/APU interposer.
To be fair, that diagram has a Jaguar APU in the middle, so perhaps the rule of looking cool is in play.

HBM at ~512GB/s and Tonga style bandwidth efficiency should be in the region of 100% faster than Hawaii, if the overall design is balanced properly, e.g. with a count of say 96 CUs and 128 ROPs.
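Rough numbers, assuming four stacks at the commonly quoted ~128 GB/s each against Hawaii's 512-bit GDDR5 at 5 Gbps:

    fiji_bw   = 4 * 128               # = 512 GB/s
    hawaii_bw = 512 * 5 / 8           # = 320 GB/s
    ratio     = fiji_bw / hawaii_bw   # = 1.6x raw; Tonga-style compression would have to supply the rest of the ~2x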

Perhaps we have to conclude that AMD decided to go with HBM despite not having access to a node that's better than 28nm.
HBM itself wouldn't free up the power budget to make the doubled hardware fit without other alterations.
It may also be that if there was a time to monetize the work put into Gen1 HBM it would be this gen of GPUs. AMD could be obligated to use the memory in some capacity since it roped Hynix in, and at this point AMD has spun off a fair amount of high-speed IO expertise, which may mean alternatives could be limited.
Can GCN scale up to 96 CUs, or is 64 the practical limit? Can GCN's cache architecture scale up to HBM?
Is AMD planning to put L3/ROPs/buffer caches in logic at the base of each stack, which happens on the next process node?
I would say that at least theoretically, the hardware would be able to distribute itself well enough in terms of shader engines and L2 slices per controller. HBM's biggest differentiator is a larger number of channels, but at least with 4 GB it's 32 channels versus Hawaii's 16, which may not be insurmountable.
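For reference, assuming the usual channel widths (128-bit per HBM channel, 32-bit per GDDR5 channel):

    hbm_channels    = 4 * 8       # 4 stacks x 8 channels = 32 channels, 4096 bits total
    hawaii_channels = 512 // 32   # = 16 channels on the 512-bit GDDR5 bus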
The next process node, if it is one of the 16nm class, provides the hardware and power budget to do something with the spike in bandwidth. Doing something with the stacks themselves seems more tentative and long-term.

Maybe for AMD, HBM is cheaper, overall, than a traditional GDDR5 chip at this performance level. Seems unlikely though. Die area saved versus the cost of the interposer. Are AIBs going to eat the cost of HBM memory? Is 8GB of HBM going to be cheaper than 12GB of GDDR5?
I do not think the AIBs would want to absorb the cost directly, like they would for on-PCB DRAM. They wouldn't be the ones mounting the HBM on the interposer, so that package is someone else's problem.

Interposer + HBM are two big, risky changes that can't arrive independently in discrete graphics. Being stuck on 28nm is a nightmare. But it's not the first time that discrete graphics has been stuck on a process, so AMD knew it was likely.

Anyway, dropping-in HBM to compete against a non-HBM architecture seems like solving the wrong problem.
HBM effectively assumes interposer without something more exotic taking its place.
The magnitude of the slip seems larger. 28nm took its time, and the likely next process represents two nodes' worth of a wait.

Which looks even worse given that NVIDIA's reported TDPs aren't actually what any of their cards run at during peak.
I was under the impression that the reference 980s did keep to the marketed TDPs, but that the third-party cards were permitted to alter the limits and there's not much interest in discussing it.

But assuming that HBM is way more expensive than GDDR5 is unwarranted. The R&D costs were certainly large, but there's no indication that the fab costs are grievously higher. It's still silicon, the same material, presumably roughly the same patterning and so on, just with clever engineering to make it stack.
The shift to stacked memory is a very significant change, as is the interposer and an MCM-style integrated chip.
The silicon fabrication at a die level is well-understood.
The thinning needed for stacked DRAM, the TSVs, the cost of the interposer, and a significantly different set of thermal and mechanical concerns are a big change.
The manufacturing process is longer, there are additional manufacturing costs, and getting something wrong can compromise a lot more silicon than it used to.

GDDR5 is an established standard with very mundane requirements relative to HBM, and the cost-adder of the new memory may keep HBM more niche than even the comparatively small volumes GDDR5 has versus the really cheap standard memory types.
 
To summarize:

1 - 4096 ALUs, so 64 CUs at 1050MHz for 8.6 TFLOPS (quick arithmetic check below the list)
2 - "Up to" 8GB HBM confirmed through the usage of a dual-link. HBM2 is mentioned, perhaps for a refresh.
3 - Default cooler is air-only. Watercooled version is a "premium" product (wouldn't be surprised if the 8GB version is watercooled by default, for example).
4 - 50-65% faster than R9 290X in 4K resolutions
5 - Fiji is DX12 Tier 3, but Hawaii is not.
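Quick arithmetic check of item 1, using the usual 2 FLOPs per ALU per clock (FMA):

    alus   = 4096
    clock  = 1.05e9                    # 1050 MHz
    tflops = alus * 2 * clock / 1e12   # = 8.6 TFLOPS single precision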



Leaked slides from videocardz:

[five leaked slide images]
 
Wow. HBM is arriving sooner than I expected. I'm not well versed with stacks, but is there an issue with dual-link interposing? Nice to see that they categorized it as Tier 3 DX12 as well - not sure what feature level that is though.
 
Exciting times ahead. This GPU (like Titan) will probably be priced out of my range, but I think the potential is incredible for performance once HBM starts to migrate down to the midtier and lower and we're on a new process node.
 
"World's first discrete GPU with full DirectX 12 implementation"

Presumably then the architecture has changed somewhat from Tonga (Radeon R9 285)
 
I don't quite follow the graphic for the dual-link pic, which has four 2GB blocks sitting around a TSV rectangle, but that aside it seems like the idea would be to have two stacks share the same lines to the GPU.
That seems like additional iteration over the originally proposed standard, as point to point seemed to be the topology, although I do not know the entirety of it.
If there wasn't a chip-select built into the standard, what is serving as the select? Extra bits in column or bank commands?
 
News reported on GAF, sourced from Germany
http://www.neogaf.com/forum/showpost.php?p=156138001&postcount=333


"The German online outlet Heise received some new information at the Cebit (German high tech convention) today.
I find them highly reliable and I've heard similar things from other companies currently at Cebit.

- April was rumored for release of 380(X) but now partners say it seems highly unlikely
- only the 390 and the 390X might come with the new HBM.
- 390X supposed to need 300 Watts+

Prices and Performance

- R9 390X ~ 700$+
- R9 390 ~ 700$
- R9 380X (faster than 290X) ~ 400$
- R9 380 (faster than 290) ~ 330$
- R9 370 (approx. between R9 270X and R9 285) ~200$
- R7 360X ~ 150$
- R7 360 ~ 110$


source (German): Heise CEBIT News"
 
I hate to be that guy, but that's very specific. Does that imply it is not the first GPU overall with a full DirectX 12 implementation?
Perhaps Carrizo has the same GPU architecture and will be launched at the same event as Fiji.
Or perhaps Nvidia's Tegra X1 uses Maxwell 'gen3' and they are counting that.
 
Over $700 is really steep but if it beats Titan X then nVidia will be in trouble with their ridiculously-priced top-end graphics card.
 
I don't quite follow the graphic for the dual-link pic, which has four 2GB blocks sitting around a TSV rectangle, but that aside it seems like the idea would be to have two stacks share the same lines to the GPU.
That seems like additional iteration over the originally proposed standard, as point to point seemed to be the topology, although I do not know the entirety of it.
If there wasn't a chip-select built into the standard, what is serving as the select? Extra bits in column or bank commands?
I would think so.
The HBM spec prescribes 8 banks for the 1 and 2 Gbit per channel devices, going to 16 banks with 4 Gbit. Hynix could present their "dual link stacks" as 2 Gbit per channel devices to the memory controller, but with 16 banks to distinguish them from the normal "single link" stacks. The highest bit of the bank address would then be used by the logic base die to decide which stack to address. I guess using the bank address this way would minimize the side effects compared to using the column address.
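A purely illustrative sketch of that decode (hypothetical names, assuming the controller is shown 16 banks while each physical stack only implements 8):

    # On the shared logic base die of a hypothetical dual-link stack:
    # the controller issues a 4-bit bank address; the top bit selects
    # which physical stack receives the command, and the low 3 bits
    # pick one of the 8 real banks inside that stack.
    def route_command(bank_addr, cmd):
        stack_select  = (bank_addr >> 3) & 0x1   # highest bank address bit
        internal_bank = bank_addr & 0b111        # 8 banks per physical stack
        return stack_select, internal_bank, cmd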
 
5 - Fiji is DX12 Tier 3, but Hawaii is not.
Isn't the current information suggesting that all GCN parts support Tier 3 resource binding (17% of Steam DX12 hardware* supports it)? They just miss some of the other DX12 features which Fiji would have.

*DX12 hardware = hardware that supports DX12, even if it's limited to 11.x feature levels
 
I would think so.
The HBM spec prescribes 8 banks for the 1 and 2 Gbit per channel devices, going to 16 banks with 4 Gbit. Hynix could present their "dual link stacks" as 2 Gbit per channel devices to the memory controller, but with 16 banks to distinguish them from the normal "single link" stacks. The highest bit of the bank address would then be used by the logic base die to decide which stack to address. I guess using the bank address this way would minimize the side effects compared to using the column address.

Separate stacks could have some different behaviors relative to a single stack when a "channel" is actually split across two physically distinct chips. Some penalties might be worse if hopping from bank 7 to bank 8. I suppose there is some possible latency added with an additional set of logic intercepting commands and routing them to the appropriate stack or selectively masking them. On the other hand, some penalties could be better if there are two stacks concurrently able to process a single channel's latencies, like bank activation limits if the dual-linked setup doesn't have to fit in a single stack's electrical limits.

The standard would set down the footprint of the modules, though. Unless the base die is already much larger than the memory slices, would the stacks be considered custom on top of the HBM standard? Area-wise, this doesn't seem possible to fit in the same footprint.
 
It only says that Fiji is full dx12, not that Hawaii isn't tier 3 as well. Fiji would have to support the additional feature set of dx12 as well as being tier 3 to be the first fully dx12 tier 3 GPU.
 
Separate stacks could have some different behaviors relative to a single stack when a "channel" is actually split across two physically distinct chips. Some penalties might be worse if hopping from bank 7 to bank 8. I suppose there is some possible latency added with an additional set of logic intercepting commands and routing them to the appropriate stack or selectively masking them. On the other hand, some penalties could be better if there are two stacks concurrently able to process a single channel's latencies, like bank activation limits if the dual-linked setup doesn't have to fit in a single stack's electrical limits.

The standard would set down the footprint of the modules, though. Unless the base die is already much larger than the memory slices, would the stacks be considered custom on top of the HBM standard? Area-wise, this doesn't seem possible to fit in the same footprint.
As the separate stacks would sit in very close proximity on the same logic die, I wouldn't expect much of a timing difference. And the signals from the GPU have to run through the PHYs on the base die anyway. Routing the data and address lines to the TSV contacts of the second stack, just a few millimeters (the size of a 2 Gbit die) away, probably costs not much more time than the signals would need inside a larger 4 Gbit die if you address another bank there.

And the HBM spec doesn't specify the outer dimensions of the stacks, just the ballout of the base die, which one could keep.
 
It only says that Fiji is full dx12, not that Hawaii isn't tier 3 as well. Fiji would have to support the additional feature set of dx12 as well as being tier 3 to be the first fully dx12 tier 3 GPU.

But what's stopping GCN 1.1/1.2 from being 12_0? The quote might mean this is the first GPU that's both 12_1 and has tier 3 binding support (Maxwell 2 should only be tier 2). It could also just be poor PR (i.e. nothing has changed since Tonga, still 12_0 with tier 3). Who knows with feature levels...
 