AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I don't think the memory portion of the power is the critical element, more the PHY, which is "on core" either way.
 
Both are, IMHO, more on the host's side, of course. After all, the clock rate on the DRAM cells in HBM is much lower than in GDDR5.
 
1TB/s bandwidth is going to be possible within 18 months. There's no way to build a single chip GPU with GDDR5 that's going to hit that bandwidth.
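A rough sketch of why GDDR5 struggles to reach that figure: it computes the bus width needed for a target bandwidth. The 7 Gb/s per-pin data rate is my assumption for fast GDDR5, and the function name is made up for illustration.

```python
import math

def required_bus_width_bits(target_gb_per_s, gbit_per_s_per_pin):
    """Minimum number of data pins needed to reach a target peak bandwidth."""
    return math.ceil(target_gb_per_s * 8 / gbit_per_s_per_pin)

# ASSUMPTION: 7 Gb/s per pin for GDDR5 (not a figure from this thread).
GDDR5_GBIT_PER_PIN = 7.0

width = required_bus_width_bits(1000, GDDR5_GBIT_PER_PIN)
print(width)  # 1143
```

Under that assumption a single GPU would need a bus of well over 1000 bits, more than twice the 512-bit interface on Hawaii.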

Of course one could argue that 1TB/s bandwidth is excessive. It appears that rendering algorithms are in the extreme profligacy era, due to plain brute force stupidity encouraged by dumb rasterisation.

The idea that each pixel per second requires about 2KB of data, source + intermediate + pixel, is just absurd (3840x2160 x 60 fps x 2000 = ~1,000,000,000,000) .
 
The idea that each pixel per second requires about 2KB of data, source + intermediate + pixel, is just absurd (3840x2160 x 60 fps x 2000 = ~1,000,000,000,000) .

Not so much. If you talk about one pass, yes. But a pixel is spoiled by a lot of data in the vicinity. 2KB is 20x25 pixel neighbourhood just once. Changing from gathering to scattering will change relatively little. Changing to progressive multi-resolution data is not trivial, but that would promise more savings.
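For anyone who wants to check the two figures being argued about, here is a quick sketch. The 4 bytes per pixel used for the neighbourhood is my assumption (e.g. 8-bit RGBA), and the function names are illustrative only.

```python
def traffic_bytes_per_second(width, height, fps, bytes_per_pixel):
    """Total memory traffic if every pixel of every frame costs bytes_per_pixel."""
    return width * height * fps * bytes_per_pixel

def neighbourhood_bytes(nx, ny, bytes_per_texel=4):
    """Bytes read when an nx-by-ny neighbourhood is fetched once.

    ASSUMPTION: 4 bytes per texel; the thread does not state a format.
    """
    return nx * ny * bytes_per_texel

total = traffic_bytes_per_second(3840, 2160, 60, 2000)
print(total)                        # 995328000000, i.e. roughly 1 TB/s
print(neighbourhood_bytes(20, 25))  # 2000
```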
 
Going by the old rules of thumb for Bytes/s per FLOP/s there are HPC workloads that would love that kind of bandwidth, and even that wouldn't bring it back to where some would like.

There are also plenty of compute examples that become rapidly bandwidth-limited without significant reworking, just to work around the coarse granularity and poor divergence handling.
 
This very much depends on the economics of 14nm vs 28nm, and the expected size of the laptop market that would like Pitcairn-level perf at much lower power use. If that market is small enough (just Apple?) then satisfying it with that 150mm² die severely underclocked like the Fury Nano might be a good idea. If it's bigger, reducing the manufacturing cost might be worth it.

In any case, the design + mask work per die type for 14nm will be much more than it was at 28nm, so it's sane to expect fewer designs even if that means being slightly less efficient per mm². A full lineup made of only two distinct dies (with plenty of harvested variants) might not be that insane.
I wonder if a "split" lineup for at least the first generation of 14/16 nm AMD GPUs makes sense, one like the following:

Code:
AMD 2016 GPU lineup (from Oland up)

 GDDR5        HBM2
————————————|——————————————————————————————
              HBM2 4 stacks [450-500 mm^2]
              HBM2 3 stacks [300-400 mm^2] 
 Hawaii
 Tonga
 Pitcairn     HBM2 1 stack [125-150 mm^2]
 Bonaire
 Cape Verde
 Oland

HBM2 chips show up in two places: at the top end where GDDR5 cannot reach and at a spot in the midrange. (The HBM2 3 stacks chip can be replaced with a 2 stacks chip or be eliminated entirely if we take the two die possibility. I am also assuming that the highest-end chip will have 4 stacks.) This spot targets MacBook Pro-type laptops which desire decent performance with small size and very low power consumption. Tonga and other parts remain an option for those whose first priority is low price.

Of course, this all assumes that the costs of 14/16 nm plus HBM2 are too high for the HBM2 1 stack chip to outright replace most of the range from Tonga to Cape Verde in both the laptop and desktop segments.
 
This spot targets MacBook Pro-type laptops which desire decent performance with small size and very low power consumption. Tonga and other parts remain an option for those whose first priority is low price.

I'm expecting that 14nm mid-range with a single HBM2 stack to actually have a performance similar to a full Tonga. A single stack drives 256GB/s at 500MHz, which is more than the 176GB/s that the R9 380 has, and very close to the 260GB/s that a 384bit Tonga would achieve.
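The comparison can be sketched from bus width and per-pin data rate. The per-pin rates below are my assumptions, picked so that the results land on the figures quoted in the post; only the bandwidth numbers themselves come from the thread.

```python
def peak_bandwidth_gb_per_s(bus_width_bits, gbit_per_s_per_pin):
    """Peak bandwidth in GB/s: pins times per-pin rate, divided by 8 bits per byte."""
    return bus_width_bits * gbit_per_s_per_pin / 8

# ASSUMPTIONS: 5.5 Gb/s per pin for GDDR5, 2 Gb/s per pin for one
# 1024-bit HBM2 stack.
tonga_256bit = peak_bandwidth_gb_per_s(256, 5.5)
tonga_384bit = peak_bandwidth_gb_per_s(384, 5.5)
hbm2_one_stack = peak_bandwidth_gb_per_s(1024, 2.0)

print(tonga_256bit)    # 176.0
print(tonga_384bit)    # 264.0
print(hbm2_one_stack)  # 256.0
```

With these inputs a 384-bit GDDR5 bus comes out at 264GB/s, close to the roughly 260GB/s mentioned for a full Tonga.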
 
I'm expecting that 14nm mid-range with a single HBM2 stack to actually have a performance similar to a full Tonga. A single stack drives 256GB/s at 500MHz, which is more than the 176GB/s that the R9 380 has, and very close to the 260GB/s that a 384bit Tonga would achieve.
I don't really expect the announced doubled frequency of HBM2 initially. Or at least I won't be surprised if it's just going to be somewhat higher than with HBM1. That said, such a chip should imho still be able to possibly replace both Tonga and Pitcairn.
(I don't think though replacing anything with Bonaire or below would be an option with such a chip, well maybe for mobile where AMD really needs all perf/w improvements they can possibly get.)
 
I've stated many times that I expect AMD to come up with architectural improvements for their next generation: they must have done *something* in the last 3 years. If all goes well, this new architecture will negate the requirement to use HBM, just the way it does for Nvidia.

Except relying on GDDR5 for their next generation would leave them behind Nvidia who are designing Pascal with HBM in mind.

http://blogs.nvidia.com/blog/2014/03/25/gpu-roadmap-pascal/

That blog entry doesn't mention HBM specifically but they mentioned at their GPU tech conference that they would be using HBM and not HMC.

So, if Nvidia are planning to use HBM for their Pascal lineup why would it be bad if AMD did it for their next lineup?

Basically, not using HBM would leave them at a competitive disadvantage.

Regards,
SB
 
I don't really expect the announced doubled frequency of HBM2 initially. Or at least I won't be surprised if it's just going to be somewhat higher than with HBM1. That said, such a chip should imho still be able to possibly replace both Tonga and Pitcairn.

Eeerm.. is HBM2's increased bandwidth related to frequencies alone? I thought it was simply 2x wider for using 2x more stacks.
Do you know where I can read about that?

(I don't think though replacing anything with Bonaire or below would be an option with such a chip, well maybe for mobile where AMD really needs all perf/w improvements they can possibly get.)
They could use underclocked+undervolted single-stack HBM2 for a mobile chip with a performance target of GM107/Bonaire.
 
Eeerm.. is HBM2's increased bandwidth related to frequencies alone? I thought it was simply 2x wider for using 2x more stacks.
Do you know where I can read about that?

HBM2 is supposed to clock (up to?) twice as high, that's the source of its increased bandwidth. There's a couple of slides here:
http://www.kitguru.net/components/g...mory-chips-opens-way-for-32gb-graphics-cards/

I think you can also use more stacks to further increase bandwidth, but obviously that's not free.
 
A single 8-stack HBM2 device could provide 8GB pool at 256GB/s throughput -- a perfect solution for an ultra-low power APU in a compact package.

So how far are we from laptops with only APU sockets and no memory slots?
 
Naah, Apple will just solder everything to prevent upgrades.

I meant sockets.
 
I guess more M.2 slots, but no more DIMMs.

Quad-core Zen, with 8xCU IGP and 8~16GB HBM2 on the SoC package, complemented by a speedy M.2 1TB SSD with an option for a second one.

Asus, take a note for your 2016-2017 laptop line, please! ;)
 
http://videocardz.com/55259/sk-hynix-shows-off-hbm1-and-hbm2

The sheet says HBM2 doubles the data rate, without touching the I/O prefetch, so it must be the interface clock going up.

p.s.: A single 8-stack HBM2 device could provide 8GB pool at 256GB/s throughput -- a perfect solution for an ultra-low power APU in a compact package.
The slide also says, HBM2 transfers 64 byte chunks of data instead of 32. So, something's fishy there.
 
Burst length can be set to 2 or 4. The command rate is sufficient to handle a full stream of reads at burst length 2, but the longer burst allows for spare time to sneak something in. The slides with HBM2's pseudo channel method rely on that.
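To put numbers on the 32 vs 64 byte question, here is a small sketch of access granularity as channel width times burst length. The 128-bit channel width (a 1024-bit stack split into 8 channels) and the 64-bit pseudo channel are my assumptions; neither is stated in this thread.

```python
def access_granularity_bytes(channel_width_bits, burst_length):
    """Bytes moved by one read or write burst on a single channel."""
    return channel_width_bits // 8 * burst_length

# ASSUMPTION: 128-bit channels, 64-bit pseudo channels.
print(access_granularity_bytes(128, 2))  # 32
print(access_granularity_bytes(128, 4))  # 64
print(access_granularity_bytes(64, 4))   # 32
```

If those widths are right, burst length 4 on a full channel would explain the 64 byte chunks on the slide, and a pseudo channel at burst length 4 would get back to 32 bytes.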
 
A single 8-stack HBM2 device could provide 8GB pool at 256GB/s throughput -- a perfect solution for an ultra-low power APU in a compact package.
Halve the Fiji GPU (32 CUs) + add four Zen cores (4 cores = 8 threads) on the same die + 8 GB of 256GB/s HBM2. Would make a perfect high end gaming laptop (with reduced GPU clocks to get better perf/watt). No discrete GPU required.

Btw. How do the HBM2 latencies compare to DDR3/DDR4? CPU part needs low latency.
 
Halve the Fiji GPU (32 CUs) + add four Zen cores (4 cores = 8 threads) on the same die + 8 GB of 256GB/s HBM2. Would make a perfect high end gaming laptop (with reduced GPU clocks to get better perf/watt). No discrete GPU required.

Btw. How do the HBM2 latencies compare to DDR3/DDR4? CPU part needs low latency.

I don't think we have a clear comparison between first-gen HBM and DDR3/4.
However, the overall organization of the DRAM hasn't changed, and there were statements to the effect that it's mostly a wash compared to GDDR5.
The PS4 and Xbox One latencies show similar latency contributions from DRAM between DDR3 and GDDR5.
The very large latency contribution that AMD's memory hierarchy has before going to DRAM is something AMD has indicated they are fixing with Zen, so I think it stands a good chance of improving even if HBM is somewhat slower.
 