AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I don't think frame buffer compression is as effective on HBM as on GDDR5, since the burst length is only 2 rather than (IIRC) 4 for GDDR5.
I do not think that would matter, since the bus traffic is not what is being compressed. Compression is about accesses saved, not about how the accesses travel over the bus. A normal GDDR5 transaction has a burst length of 8 and provides 32 bytes, which may satisfy a physical ROP cache line (we don't know its granularity), but we know it would take two of them to satisfy a vector cache request.
A 128-bit HBM burst with burst length 2 provides 32 bytes as well.
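A quick sanity check of those per-transaction sizes (the 32-bit GDDR5 device width, 128-bit HBM channel width, and burst lengths are per the respective DRAM specs):

```python
# Bytes delivered by one DRAM transaction = interface width * burst length.
def bytes_per_burst(bus_width_bits: int, burst_length: int) -> int:
    return bus_width_bits * burst_length // 8

print(bytes_per_burst(32, 8))    # GDDR5 device, BL8 -> 32 bytes
print(bytes_per_burst(128, 2))   # HBM channel, BL2  -> 32 bytes
```

So the access granularity ends up the same either way, which is the point above.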

No. HBM improves perf/W significantly for the memory interface, but the memory interface is only a small fraction of overall GPU power. The shader array is dominant, and its perf/W is unrelated to the memory interface.
The article's calculation is a straightforward 4096*2*1050 / 300W. The memory bus doesn't get to draw power from, or dissipate into, some budget that isn't on the card.
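Spelling out that arithmetic (the ALU count and clock are just the rumored Fiji figures the article uses, not confirmed specs):

```python
# Rumored figures only: 4096 ALUs, 2 FLOPs/clock (FMA), 1050 MHz, 300 W board.
alus, flops_per_clock, clock_mhz, board_w = 4096, 2, 1050, 300

gflops = alus * flops_per_clock * clock_mhz / 1000   # ~8601.6 GFLOPS
print(gflops / board_w)                              # ~28.7 GFLOPS/W
```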
 
Over the last 5(?) years, we've seen Fermi being competitive with vastly fewer GFLOPS than AMD. And we've seen Maxwell significantly outperforming Kepler with fewer GFLOPS as well, both still fewer than AMD. So, yeah, GFLOPS/W doesn't mean a lot for gaming performance. Still, 300W wouldn't be half bad for Fiji...

Kepler's raw ALU throughput is limited to ~75% utilization, anyway.
 
[…] Also, GFLOPS/watt is a pretty misleading measure of efficiency, since games aren't ALU bound.

For GCN, one wonders where the bottleneck is exactly. It's not texture filtering, because GCN is generally better than Kepler/Maxwell at it, for equivalent GPUs (same level of gaming performance). It's not raw ALU power for the same reason, and it's likely not fillrate either, because while Maxwell is a good bit better, Kepler is a little worse.

Geometry performance is higher on NVIDIA GPUs but I think we would have been able to pinpoint that as a common bottleneck if it were one. I think NVIDIA's scheduling might be more effective, but I don't really have anything to support that, apart from the elimination of other obvious suspects.
 
That ROP count would have less bandwidth per ROP than Hawaii and Tonga as we know them, and much less relative to Tahiti.
Frame buffer compression might be useful in this case.
I don't think we know the exact bandwidth of Fury yet? But even if it's "only" 512 GB/s, clearly 64 ROPs wouldn't be enough (Tahiti would have been in the same ballpark regarding ROPs per unit of bandwidth in this case, but it was also quite widely criticized as probably not having quite enough ROPs, plus it didn't have FBC). AMD also never had a configuration where the ROP count wasn't a power of two, so 96 is probably ruled out without some rearchitecting. And even with 128 ROPs and 512 GB/s, the ROPs-per-bandwidth figure would in fact still be lower than on any Maxwell GM2xx part (granted, the ROP capabilities are somewhat different).
That said, unlike Nvidia, AMD so far hasn't made parts with a vast excess of ROPs relative to rasterizer capability. Thus, this would imply the ability to rasterize 4x32 or 8x16 pixels per clock.
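For reference, rough GB/s-per-ROP figures for the parts being compared, as a sketch; the shipping bandwidth/ROP numbers are as I remember them, and both Fiji rows are rumored or outright hypothetical:

```python
# (GB/s, ROPs) -- shipping figures from memory; the Fiji entries are speculative.
parts = {
    "Tahiti (7970)":               (264, 32),
    "Hawaii (290X)":               (320, 64),
    "Tonga (285)":                 (176, 32),
    "GM204 (GTX 980)":             (224, 64),
    "Fiji, rumored 64 ROPs":       (512, 64),
    "Fiji, hypothetical 128 ROPs": (512, 128),
}
for name, (gbps, rops) in parts.items():
    print(f"{name:30s} {gbps / rops:5.2f} GB/s per ROP")
```

Even the hypothetical 128-ROP case still sits above GM204's ~3.5 GB/s per ROP, i.e. fewer ROPs relative to bandwidth than Maxwell.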
 
FWIW: I remember a whole bunch of people crying foul when the GTX 680 came out with 1GB less than the 7970. Or the GTX 780 with 1GB less than the 290X.

I think the amount of RAM on a GPU is very important, not for general game performance (at least medium term), but as a marketing talking point. AMD used to have the upper hand there; now it's Nvidia. It will soon change again with 8GB Hawaii, but it's complicated for Fiji.
Last month, I was in Fry's, next to the GPU section. I saw a guy with his GF selecting a GPU and talking to her about why he chose one over the other: it was entirely based on the amount of GB printed on the box. There's a large world out there like this that we have no idea about.
HBM has a 4096-bit memory bus; GDDR5 only has a 512-bit one
Well, to be blunt, that perf/W may not be a "bigger story than HBM"; rather, it stands to reason that it is the story of HBM. Perhaps...
The reason we see 300W is that HBM basically cuts memory power usage to half or less compared to GDDR5. That gives them even more power budget to work with, which means they can clock the chip higher / add more resources.
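Back-of-the-envelope, with made-up but plausible numbers purely to illustrate (the GDDR5 interface power is an assumption, not a measurement):

```python
# Illustrative assumption: a ~512 GB/s GDDR5 interface costs on the order of
# 40 W (DRAM + PHY), and HBM delivers the same bandwidth for half or less.
gddr5_mem_w = 40   # assumed
hbm_mem_w   = 20   # "half or less", per the post above
freed_w = gddr5_mem_w - hbm_mem_w
print(f"~{freed_w} W freed out of a 300 W board budget ({freed_w/300:.0%})")
```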
 
As the only meaningful exploration of GPU architectures these days is done by developers and they don't write articles, we're stuck.

Battlefield 4 with Mantle support is a straight demonstration that something in GCN is "broken". It's arguably not a simple "fundamental theoreticals" thing, like fillrate (would be so simple if it was) but seemingly more of a "how things are connected" problem. Memory hierarchy? Work distribution? Pipeline-stage load-balancing? ROP<->cache behaviour?
 
I don't think we know the exact bandwidth of Fury yet?
This is going by the table provided in the article. 1 Gbps to 1.2 Gbps has been listed as a possible range for early HBM production, but not for this GPU product specifically.
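Putting numbers on that range for the rumored 4096-bit (four-stack) configuration:

```python
# Aggregate bandwidth = bus width * per-pin data rate / 8.
bus_bits = 4096                                  # rumored four-stack total
for gbps_per_pin in (1.0, 1.2):
    print(bus_bits * gbps_per_pin / 8, "GB/s")   # 512.0 and 614.4 GB/s
```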

But even if it's "only" 512 GB/s, clearly 64 ROPs wouldn't be enough (Tahiti would have been in the same ballpark regarding ROPs per unit of bandwidth in this case, but it was also quite widely criticized as probably not having quite enough ROPs, plus it didn't have FBC).
One thing that was stated as being removed in the transition from Tahiti to Hawaii was a crossbar between the back ends and a few of the other memory controllers. Tahiti's presentation mentioned the possibility of localized bandwidth starvation at that level, even when global consumption had some to spare.
Hawaii had far more ROPs, and mildly improved bandwidth, but some flexibility may have been lost.
 
HBM has a 4096-bit memory bus; GDDR5 only has a 512-bit one
(pedantic mode on) HBM has a 1024-bit bus and GDDR5 has 32 bits. Not that this has anything to do with what I was talking about, or that I have a clue what point you're trying to get across...
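Putting numbers on the pedantry (the four-stack and 16-chip configurations are just the setups being assumed in this thread):

```python
# Headline bus widths are sums of narrower interfaces.
hbm_stack_bits  = 1024   # per HBM stack (8 x 128-bit channels)
gddr5_chip_bits = 32     # per GDDR5 device

print(4 * hbm_stack_bits)    # 4096-bit: rumored four-stack Fiji
print(16 * gddr5_chip_bits)  # 512-bit:  e.g. a 16-device Hawaii-style bus
```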
 
How easy is it to actually replace HBM with HBM2? Are any changes to Fiji's memory interface required?
Here is my Fury lineup prediction for 2016:
Greenland XT 8GB HBM2
Greenland PRO 8GB HBM2
Fiji XT 4GB HBM
Fiji PRO 4GB HBM

This might be the reason for the 300 series rebranding.
 
and GDDR5 has 32 bits
Isn't GDDR5 on GPUs 64 bits per channel?

It's arguably not a simple "fundamental theoreticals" thing, like fillrate (would be so simple if it was) but seemingly more of a "how things are connected" problem. Memory hierarchy? Work distribution? Pipeline-stage load-balancing? ROP<->cache behaviour?
I agree with this... recently I've been wondering if Maxwell's performance has a lot to do with their "tweaked" L1/shared-memory architecture in each SMM (that's what they're called now, right?).
 
As the only meaningful exploration of GPU architectures these days is done by developers and they don't write articles, we're stuck.

Battlefield 4 with Mantle support is a straight demonstration that something in GCN is "broken". It's arguably not a simple "fundamental theoreticals" thing, like fillrate (would be so simple if it was) but seemingly more of a "how things are connected" problem. Memory hierarchy? Work distribution? Pipeline-stage load-balancing? ROP<->cache behaviour?

Agreed on all points except Battlefield 4; I don't understand what you mean there.
 
I do not think that would matter, since the bus traffic is not what is being compressed. Compression is about accesses saved, not about how the accesses travel over the bus. A normal GDDR5 transaction has a burst length of 8 and provides 32 bytes, which may satisfy a physical ROP cache line (we don't know its granularity), but we know it would take two of them to satisfy a vector cache request.
A 128-bit HBM burst with burst length 2 provides 32 bytes as well.
Do you know what the tile size for compression is? You're right, though; I assumed that sub-cacheline / sub-maximum burst lengths were being taken advantage of to save memory cycles on the "left over" data of a compressed tile.
 
Isn't GDDR5 on GPUs 64 bits per channel?
The GDDR5 chip is 32 bits wide. And as we've seen with the 970, at least on Nvidia GPUs, they can be accessed with 32-bit granularity. I'm sure the reality is more complex than that, and that there are probably certain architectural features oriented towards 64 bits...
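One way to reconcile the two views, using Tahiti's 384-bit bus as the example (the 64-bit-per-controller grouping is how GCN's memory controllers are usually described, so treat it as an assumption here):

```python
# 32 bits per device vs 64 bits per memory-controller channel, same total bus.
total_bus_bits = 384         # e.g. Tahiti
print(total_bus_bits // 32)  # 12 GDDR5 devices
print(total_bus_bits // 64)  # 6 memory-controller channels
```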
 
Do you know what the tile size for compression is? You're right, though; I assumed that sub-cacheline / sub-maximum burst lengths were being taken advantage of to save memory cycles on the "left over" data of a compressed tile.
Not specifically. Papers on possible implementations mention sizes like 16x16 or 8x8, among others. 8x8 does align with a 64-element batch, but ROP arrays may operate on larger tiles.
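To relate those candidate tile sizes back to the 32-byte transactions discussed earlier (the 4-byte color format is just an illustrative assumption):

```python
# Uncompressed tile footprint vs 32-byte bursts, assuming 4 bytes per pixel.
bytes_per_pixel, burst_bytes = 4, 32
for tile in (8, 16):
    tile_bytes = tile * tile * bytes_per_pixel
    print(f"{tile}x{tile}: {tile_bytes} B = {tile_bytes // burst_bytes} bursts")
# 8x8   -> 256 B  = 8 bursts
# 16x16 -> 1024 B = 32 bursts
```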

Various DRAM flavors can employ a burst-chop mode, but depending on the standard it may cut the number of cycles on which data is transferred, not the number of cycles in the burst. DDR3's mode simply doesn't select data for the first 4 or last 4 beats, but the burst isn't any shorter.
The GDDR5 data sheets I've found didn't show that option, or any burst length other than 8.
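A toy model of the burst-chop point, reflecting the DDR3 behaviour described above (the 32-bit interface width, i.e. 4 bytes per beat, is just an illustrative assumption):

```python
# BC4 halves the beats that carry data, but the burst still occupies the same
# slot on the bus, so no bandwidth is recovered -- only some I/O activity.
BYTES_PER_BEAT = 4  # assumed 32-bit interface

def burst(data_beats: int, slot_beats: int = 8):
    return data_beats * BYTES_PER_BEAT, slot_beats  # (bytes moved, beats occupied)

print(burst(8))  # BL8: (32, 8)
print(burst(4))  # BC4: (16, 8) -> half the data, same bus time
```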
 