AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

I don't think frame buffer compression is as effective on HBM as on GDDR5, since the burst length is only 2 rather than (IIRC) 4 for GDDR5.
I do not think that would matter, since the bus traffic is not what is being compressed. Compression is about accesses saved, not about how the accesses travel over the bus. A normal GDDR5 transaction has a burst length of 8 and provides 32 bytes, which may satisfy a physical ROP cache line (we don't know its granularity), but we know it would take two of them to satisfy a vector cache request.
A 128-bit HBM burst with burst length 2 provides 32 bytes as well.
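A quick sanity check of those per-transaction sizes (the 32-bit GDDR5 device width, 128-bit HBM channel width, and burst lengths are per the respective DRAM specs):

```python
# Bytes delivered by one DRAM transaction = interface width * burst length.
def bytes_per_burst(bus_width_bits: int, burst_length: int) -> int:
    return bus_width_bits * burst_length // 8

print(bytes_per_burst(32, 8))    # GDDR5 device, BL8 -> 32 bytes
print(bytes_per_burst(128, 2))   # HBM channel, BL2  -> 32 bytes
```

So the access granularity ends up the same either way, which is the point above.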

No. HBM improves perf/W significantly for the memory interface, but the memory interface is only a small fraction of overall GPU power. The shader array is dominant, and its perf/W is unrelated to the memory interface.
The article's calculation is a straightforward 4096*2*1050 / 300W. The memory bus doesn't get to draw power from, or dissipate into, some budget that isn't on the card.
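Spelling out that arithmetic (the ALU count and clock are just the rumored Fiji figures the article uses, not confirmed specs):

```python
# Rumored figures only: 4096 ALUs, 2 FLOPs/clock (FMA), 1050 MHz, 300 W board.
alus, flops_per_clock, clock_mhz, board_w = 4096, 2, 1050, 300

gflops = alus * flops_per_clock * clock_mhz / 1000   # ~8601.6 GFLOPS
print(gflops / board_w)                              # ~28.7 GFLOPS/W
```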
 
Over the last 5(?) years, we've seen Fermi being competitive with vastly fewer GFLOPS than AMD. And we've seen Maxwell significantly outperforming Kepler with fewer GFLOPS as well, both still fewer than AMD. So, yeah, GFLOPS/W doesn't mean a lot for gaming performance. Still, 300W wouldn't be half bad for Fiji...

Kepler's raw ALU throughput is limited to ~75% utilization, anyway.
 
[…] Also, GFLOPS/watt is a pretty misleading measure of efficiency, since games aren't ALU bound.

For GCN, one wonders where the bottleneck is exactly. It's not texture filtering, because GCN is generally better than Kepler/Maxwell at it, for equivalent GPUs (same level of gaming performance). It's not raw ALU power for the same reason, and it's likely not fillrate either, because while Maxwell is a good bit better, Kepler is a little worse.

Geometry performance is higher on NVIDIA GPUs but I think we would have been able to pinpoint that as a common bottleneck if it were one. I think NVIDIA's scheduling might be more effective, but I don't really have anything to support that, apart from the elimination of other obvious suspects.
 
That ROP count would have less bandwidth per ROP than Hawaii and Tonga as we know them, and much less relative to Tahiti.
Frame buffer compression might be useful in this case.
I don't think we know the exact bandwidth of Fury yet? But even if it's "only" 512 GB/s, clearly 64 ROPs wouldn't be enough (Tahiti would have been in the same ballpark regarding ROPs per unit of bandwidth in this case, but it was also quite widely criticized as probably not having quite enough ROPs, plus it didn't have FBC). AMD also never had a configuration where the ROP count wasn't a power of two, so 96 is probably ruled out without some rearchitecting. And even with 128 ROPs and 512 GB/s, the ROPs-per-bandwidth figure would in fact still be lower than on any Maxwell GM2xx part (granted, the ROP capabilities are somewhat different).
That said, unlike Nvidia, AMD so far hasn't made parts with a vast excess of ROPs relative to rasterizer capability. Thus, this would imply the ability to rasterize 4x32 or 8x16 pixels per clock.
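For reference, rough GB/s-per-ROP figures for the parts being compared, as a sketch; the shipping bandwidth/ROP numbers are as I remember them, and both Fiji rows are rumored or outright hypothetical:

```python
# (GB/s, ROPs) -- shipping figures from memory; the Fiji entries are speculative.
parts = {
    "Tahiti (7970)":               (264, 32),
    "Hawaii (290X)":               (320, 64),
    "Tonga (285)":                 (176, 32),
    "GM204 (GTX 980)":             (224, 64),
    "Fiji, rumored 64 ROPs":       (512, 64),
    "Fiji, hypothetical 128 ROPs": (512, 128),
}
for name, (gbps, rops) in parts.items():
    print(f"{name:30s} {gbps / rops:5.2f} GB/s per ROP")
```

Even the hypothetical 128-ROP case still sits above GM204's ~3.5 GB/s per ROP, i.e. fewer ROPs relative to bandwidth than Maxwell.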
 
FWIW: I remember a whole bunch of people crying foul when the GTX 680 came out with 1GB less than the 7970. Or the GTX 780 with 1GB less than the 290X.

I think the amount of RAM on a GPU is very important, not for general game performance (at least medium term), but as a marketing talking point. AMD used to have the upper hand there; now it's Nvidia. It will soon change again with 8GB Hawaii, but it's complicated for Fiji.
Last month, I was in Fry's, next to the GPU section. I saw a guy with his GF selecting a GPU and talking to her about why he chose one over the other: it was entirely based on the amount of GB printed on the box. There's a large world out there like this that we have no idea about.
HBM has a 4096-bit memory bus; GDDR5 only has a 512-bit one
Well, to be blunt, that perf/W may not be a "bigger story than HBM"; rather, it stands to reason that it is the story of HBM. Perhaps...
The reason we see 300W is that HBM basically cuts memory power usage to half or less compared to GDDR5. That gives them even more power budget to work with, which means they can clock the chip higher / add more resources.
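Back-of-the-envelope, with made-up but plausible numbers purely to illustrate (the GDDR5 interface power is an assumption, not a measurement):

```python
# Illustrative assumption: a ~512 GB/s GDDR5 interface costs on the order of
# 40 W (DRAM + PHY), and HBM delivers the same bandwidth for half or less.
gddr5_mem_w = 40   # assumed
hbm_mem_w   = 20   # "half or less", per the post above
freed_w = gddr5_mem_w - hbm_mem_w
print(f"~{freed_w} W freed out of a 300 W board budget ({freed_w/300:.0%})")
```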
 
As the only meaningful exploration of GPU architectures these days is done by developers and they don't write articles, we're stuck.

Battlefield 4 with Mantle support is a straight demonstration that something in GCN is "broken". It's arguably not a simple "fundamental theoreticals" thing, like fillrate (would be so simple if it was) but seemingly more of a "how things are connected" problem. Memory hierarchy? Work distribution? Pipeline-stage load-balancing? ROP<->cache behaviour?
 
I don't think we know the exact bandwidth of Fury yet?
This is going by the table provided in the article. 1 Gbps to 1.2 Gbps has been listed as a possible range for early HBM production, but not for this GPU product specifically.
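Putting numbers on that range for the rumored 4096-bit (four-stack) configuration:

```python
# Aggregate bandwidth = bus width * per-pin data rate / 8.
bus_bits = 4096                                  # rumored four-stack total
for gbps_per_pin in (1.0, 1.2):
    print(bus_bits * gbps_per_pin / 8, "GB/s")   # 512.0 and 614.4 GB/s
```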

But even if it's "only" 512 GB/s, clearly 64 ROPs wouldn't be enough (Tahiti would have been in the same ballpark regarding ROPs per unit of bandwidth in this case, but it was also quite widely criticized as probably not having quite enough ROPs, plus it didn't have FBC).
One thing that was stated as being removed in the transition from Tahiti to Hawaii was a crossbar between the back ends and a few of the other memory controllers. Tahiti's presentation mentioned the possibility of localized bandwidth starvation at that level, even when global consumption had some to spare.
Hawaii had far more ROPs, and mildly improved bandwidth, but some flexibility may have been lost.
 
HBM has a 4096-bit memory bus; GDDR5 only has a 512-bit one
(pedantic mode on) HBM has a 1024-bit bus and GDDR5 has 32 bits. Not that this has anything to do with what I was talking about, or that I have a clue what point you're trying to get across...
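Putting numbers on the pedantry (the four-stack and 16-chip configurations are just the setups being assumed in this thread):

```python
# Headline bus widths are sums of narrower interfaces.
hbm_stack_bits  = 1024   # per HBM stack (8 x 128-bit channels)
gddr5_chip_bits = 32     # per GDDR5 device

print(4 * hbm_stack_bits)    # 4096-bit: rumored four-stack Fiji
print(16 * gddr5_chip_bits)  # 512-bit:  e.g. a 16-device Hawaii-style bus
```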
 
How easy is it to actually replace HBM with HBM2? Are any changes to Fiji's memory interface required?
Here is my Fury lineup prediction for 2016:
Greenland XT 8GB HBM2
Greenland PRO 8GB HBM2
Fiji XT 4GB HBM
Fiji PRO 4GB HBM

This might be the reason for the 300 series rebranding.
 
and GDDR5 has 32 bits
Isn't GDDR5 on GPUs 64 bits per channel?

It's arguably not a simple "fundamental theoreticals" thing, like fillrate (would be so simple if it was) but seemingly more of a "how things are connected" problem. Memory hierarchy? Work distribution? Pipeline-stage load-balancing? ROP<->cache behaviour?
I agree with this... recently I've been wondering if Maxwell's performance has a lot to do with their "tweaked" L1/shared-memory architecture in each SMM (that's what they're called now, right?).
 
As the only meaningful exploration of GPU architectures these days is done by developers and they don't write articles, we're stuck.

Battlefield 4 with Mantle support is a straight demonstration that something in GCN is "broken". It's arguably not a simple "fundamental theoreticals" thing, like fillrate (would be so simple if it was) but seemingly more of a "how things are connected" problem. Memory hierarchy? Work distribution? Pipeline-stage load-balancing? ROP<->cache behaviour?

Agreed on all points except Battlefield 4; I don't understand what you mean there.
 
I do not think that would matter, since the bus traffic is not what is being compressed. Compression is about accesses saved, not about how the accesses travel over the bus. A normal GDDR5 transaction has a burst length of 8 and provides 32 bytes, which may satisfy a physical ROP cache line (we don't know its granularity), but we know it would take two of them to satisfy a vector cache request.
A 128-bit HBM burst with burst length 2 provides 32 bytes as well.
Do you know what the tile size for compression is? You're right, though; I assumed that sub-cacheline / sub-maximum burst lengths were being taken advantage of to save memory cycles on the "left over" data of a compressed tile.
 
Isn't GDDR5 on GPUs 64 bits per channel?
The GDDR5 chip is 32 bits wide. And as we've seen with the 970, at least on Nvidia GPUs, they can be accessed with 32-bit granularity. I'm sure the reality is more complex than that, and that there are probably certain architectural features oriented towards 64 bits...
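One way to reconcile the two views, using Tahiti's 384-bit bus as the example (the 64-bit-per-controller grouping is how GCN's memory controllers are usually described, so treat it as an assumption here):

```python
# 32 bits per device vs 64 bits per memory-controller channel, same total bus.
total_bus_bits = 384         # e.g. Tahiti
print(total_bus_bits // 32)  # 12 GDDR5 devices
print(total_bus_bits // 64)  # 6 memory-controller channels
```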
 
Do you know what the tile size for compression is? You're right, though; I assumed that sub-cacheline / sub-maximum burst lengths were being taken advantage of to save memory cycles on the "left over" data of a compressed tile.
Not specifically. Papers on possible implementations mention sizes like 16x16 or 8x8, among others. 8x8 does align with a 64-element batch, but ROP arrays may operate on larger tiles.
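To relate those candidate tile sizes back to the 32-byte transactions discussed earlier (the 4-byte color format is just an illustrative assumption):

```python
# Uncompressed tile footprint vs 32-byte bursts, assuming 4 bytes per pixel.
bytes_per_pixel, burst_bytes = 4, 32
for tile in (8, 16):
    tile_bytes = tile * tile * bytes_per_pixel
    print(f"{tile}x{tile}: {tile_bytes} B = {tile_bytes // burst_bytes} bursts")
# 8x8   -> 256 B  = 8 bursts
# 16x16 -> 1024 B = 32 bursts
```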

Various DRAM flavors can employ a burst-chop mode, but depending on the standard it may cut the number of cycles on which data is transferred, not the number of cycles in the burst. DDR3's mode simply doesn't select data for the first 4 or last 4 beats, but the burst isn't any shorter.
The GDDR5 data sheets I've found didn't show that option, or any burst length other than 8.
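A toy model of the burst-chop point, reflecting the DDR3 behaviour described above (the 32-bit interface width, i.e. 4 bytes per beat, is just an illustrative assumption):

```python
# BC4 halves the beats that carry data, but the burst still occupies the same
# slot on the bus, so no bandwidth is recovered -- only some I/O activity.
BYTES_PER_BEAT = 4  # assumed 32-bit interface

def burst(data_beats: int, slot_beats: int = 8):
    return data_beats * BYTES_PER_BEAT, slot_beats  # (bytes moved, beats occupied)

print(burst(8))  # BL8: (32, 8)
print(burst(4))  # BC4: (16, 8) -> half the data, same bus time
```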
 