AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

For completeness (and to give some actual Fiji info to this thread), I calculated the cost to spill N registers to memory on Fiji. In this case we assume a big shader that fills the whole GPU. All threads start executing roughly at the same time, and spill roughly at the same time to memory. We assume full occupancy on all CUs.

64 CUs * 40 waves/CU * 64 threads/wave * N registers/thread * 4 bytes/register = N * 640 kB.

As said earlier, spilling one register is covered by L1 cache, and spilling three (640 kB * 3 = 1920 kB) is covered by L2. However, this thrashes all L1 and L2 caches completely, so the GPU is certainly going to stall for a while, unless all the other data needed for the work is already in LDS.
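The arithmetic above can be checked with a rough sketch. The CU, wave, and thread counts come from the post; the cache sizes (16 kB of vector L1 per CU and 2 MB of shared L2) are my assumption for Fiji and are not stated in the post, and the function names are made up for illustration.

```python
# Spill footprint when every thread on a fully occupied GPU spills N registers.
CUS = 64
WAVES_PER_CU = 40
THREADS_PER_WAVE = 64
BYTES_PER_REGISTER = 4

# Assumed cache sizes (not given in the post): 16 kB vector L1 per CU, 2 MB L2.
L1_TOTAL_KB = CUS * 16   # 1024 kB across the whole chip
L2_TOTAL_KB = 2048


def spill_kb(n_registers):
    """Total bytes spilled by all threads, in kB (1 kB = 1024 bytes)."""
    total_bytes = (CUS * WAVES_PER_CU * THREADS_PER_WAVE
                   * n_registers * BYTES_PER_REGISTER)
    return total_bytes // 1024


def smallest_level_that_fits(n_registers):
    """Name the smallest level that could hold the whole spill at once."""
    size = spill_kb(n_registers)
    if size <= L1_TOTAL_KB:
        return "L1"
    if size <= L2_TOTAL_KB:
        return "L2"
    return "memory"


for n in (1, 2, 3, 4):
    print(n, spill_kb(n), "kB ->", smallest_level_that_fits(n))
```

This only compares sizes. It says nothing about what else is competing for those caches, which is the real problem described above.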

GCN also has 8kB of scalar registers (SGPR) per CU. This gives fast storage space for 51 extra registers per wave (one 32-bit value per 64 threads). This is the best way to store data that is constant across the thread group. Unfortunately PC graphics APIs do not expose the scalar unit registers directly. The compiler can take advantage of SGPRs in some specific cases, for example when it knows for sure (at compile time) that all threads in the wave would load data from the same address (a static constant buffer load, for example, or a buffer indexed load using SV_GroupID as the index). This is a great way to reduce VGPR pressure.
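A quick sketch of where the 51 comes from, assuming the 8kB of SGPR space is split evenly across the 40 waves of a fully occupied CU. The helper names are hypothetical, and the VGPR comparison simply applies the "one value per 64 threads" point above.

```python
SGPR_BYTES_PER_CU = 8 * 1024
WAVES_PER_CU = 40
THREADS_PER_WAVE = 64
BYTES_PER_REGISTER = 4


def sgprs_per_wave():
    """Whole 32-bit scalar registers available to each wave at full occupancy."""
    return SGPR_BYTES_PER_CU // (WAVES_PER_CU * BYTES_PER_REGISTER)


def bytes_per_wave(in_sgpr):
    """Storage one wave-uniform 32-bit value takes in an SGPR vs a VGPR."""
    if in_sgpr:
        return BYTES_PER_REGISTER                    # one copy per wave
    return BYTES_PER_REGISTER * THREADS_PER_WAVE     # one copy per thread


print(sgprs_per_wave())                  # 8192 // 160
print(bytes_per_wave(in_sgpr=True))
print(bytes_per_wave(in_sgpr=False))
```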
 
So at the end of the day, when using air cooling, once you're above a certain power threshold, the size of the final thing is dictated by the power consumption, not by the size of the PCB...
 
Well, looking at the cooler of this one, the Nano SKU must be hell of a binned part to be kept operational by a third of that.
 
That was discussed months ago :).


I've already forgotten whatever was different, except for the F16 ops (with built-in swizzling, but only for VOP2 instructions, so not MAD).

The scalar pipeline and scalar cache were promoted to support writes.
Compute context switching was mentioned as a generational change.
There are FP16 ops, but also sub-word addressing that can extract byte-sized data.

The limited cross-lane functionality AMD briefly mentioned is in the DPP format. I wondered at one point if this was re-using the non-storage LDS permute functionality, but the fixed set of combinations might be evidence of AMD partially exposing, through the encoding, control of the microcoded path that supports multicycle and quad-based operations like SAD. The more general permute might be wiring-constrained, so the space afforded by a separate section for the LDS logic covers the more complex cases.
There are a few wait state requirements that hint at this being off the primary vector pipeline, as read-after-write hazards exist. The latency between setting the execution mask and when the DPP path can use it is interestingly high, which could be a domain crossing.

There's also a raft of encoding changes and deletions/refactorings of instructions and supported formats, which shows why low-level APIs are not as low-level as some things could be.

GCN has been tweaked over time, with some cross-domain straddlings that lead to manual wait states. The less integrated, the longer or more restrictive the wait states. The biggest cases are the flat addressing modes, operations that mess with registers that are aliased with the scalar domain or context-level flags, round trips between vector and scalar, and DPP.
Flat addressing is interesting in that it needs to monitor both the LDS and vector memory counts, which are the two domains where the hardware is least able to paper over the disparities.

I'd like to imagine there are hints as to where this could be going, but reading these tea leaves is problematic.
LDS has had the margins of its functionality nibbled at over time, and at least in the compute field its charms are sometimes lost on developers.
The sub-word addressing opens up some data handling that could lead to lower-power operation in certain cases, but fully utilizing the registers creates a vector-like situation or an implicit set of batch sizes besides GCN's 64.
I think there are some wacky things that could be done with all these hints, but at the very least AMD hasn't promised much change with their next-gen power-efficiency beyond FinFET. I would think a number of refinements in these awkward corners would have shown up somewhere if they were being planned, with the proviso that projections can be vague.
 
Some hard-to-read text on the box lists the memory bandwidth as 450GB/sec.

It's definitely the same 512GB/s bandwidth as the Fury X, since it clearly states 500MHz memory on a 4096bit bus:

jV2OKeO.jpg
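The 512GB/s figure follows from the clock and bus width on the box, assuming the memory transfers data on both clock edges (double data rate); that factor of two is my assumption and is not printed in the text quoted here. A minimal sketch with a hypothetical helper:

```python
def bandwidth_gb_per_s(clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    transfers_per_clock=2 assumes double data rate signalling.
    """
    bits_per_second = clock_mhz * 1_000_000 * transfers_per_clock * bus_width_bits
    return bits_per_second / 8 / 1e9


print(bandwidth_gb_per_s(500, 4096))
```

With 500MHz and a 4096-bit bus this gives 512, not 450, so the smaller number on the box does not match the listed clock and bus width.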



Well, looking at the cooler of this one, the Nano SKU must be hell of a binned part to be kept operational by a third of that.

That's definitely not a reference cooler, though there might not even be one, as with the whole 300 series.
Regardless, the Nano is more like 2/3rds or 3/5ths of the Fury:

7YSzJ6f.jpg


Looks just as long as a 980Ti with an aftermarket cooler:

8OkK133.jpg
 
I rather like that they have kept the PCB length small: remove the cooler, put a good EK water block on it (already available)... I can imagine a dual-fan solution like the Asus DC2 or the MSI one will be shorter than this one.
 
If I'm seeing it correctly, that's the same cooler that allows the Zotac 980 Ti Omega-End-of-all-things-Edition to run at 1253 MHz base clock, thus 25% higher than the reference 980 Ti that the Fury X battles with:

No, this is the regular "AMP" card with a 1050MHz base clock and 1150MHz boost clock.
I'm guessing that, at the least, the fans in the other one run faster and louder, and the GPU is probably higher-binned and overvolted.


This Fury model that leaked uses Sapphire's Tri-X cooler, which the brand has been using for over 1.5 years on several AMD cards, including an overclocked R9 270X Pitcairn that consumes less than 200W:

http://www.legitreviews.com/amd-radeon-r9-270x-sapphire-toxic-r9-270x-review_125979



So don't take this particular model as an absolute requirement for the Fury's air cooler. There will probably be smaller cards.
 
If I'm seeing it correctly, that's the same cooler that allows the Zotac 980 Ti Omega-End-of-all-things-Edition to run at 1253 MHz base clock
What is that rectangular device on the back there, the one with the two red stripes running down the top of it - the mother of all supercaps maybe?
 