AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

sebbbi · Jul 7, 2015

For completeness (and to give some actual Fiji info to this thread), I calculated the cost to spill N registers to memory on Fiji. In this case we assume a big shader that fills the whole GPU. All threads start executing roughly at the same time, and spill roughly at the same time to memory. We assume full occupancy on all CUs.

64 CUs * 40 waves/CU * 64 threads/wave * N registers/thread * 4 bytes/register = N * 640 kB.

As said earlier, spilling one register is covered by the L1 caches, and spilling three (640 kB * 3 = 1920 kB) is covered by L2. However, this thrashes all L1 and L2 caches completely, so the GPU is certainly going to stall for a while, unless all the other data needed for the work is already in LDS.
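The arithmetic above can be sketched in a few lines. The wave and register figures are the ones from this post; the cache capacities (16 KiB of L1 per CU and 2 MiB of shared L2) are my assumption and are not stated here, and the helper names are made up for illustration.

```python
# Wave and register figures come from the post above.
CUS = 64
WAVES_PER_CU = 40
THREADS_PER_WAVE = 64
BYTES_PER_REGISTER = 4

# Assumed cache sizes (not stated in the post): 16 KiB L1 per CU, 2 MiB shared L2.
L1_TOTAL_KIB = CUS * 16
L2_TOTAL_KIB = 2048


def spill_kib(n_registers):
    """Total spill footprint in KiB when every thread spills n_registers."""
    threads = CUS * WAVES_PER_CU * THREADS_PER_WAVE
    return threads * n_registers * BYTES_PER_REGISTER // 1024


def covering_level(n_registers):
    """Name the smallest level whose total capacity holds the whole spill."""
    size = spill_kib(n_registers)
    if size <= L1_TOTAL_KIB:
        return "L1"
    if size <= L2_TOTAL_KIB:
        return "L2"
    return "memory"


for n in range(1, 5):
    print(n, spill_kib(n), "KiB ->", covering_level(n))
```

This only compares total capacities. It ignores that the spill evicts everything else from the caches, which is the real cost described above.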

GCN also has 8kB of scalar registers (SGPR) per CU. This gives fast storage space for 51 extra registers per wave (one 32-bit value per 64 threads). This is the best way to store data that is constant across the thread group. Unfortunately, PC graphics APIs do not expose the scalar unit registers directly. The compiler can take advantage of SGPRs in some specific cases, for example when it knows for sure (at compile time) that all threads in the wave would load data from the same address (a static constant buffer load, for example, or a buffer indexed load using SV_GroupId as the index). This is a great way to reduce VGPR pressure.
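A quick sketch of where the 51 comes from, and why a wave-uniform value is so much cheaper in an SGPR. It uses only the numbers in this post (8kB of SGPRs per CU, 40 waves per CU, 64 threads per wave); the function names are hypothetical.

```python
SGPR_BYTES_PER_CU = 8 * 1024   # from the post
BYTES_PER_REGISTER = 4         # one 32-bit value
WAVES_PER_CU = 40              # full occupancy, as assumed in the post
THREADS_PER_WAVE = 64


def sgprs_per_wave():
    """Scalar registers available to each wave at full occupancy."""
    total_sgprs = SGPR_BYTES_PER_CU // BYTES_PER_REGISTER
    return total_sgprs // WAVES_PER_CU


def uniform_value_cost_bytes(in_sgpr):
    """Storage one wave-uniform 32-bit value takes per wave."""
    if in_sgpr:
        return BYTES_PER_REGISTER                     # stored once per wave
    return BYTES_PER_REGISTER * THREADS_PER_WAVE      # replicated per thread in a VGPR


print(sgprs_per_wave())
print(uniform_value_cost_bytes(in_sgpr=False), "bytes in a VGPR vs",
      uniform_value_cost_bytes(in_sgpr=True), "bytes in an SGPR")
```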

sebbbi · Jul 7, 2015

GCN3 ISA document: http://amd-dev.wpengine.netdna-cdn..../07/AMD_GCN3_Instruction_Set_Architecture.pdf

The best source for architectural differences in Fiji/Tonga compared to older GCN chips. I have already found some interesting improvements.

CarstenS · Jul 7, 2015

AlNets said:
Yup. Seemingly already in production (alongside 8Gbit density).

http://www.samsung.com/global/busin...ases/detail?cateSearchParam=N001&newsId=13921

As well as Hynix.
https://www.skhynix.com/inc/pdfDownload.jsp?path=/datasheet/Databook/Databook_Q3'2015_Graphics.pdf

mczak · Jul 7, 2015

sebbbi said:
GCN3 ISA document: http://amd-dev.wpengine.netdna-cdn..../07/AMD_GCN3_Instruction_Set_Architecture.pdf

That was discussed months ago.

sebbbi said:
The best source for architectural differences in Fiji/Tonga compared to older GCN chips. I have already found some interesting improvements

I've already forgotten whatever was different, except for the F16 ops (with built-in swizzling, but only for VOP2 instructions, so not MAD).

CarstenS · Jul 7, 2015

So, Fury non-X seemingly uses the same PCB as Fury X only with a larger cooler sticking out of the back.
http://www.pcgameshardware.de/AMD-R.../News/Fiji-Pro-mit-3584-und-1000-MHz-1164062/
Leak source:
http://videocardz.com/57078/exclusive-sapphire-radeon-r9-fury-pictured-specifications-confirmed

silent_guy · Jul 7, 2015

So at the end of the day, when using air cooling, once you're above a certain power threshold, the size of the final thing is dictated by the power consumption, not by the size of the PCB...

spworley · Jul 7, 2015

CarstenS said:
So, Fury non-X seemingly uses the same PCB as Fury X only with a larger cooler sticking out of the back.

Some hard-to-read text on the box lists the memory bandwidth as 450GB/sec.

fellix · Jul 7, 2015

Well, looking at the cooler of this one, the Nano SKU must be a hell of a binned part to be kept operational by a third of that.

3dilettante · Jul 7, 2015

mczak said:
That was discussed months ago.

I've already forgotten whatever was different, except for the F16 ops (with built-in swizzling, but only for VOP2 instructions, so not MAD).

The scalar pipeline and scalar cache were promoted to support writes.
Compute context switching was mentioned as a generational change.
There are FP16 ops, but also sub-word addressing that can extract byte-sized data.

The limited cross-lane functionality AMD briefly mentioned is in the DPP format. I wondered at one point if this was re-using the non-storage LDS permute functionality, but the fixed set of combinations might be evidence of AMD partially exposing, through the encoding, control of the microcoded path that supports multicycle and quad-based operations like SAD. The more general permute might be wiring constrained, so the space afforded a separate section for the LDS logic covers the more complex cases.
There are a few wait state requirements that hint at this being off the primary vector pipeline, as read-after-write hazards exist. The latency between setting the execution mask and when the DPP path can use it is interestingly high, which could be a domain crossing.

There's also a raft of encoding changes and deletions/refactorings of instructions and supported formats, which shows why low-level APIs are not as low-level as some things could be.

GCN has been tweaked over time, with some cross-domain straddlings that lead to manual wait states. The less integrated, the longer or more restrictive the wait states. The biggest cases are the flat addressing modes, operations that mess with registers that are aliased with the scalar domain or context-level flags, round trips between vector and scalar, and DPP.
Flat addressing is interesting in that it needs to monitor both the LDS and vector memory counts, which are the two domains whose disparities the hardware is least able to paper over.

I'd like to imagine there are hints as to where this could be going, but reading these tea leaves is problematic.
LDS has had the margins of its functionality nibbled at over time, and at least in the compute field its charms are sometimes lost on developers.
The sub-word addressing opens up some data handling that could lead to lower-power operation in certain cases, but fully utilizing the registers creates a vector-like situation or an implicit set of batch sizes besides GCN's 64.
I think there are some wacky things that could be done with all these hints, but at the very least AMD hasn't promised much change with their next-gen power-efficiency beyond FinFET. I would think a number of refinements in these awkward corners would have shown up somewhere if they were being planned, with the proviso that projections can be vague.

Deleted member 13524 · Jul 7, 2015

spworley said:
Some hard-to-read text on the box lists the memory bandwidth as 450GB/sec.

It's definitely the same 512GB/s bandwidth as the Fury X, since it clearly states 500MHz memory on a 4096bit bus:
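As a rough check on that figure, here is a minimal sketch of the peak-bandwidth arithmetic. The 500MHz clock and 4096-bit bus are from the box; the two transfers per clock (double data rate) is my assumption about how the memory signals, and the function name is made up.

```python
def peak_bandwidth_gb_s(clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak in decimal GB/s.

    transfers_per_clock=2 assumes double-data-rate signalling.
    """
    bits_per_second = clock_mhz * 1_000_000 * transfers_per_clock * bus_width_bits
    return bits_per_second / 8 / 1_000_000_000


print(peak_bandwidth_gb_s(500, 4096))
```

This is a theoretical ceiling only; it says nothing about what a benchmark will actually sustain.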

fellix said:
Well, looking at the cooler of this one, the Nano SKU must be a hell of a binned part to be kept operational by a third of that.

That's definitely not a reference cooler, though there might not even be one, like all the 300 series.
Regardless, the Nano is more like 2/3rds or 3/5ths of the Fury:

Looks just as long as a 980Ti with an aftermarket cooler:

Grall · Jul 7, 2015

silent_guy said:
So at the end of the day, when using air cooling, once you're above a certain power threshold, the size of the final thing is dictated by the power consumption, not by the size of the PCB...

Yes, naturally, although having no PCB at all blocking the flow of that rearmost fan ought to do wonders for overall cooling efficiency.

CarstenS · Jul 7, 2015

spworley said:
Some hard-to-read text on the box lists the memory bandwidth as 450GB/sec.

Same mistake (and same box) as with retail Fury X?

lanek · Jul 7, 2015

I rather like that they have kept the short PCB length: remove the cooler and put a good EK water block on it (already available)... I can imagine a dual-fan solution like the Asus DC2 or the MSI one will be shorter than this one.

CarstenS · Jul 7, 2015

ToTTenTranz said:
Looks just as long as a 980Ti with an aftermarket cooler:

If I see it correctly, that's the same cooler that allows the Zotac 980 Ti Omega-End-of-all-things-Edition to run at a 1253 MHz base clock, thus 25% higher than the reference 980 Ti that the Fury X battles with:

Correct?

So, should we be in for a battle of awesome graphics power?

Deleted member 13524 · Jul 7, 2015

CarstenS said:
If I see it correctly, that's the same cooler that allows the Zotac 980 Ti Omega-End-of-all-things-Edition to run at a 1253 MHz base clock, thus 25% higher than the reference 980 Ti that the Fury X battles with:

No, this is the regular "AMP" card with a 1050MHz base clock and 1150MHz boost clock.
I'm guessing the fans on the other one at least run faster and louder, and the GPU is probably higher-binned and overvolted.

This Fury model that leaked uses Sapphire's Tri-X cooler, which the brand has been using for over 1.5 years on several AMD cards, including an overclocked R9 270X Pitcairn that consumes less than 200W:

http://www.legitreviews.com/amd-radeon-r9-270x-sapphire-toxic-r9-270x-review_125979

So don't take this particular model as an absolute requirement for the Fury's air cooler. There will probably be smaller cards.

CarstenS · Jul 7, 2015

Correct, the cooler of the monster is beefier:
http://www.zotac.com/products/graph...ime/order/DESC/amount/10/section/gallery.html
http://www.zotac.com/products/graph...ime/order/DESC/amount/10/section/gallery.html

But apparently not much longer, which is what this discussion was about.

Grall · Jul 7, 2015

CarstenS said:
If I see it correctly, that's the same cooler that allows the Zotac 980 Ti Omega-End-of-all-things-Edition to run at a 1253 MHz base clock

What is that rectangular device on the back there, the one with the two red stripes running down the top of it - the mother of all supercaps maybe?

CarstenS · Jul 7, 2015

Probably, something that's supposed to reach Ub0r-1337 OC.

spworley · Jul 7, 2015

CarstenS said:
Same mistake (and same box) as with retail Fury X?

Has anyone actually measured and confirmed Fury X's bandwidth as 512GB/sec? Tech Report measured Fury X's random texture sampling bandwidth as 333 GB/sec, but that's a lower bound, not an estimate, of the full bandwidth. Has any other benchmark tool verified the full 512, perhaps with an OpenCL bandwidth test?

fellix · Jul 7, 2015

spworley said:
Has anyone actually measured and confirmed Fury X's bandwidth as 512GB/sec? Tech Report measured Fury X's random texture sampling bandwidth as 333 GB/sec, but that's a lower bound, not an estimate, of the full bandwidth. Has any other benchmark tool verified the full 512, perhaps with an OpenCL bandwidth test?

You can't expect sustained performance at 99% of the theoretical figure for any kind of DRAM at this time.
