AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

It's common to see comments about how Nvidia and AMD (and Intel, back when people still cared about CPUs) are still tuning clocks etc. very shortly before release. And inevitably it's about how they can still go up.

Here's my take on this: I've never seen silicon speeds go up after the first weeks of bring up. They always go down: corner silicon doesn't perform as expected, false paths rear their ugly heads on some samples etc.

And second: going to mass production is a very drastic step with a lot of red tape. You do an initial trial production run and a larger volume trial run and you analyze all the failures. And, most important, you don't touch a single parameter. Definitely not clocks.

So always take those comments about clocks not being final with a great deal of salt: it's very likely all in the imagination of a writer who has no clue. Especially 2 weeks before launch, when all parameters should have been locked for many weeks.
 
GCN (like the VLIW chips) has scratch registers: registers that live in global memory. It's unclear whether they're cached. But the performance with them sucks horribly.
At least in this proposal, this tier of registers is physically adjacent to the ALUs which provides them with as much or possibly more bandwidth than the primary register file.
The wiring at that juncture might be too congested to get fancy enough to augment vector register bandwidth, something that a number of CPUs have done to make up for having fewer ports than their ALUs could consume at peak.
There could be some nice side benefits if there were a way to do this, besides power.
If one wavefront could successfully hide its accesses in the cache, the register file itself might be available for miscellaneous operations that need register access (LDS to VREG bypass, exports from VREG, etc.). The wavefront itself would not notice, but the CU overall might see better concurrency in getting movement on the other instruction queue types, or values from other domains, like scalar registers, might be able to be sourced more often after a move to the register cache/extended bypass.
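To put a rough number on how much such a tier could catch, here's a toy sketch: it just counts how often an operand's producer sits within the last few instructions, which is roughly the population a small compile-time-managed tier would hold. The trace and window size below are made up for illustration, not from any real shader.

    # Hypothetical sketch: what fraction of operand reads could a tiny near-ALU
    # register tier serve, if it held the results of the last `window` instructions?
    def capturable_reads(trace, window=6):
        recent = []                # dest registers of the last `window` instructions
        hits = total = 0
        for dest, srcs in trace:
            for s in srcs:
                total += 1
                if s in recent:
                    hits += 1      # producer is still near the ALU, main-RF read elided
            recent.append(dest)
            if len(recent) > window:
                recent.pop(0)
        return hits / total

    # Made-up trace: (destination, [sources]) per instruction
    trace = [(1, [10, 11]), (2, [1, 12]), (3, [1, 2]), (4, [3, 13]), (5, [4, 3])]
    print(capturable_reads(trace))   # 0.6 for this toy trace

Real shader traces would be the interesting input, of course.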

You could start here and work backwards:

Xbox One November SDK Leaked
One question I have is: if the off-chip mode is not an Xbox-specific feature, why has that option not been exercised? There are definitely clear disparities in bandwidth between solutions that provide rather close benchmark numbers.
GCN's memory pipeline is a philosophical example of the on-chip vs. off-chip dichotomy: an advanced functionality case that works trivially due to an unadventurous physical fixation, and an expensive, un-evolved fallback.

This may have come from the insistence that the CU arrays be so heavily decoupled, where movement to and from the fixed function domain is more of a straw than the compute domain is used to.
Nvidia implemented an interconnect that distributed this more freely. Possibly, their implementation is able to spawn DS instances and clone the necessary parameters and contexts, while being able to provide a stream from the tessellator to the cloned instances.
AMD does not seem to have this readily available, unless the DS CU is made so that it writes out all that data, and then the ostensibly elegant memory pipeline becomes the distributor. And then we find that this conventional memory system does not "push" data well, and the less-advanced cache and memory hierarchy can no longer be hidden.


It's common to see comments about how Nvidia and AMD (and Intel, back when people still cared about CPUs) are still tuning clocks etc. very shortly before release. And inevitably it's about how they can still go up.

Here's my take on this: I've never seen silicon speeds go up after the first weeks of bring up. They always go down: corner silicon doesn't perform as expected, false paths rear their ugly heads on some samples etc.

And second: going to mass production is a very drastic step with a lot of red tape. You do an initial trial production run and a larger volume trial run and you analyze all the failures. And, most important, you don't touch a single parameter. Definitely not clocks.

So always take those comments about clocks not being final with a great deal of salt: it's very likely all in the imagination of a writer who has no clue. Especially 2 weeks before launch, when all parameters should have been locked for many weeks.

In other cases, it may be that the source is operating at the end of a grapevine, where the rumor sites breathlessly report as breaking news events that have long since been resolved.
Whether this GPU will be considered mass-produced for the X SKU or not, all speculation has been for a solution that is running on the edge for power consumption.
AMD could be tweaking its turbo bins on silicon it has already validated on a range, or fixing its firmware. It may be that silicon never physically gets what is hoped for, but the complexity of the DVFS implementation--and possibly AMD flubbing this again (Jaguar to Kabini, Trinity to Richland, 7970 to 7970 GHz edition, Kaveri to Godavari, probably something in the 3xx series rebrand stack)-- could leave a lot of slack below that point.
Possibly "working on clocks" is gauging the highest speed bin AMD can get enough of.
 
At least in this proposal (better link: https://research.nvidia.com/publication/compile-time-managed-multi-level-register-file-hierarchy), this tier of registers is physically adjacent to the ALUs which provides them with as much or possibly more bandwidth than the primary register file.
The wiring at that juncture might be too congested to get fancy enough to augment vector register bandwidth, something that a number of CPUs have done to make up for having fewer ports than their ALUs could consume at peak.
There could be some nice side benefits if there were a way to do this, besides power.
The paper is years old. The baseline is NVidia's old, shockingly slow and inefficient compute units, with the absurd register vs. work-group scoreboarding and lots of other nonsense that NVidia has now abandoned. On GPUs whose compute performance and density were terrible anyway. It's called low hanging fruit.

AMD was using a register forwarding network (LRF in the paper) in the VLIW architecture. It is right there in the compiled code.

I'm unclear on whether there's such a network in GCN. It's certainly not explicit in the compiled code. It seems doubtful. (I'm not trying to suggest that LRF is all that's in the paper.)

I'm certainly not saying that AMD doesn't need to be careful with RF power. But it's worth remembering how simple all coherent RF accesses are in GCN, to the extent that there's no need to implement banking within the RF.

Indexing slows things down and almost certainly wastes power. Incoherently indexed registers are pretty rare in GPU code though.
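For anyone wondering what "incoherently indexed" means in practice: the cheap case is every lane reading the same register number in a given cycle, and indexing breaks that. A toy model of the difference follows (purely conceptual, not how the hardware is organized):

    # Toy model: RF as regs[register_number][lane].
    # Coherent access: one register number for all 64 lanes -> one wide read.
    # Indexed access: a per-lane register number -> up to 64 narrow reads.
    LANES = 64

    def coherent_read(regs, r):
        return regs[r]                      # a single 2048-bit row

    def indexed_read(regs, idx_per_lane):
        # each lane may name a different register, defeating the single wide access
        return [regs[idx_per_lane[lane]][lane] for lane in range(LANES)]

    regs = [[r * 1000 + lane for lane in range(LANES)] for r in range(16)]
    print(coherent_read(regs, 3)[:4])                                    # [3000, 3001, 3002, 3003]
    print(indexed_read(regs, [lane % 16 for lane in range(LANES)])[:4])  # [0, 1001, 2002, 3003]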

One question I have is: if the off-chip mode is not an Xbox-specific feature, why has that option not been exercised?
It has.
 
The paper is years old. The baseline is NVidia's old, shockingly slow and inefficient compute units, with the absurd register vs. work-group scoreboarding and lots of other nonsense that NVidia has now abandoned. On GPUs whose compute performance and density were terrible anyway. It's called low hanging fruit.
Maxwell has explicit marking for caching reused registers, allowing for at least some accesses to the register file to be elided.
This is also not an era where there is much low-hanging fruit left, and turning one's nose up for years at less-than-spectacular quick fixes gives us the power/performance matchup we see today.
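The exact organization of that reuse cache isn't public as far as I know, so here's only a hedged sketch of the idea, assuming one held value per source-operand slot, populated when the compiler sets a reuse flag on that slot:

    # Hedged sketch of an operand reuse cache: assume one held value per source
    # operand slot, filled when the compiler marks that slot for reuse.
    # (The real Maxwell organization may differ; this only shows the idea.)
    def rf_reads_with_reuse(instrs):
        held = {}                            # operand slot -> register currently held
        reads = 0
        for srcs in instrs:                  # srcs: list of (slot, reg, reuse_flag)
            for slot, reg, reuse in srcs:
                if held.get(slot) == reg:
                    pass                     # served from the reuse cache, no RF read
                else:
                    reads += 1               # RF read (and an RF bank) consumed
                held[slot] = reg if reuse else None
        return reads

    # FFMA-like sequence reusing the same A operand across consecutive instructions
    instrs = [
        [(0, 64, True), (1, 73, False), (2, 2, False)],
        [(0, 64, True), (1, 74, False), (2, 3, False)],   # slot-0 read of r64 elided
        [(0, 64, False), (1, 75, False), (2, 4, False)],  # elided again
    ]
    print(rf_reads_with_reuse(instrs))   # 7 RF reads instead of 9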

AMD was using a register forwarding network (LRF in the paper) in the VLIW architecture. It is right there in the compiled code.
The VLIW exposed a network that most CPUs have implicitly. As for whether GCN has it implicitly, I do not know. Unless a wavefront gets successive issue cycles, a plain bypassing of the data for an imminent register writeback would not work without a secondary location to hold it.
Some CPUs are capable of forwarding in more than one cycle, but those have more complex scheduling and bypass capability.

I'm certainly not saying that AMD doesn't need to be careful with RF power. But it's worth remembering how simple all coherent RF accesses are in GCN, to the extent that there's no need to implement banking within the RF.
There's still a power cost by virtue of its size being on the order of an L1 cache, which is something that will not scale if capacity rises. The goal of quadrupling capacity puts each CU's register file on the order of an L2 cache. Even if the transistor density doubles, that is more area, and interconnect scaling has been worse than transistor improvement at these geometries.
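For scale, under the usual reading of the GCN docs (256 VGPRs of 4 bytes across 64 lanes per SIMD, 4 SIMDs per CU; scalar registers ignored):

    # Back-of-envelope VGPR capacity per GCN CU
    vgprs, lanes, bytes_per_vgpr, simds = 256, 64, 4, 4
    per_simd = vgprs * lanes * bytes_per_vgpr          # 65,536 B = 64 KiB, L1-ish
    per_cu   = per_simd * simds                        # 256 KiB per CU today
    print(per_simd // 1024, per_cu // 1024, 4 * per_cu // 1024)   # 64 256 1024 KiB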

Off-chip allows multiple DS launches to be load-balanced across the chip, with a significant latency penalty and bandwidth cost. Is the latency so unhidable that the bandwidth range across the GCN lineup is not a notable influence on the synthetics?
 
Maxwell has explicit marking for caching reused registers, allowing for at least some accesses to the register file to be elided.
Is Maxwell RF banked?
This is also not an era where there is much low-hanging fruit left, and turning one's nose up for years at less-than-spectacular quick fixes gives us the power/performance matchup we see today.
As soon as someone demonstrates that pure compute is more power efficient on one or the other competing architecture, we can have some kind of discussion.
 
http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
Page 22: GDS global wave sync, ordered count ops.

I want these in DirectX and Vulkan soon. ROVs are nice, but I want these! :)
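For anyone who hasn't looked at the slide, my reading of what an ordered count buys you over a plain atomic add, in sketch form (the function and numbers are pure illustration): each wave gets its output offset in submission order, so appended results stay in API order.

    # Sketch of an "ordered count": offsets handed out in submission order rather
    # than in whatever order waves happen to reach the atomic.
    def ordered_append(waves_in_submission_order, counts):
        offset, result = 0, {}
        for wave in waves_in_submission_order:      # hardware enforces this ordering
            result[wave] = offset                   # base offset for this wave's outputs
            offset += counts[wave]
        return result

    print(ordered_append([0, 1, 2, 3], {0: 5, 1: 0, 2: 3, 3: 7}))
    # {0: 0, 1: 5, 2: 5, 3: 8} -> deterministic, order-preserving packing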

Yup! The GCN manual makes my mouth water.
It's a shame Microsoft didn't take the chance to actually allow inline assembly (or custom intrinsics) with 5.1, if they cannot agree on adding contemporary functionality and instructions (not even some which can be emulated, like GatherLevel()). For me it's a big disappointment.
 
Is Maxwell RF banked?
It has four banks, regID modulo 4 for its mapping, per https://github.com/NervanaSystems/maxas/wiki/SGEMM
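With that mapping, checking whether an instruction's sources collide on a bank is just the register numbers modulo 4, which is presumably what the scheduling tricks and reuse flags in that write-up are working around. A minimal check, with made-up register numbers:

    # Bank conflict check for a regID % 4 mapping (per the maxas SGEMM write-up)
    from collections import Counter

    def bank_conflicts(src_regs, banks=4):
        counts = Counter(r % banks for r in src_regs)
        return sum(c - 1 for c in counts.values() if c > 1)

    print(bank_conflicts([64, 73, 2]))   # banks 0,1,2 -> no conflict
    print(bank_conflicts([64, 68, 72]))  # all bank 0  -> 2 serialized extra reads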

As soon as someone demonstrates that pure compute is more power efficient on one or the other competing architecture, we can have some kind of discussion.
I do not have the sources necessary to tease out the conclusion from under all the confounding factors.

I find results like those from Anandtech's review suggestive: http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/20.
This is from a card whose TDP is perhaps 20-30% lower, with inferior bandwidth.

It is also not the case that optimizations to the ALU and data movement are a benefit that can ignore graphics loads.
The more rigid encoding and static scheduling are closer to the VLIW5/VLIW4 era, which AMD has admitted tends to do well in terms of performance and efficiency. Sure, it can make things hard for the shader compiler, but I don't know what to say, since AMD also has consistency issues, with a large amount of evidence that it is the worse of the two competitors.

Despite what I believe to be a less advanced and slower to respond DVFS implementation, Maxwell turbos more, sustains its clocks better, performs better, and has 50-100W to spare.
Maybe a few watts came from the register file optimization, a few from the writeback caching, a few from the improved primitive distribution, a few from the more evolved compression, a few from the static dependence information, a few from--and so on.
 
It has four banks, regID modulo 4 for its mapping, per https://github.com/NervanaSystems/maxas/wiki/SGEMM
That banking severely constrains register access patterns, which necessitates some kind of ORF or hardware managed operand cache.

GCN RF doesn't need banking because it's just 256 2048-bit registers.

My proposal for increasing GCN RF capacity is to have banks locked to hardware thread IDs. With 4 banks there would be a minimum of 4 hardware threads per SIMD. At maximum, 2 hardware threads per bank, which enables 32 hardware threads per CU, versus 40 in current GCN. Which also means no additional constraints on intra-thread register access patterns.

In theory this layout for registers would hide a load of the latency associated with RF<->memory operations, since there's significantly reduced contention on the RF between the ALUs and the memory ports. When RF<->memory operations are running, there's a worst-case 12.5% chance that they'll touch the RF bank that's currently feeding the ALUs. That's most likely to help texturing- and LDS-heavy kernels, I suppose, since they're both bursty in their RF interactions.

Without a model for power or timing or area, it's just wishful thinking though.
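Not a power/timing/area model, but just to make the bank assignment concrete (the 4 banks and 2 hardware threads per bank come from the proposal above, the rest is made up):

    # Rough sketch of the wave-locked banking idea: each hardware thread's registers
    # live entirely in one bank, chosen by thread ID.
    BANKS = 4
    THREADS_PER_SIMD = 8

    def bank_of(thread_id):
        return thread_id % BANKS

    # Intra-thread operand fetch only ever touches the thread's own bank, so no new
    # constraints are placed on which registers an instruction may combine.
    alu_thread = 0
    mem_thread = 5   # e.g. a texture return writing another thread's registers
    print(bank_of(alu_thread), bank_of(mem_thread))   # 0 vs 1: no port contention

    # Only one other resident thread shares the ALU thread's bank:
    sharers = [t for t in range(THREADS_PER_SIMD)
               if t != alu_thread and bank_of(t) == bank_of(alu_thread)]
    print(sharers)   # [4] -> memory traffic rarely lands on the bank feeding the ALUs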
 
That banking severely constrains register access patterns, which necessitates some kind of ORF or hardware managed operand cache.

GCN RF doesn't need banking because it's just 256 2048-bit registers.
Whether an architecture needs a cache to avoid bank conflicts is not the question that was being asked when the idea for a statically or dynamically populated cache or register tier was originally mooted. The reduced contention was a possible bonus.
The question was whether driving the bit lines of 64KB of SRAM on every access was energetically more expensive than driving them for 1-2KB for a subset of accesses. If not 64KB, would it start to appeal at 128KB?
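Very crudely framed, with an assumed square-root scaling of per-access energy with capacity (an illustration-only assumption, not a real SRAM model) and ignoring the cost of filling and managing the small tier:

    # Illustration-only model: per-access energy ~ sqrt(capacity), arbitrary units.
    def e(kb): return kb ** 0.5

    for rf_kb in (64, 128):
        for hit in (0.0, 0.4, 0.7):
            blended = hit * e(2) + (1 - hit) * e(rf_kb)
            print(rf_kb, hit, round(blended / e(rf_kb), 2))   # energy relative to always
                                                              # reading the big file
    # Larger RF -> bigger gap to the 2KB tier, so the same hit rate saves a bit more;
    # the ignored fill/management energy is what decides whether it appeals at all.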

As one potential data point for the direction the register files may take:
TSMC's 16nm FF+ process is apparently shifting to 512 bits per line, versus the current 256-bit SRAM scheme.
Whether that 256 has something to do with the 256 physical registers in current GCN, or it's a happy coincidence, I do not know. I feel that a design like a GPU, which tries to push the storage per mm² of its process to the limit, may have more than coincidence to thank for that correlation.
 
That banking severely constrains register access patterns, which necessitates some kind of ORF or hardware managed operand cache.

GCN RF doesn't need banking because it's just 256 2048-bit registers.

My proposal for increasing GCN RF capacity is to have banks locked to hardware thread IDs. With 4 banks there would be a minimum of 4 hardware threads per SIMD. At maximum, 2 hardware threads per bank, which enables 32 hardware threads per CU, versus 40 in current GCN. Which also means no additional constraints on intra-thread register access patterns.
I thought GCN had a distributed register file (256 4-byte registers x 4 interleaved sets per lane) from day one. As far as I know, lanes are isolated from each other, as reflected by the ISA design, and all the cross-lane operations are done through the LDS network (some without needing an allocation). With these, I don't see why the register file would be a huge collection of 2048-bit registers in hardware.
 
I don't know if anyone has the courage to do a size estimate based on this photo:

http://www.hardware.fr/medias/photos_news/00/47/IMG0047539_1.jpg

 
Well, it's the biggest GPU ever made by either AMD or ATI to date... from what I've seen at several different websites, this thing is easily at least 600mm²+.
 