AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Discussion in 'Architecture and Products' started by iMacmatician, Apr 10, 2014.

  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
It's common to see comments about how Nvidia and AMD (and Intel, back when people still cared about CPUs) are still tuning clocks etc. very shortly before release. And inevitably it's about how they can still go up.

    Here's my take on this: I've never seen silicon speeds go up after the first weeks of bring up. They always go down: corner silicon doesn't perform as expected, false paths rear their ugly heads on some samples etc.

And second: going to mass production is a very drastic step with a lot of red tape. You do an initial trial production run and a larger volume trial run and you analyze all the failures. And, most importantly, you don't touch a single parameter. Definitely not clocks.

So always take those comments about clocks not being final with a great deal of salt: it's very likely all in the imagination of a writer who has no clue. Especially 2 weeks before launch, when all parameters should have been locked for many weeks.
     
    Kej, Tokelil, Lightman and 3 others like this.
  2. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,549
    http://www.hardwareluxx.com/index.p...ka-fury-x-slower-than-geforce-gtx-980-ti.html
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    At least in this proposal, this tier of registers is physically adjacent to the ALUs which provides them with as much or possibly more bandwidth than the primary register file.
    The wiring at that juncture might be too congested to get fancy enough to augment vector register bandwidth, something that a number of CPUs have done to make up for having fewer ports than their ALUs could consume at peak.
    There could be some nice side benefits if there were a way to do this, besides power.
    If one wavefront could successfully hide its accesses in the cache, the register file itself might be available for miscellaneous operations that need register access (LDS to VREG bypass, exports from VREG, etc.). The wavefront itself would not notice, but the CU overall might see better concurrency in getting movement on the other instruction queue types, or values from other domains, like scalar registers, might be able to be sourced more often after a move to the register cache/extended bypass.

    One question I have is that if the off-chip mode is not an Xbox-specific feature, why has that option not been exercised? There are definitely clear disparities in bandwidth between solutions that provide rather close benchmark numbers.
    GCN's memory pipeline is something of a philosophical example of the on-chip vs. off-chip dichotomy: an advanced-functionality case that works trivially thanks to an unadventurous physical fixation, and an expensive, un-evolved fallback.

    This may have come from the insistence that the CU arrays be so heavily decoupled, where movement to and from the fixed function domain is more of a straw than the compute domain is used to.
    Nvidia implemented an interconnect that distributed this more freely. Possibly, their implementation is able to spawn DS instances and clone the necessary parameters and contexts, while being able to provide a stream from the tessellator to the cloned instances.
    AMD does not seem to have this readily available, unless the DS CU is made so that it writes out all that data, and then the ostensibly elegant memory pipeline becomes the distributor. And then we find that this conventional memory system does not "push" data well, and the less-advanced cache and memory hierarchy are now unable to be hidden.


    In other cases, it may be that the source is operating at the end of a grapevine, where the rumor sites breathlessly report as breaking news events that have long since been resolved.
    Whether this GPU will be considered mass-produced for the X SKU or not, all speculation has been for a solution that is running on the edge for power consumption.
    AMD could be tweaking its turbo bins on silicon it has already validated on a range, or fixing its firmware. It may be that silicon never physically gets what is hoped for, but the complexity of the DVFS implementation--and possibly AMD flubbing this again (Jaguar to Kabini, Trinity to Richland, 7970 to 7970 GHz edition, Kaveri to Godavari, probably something in the 3xx series rebrand stack)-- could leave a lot of slack below that point.
    Possibly "working on clocks" is gauging the highest speed bin AMD can get enough of.
     
  4. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
  5. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,535
    Likes Received:
    144
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Better link:

    https://research.nvidia.com/publication/compile-time-managed-multi-level-register-file-hierarchy

    The paper is years old. The baseline is NVidia's old, shockingly slow and inefficient compute units, with the absurd register-vs-workgroup scoreboarding and lots of other nonsense that NVidia has now abandoned. On GPUs whose compute performance and density were terrible anyway. It's called low-hanging fruit.

    AMD was using a register forwarding network (LRF in the paper) in the VLIW architecture. It is right there in the compiled code.

    I'm unclear on whether there's such a network in GCN. It's certainly not explicit in the compiled code. It seems doubtful. (I'm not trying to suggest that LRF is all that's in the paper.)

    I'm certainly not saying that AMD doesn't need to be careful with RF power. But it's worth remembering how simple all coherent RF accesses are in GCN, to the extent that there's no need to implement banking within the RF.

    Indexing slows things down and almost certainly wastes power. Incoherently indexed registers are pretty rare in GPU code though.

    It has.
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Maxwell has explicit marking for caching reused registers, allowing for at least some accesses to the register file to be elided.
    This is also not an era where there is much low-hanging fruit left, and turning one's nose up for years at less-than-spectacular quick fixes gives us the power/performance matchup we see today.

    The VLIW exposed a network that most CPUs have implicitly. As for whether GCN has it implicitly, I do not know. Unless a wavefront gets successive issue cycles, a plain bypassing of the data for an imminent register writeback would not work without a secondary location to hold it.
    Some CPUs are capable of forwarding in more than one cycle, but those have more complex scheduling and bypass capability.

    There's still a power cost by virtue of its size being on the order of an L1 cache, which is something that will not scale if capacity rises. The goal of quadrupling capacity puts each CU's register file on the order of an L2 cache. Even if transistor density doubles, that is more area, and interconnect scaling has been worse than transistor improvement at these geometries.
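    For scale, the arithmetic behind the "order of an L2" remark can be jotted down; this uses the publicly documented GCN figures, and the 4x multiplier is just the quadrupling goal mentioned above:

    ```python
    # Back-of-envelope sizing of a GCN CU's vector register file.
    regs_per_simd = 256        # vector registers per SIMD
    lanes = 64                 # register lanes per wavefront
    bytes_per_reg = 4          # 32-bit registers
    simds_per_cu = 4

    kb_per_simd = regs_per_simd * lanes * bytes_per_reg // 1024   # 64 KB
    rf_per_cu_kb = kb_per_simd * simds_per_cu                     # 256 KB today
    print(kb_per_simd, rf_per_cu_kb, rf_per_cu_kb * 4)            # 64 256 1024
    ```

    So a quadrupled register file lands at 1 MB per CU, which is indeed L2-cache territory.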

    Off-chip allows multiple DS launches to be load-balanced across the chip, with a significant latency penalty and bandwidth cost. Is the latency so unhidable that the bandwidth range across the GCN lineup is not a notable influence on the synthetics?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Is Maxwell RF banked?
    As soon as someone demonstrates that pure compute is more power efficient on one or the other competing architecture, we can have some kind of discussion.
     
  9. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    948
    Likes Received:
    417
    Yup! The GCN manual makes my mouth water.
    It's a shame Microsoft didn't take the chance to actually allow inline assembly (or custom intrinsics) with 5.1, if they cannot agree on adding contemporary functionality and instructions (not even ones which can be emulated, like GatherLevel()). For me it's a big disappointment.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It has four banks, regID modulo 4 for its mapping, per https://github.com/NervanaSystems/maxas/wiki/SGEMM
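    As a toy illustration of what that mapping implies — assuming only the bank = regID % 4 rule from that writeup; the helper names here are made up:

    ```python
    # Maxwell-style operand banking, modeled per the maxas SGEMM wiki:
    # a register's bank is its ID modulo 4. An instruction reading two
    # source operands from the same bank in the same cycle would need an
    # extra cycle unless a reuse/operand cache supplies one of them.
    def bank(reg_id: int) -> int:
        return reg_id % 4

    def has_bank_conflict(src_regs) -> bool:
        banks = [bank(r) for r in src_regs]
        return len(banks) != len(set(banks))

    # An FFMA reading R1, R5, R0: R1 and R5 both land in bank 1 -> conflict.
    print(has_bank_conflict([1, 5, 0]))   # True
    print(has_bank_conflict([0, 1, 2]))   # False
    ```

    This is why register allocation (and the reuse flags Maxwell exposes) matter so much in that SGEMM writeup: conflict avoidance is the compiler's problem.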

    I do not have the sources necessary to tease out the conclusion from under all the confounding factors.

    I find results like those from Anandtech's review suggestive: http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/20.
    This is from a card whose TDP is perhaps 20-30% lower, with inferior bandwidth.

    It is also not the case that optimizations to the ALU and data movement are a benefit that can ignore graphics loads.
    The more rigid encoding and static scheduling are closer to the VLIW5/VLIW4 era, which AMD has admitted tended to do well in terms of performance and efficiency. Sure, it can make things hard for the shader compiler, but I don't know what to say, since AMD also has consistency issues, with a large amount of evidence that it is the worse of the two competitors.

    Despite what I believe to be a less advanced and slower to respond DVFS implementation, Maxwell turbos more, sustains its clocks better, performs better, and has 50-100W to spare.
    Maybe a few watts came from the register file optimization, a few from the writeback caching, a few from the improved primitive distribution, a few from the more evolved compression, a few from the static dependence information, a few from--and so on.
     
    Grall likes this.
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    That banking severely constrains register access patterns, which necessitates some kind of ORF or hardware managed operand cache.

    GCN RF doesn't need banking because it's just 256 2048-bit registers.

    My proposal for increasing GCN RF capacity is to have banks locked to hardware thread IDs. With 4 banks there would be a minimum of 4 hardware threads per SIMD. At maximum, 2 hardware threads per bank, which enables 32 hardware threads per CU, versus 40 in current GCN. Which also means no additional constraints on intra-thread register access patterns.

    In theory this layout for registers would hide a load of latency associated with RF<->memory operations, since there's significantly reduced contention on the RF between ALUs and memory ports. When RF<->memory operations are running, there's a worst-case 12.5% chance that they'll touch the RF bank that's currently feeding the ALUs. That's most likely to help texturing- and LDS-heavy kernels, I suppose, since they're both bursty in their RF interactions.

    Without a model for power or timing or area, it's just wishful thinking though.
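    For what it's worth, the contention rate under the simplest possible assumptions can be sanity-checked with a toy Monte-Carlo. Uniform random traffic over 4 banks is purely my assumption here; the 12.5% worst-case figure above reflects mapping and duty-cycle details this sketch does not model:

    ```python
    import random

    # Toy model of thread-ID-locked banking: the ALUs read one bank while
    # an RF<->memory transfer touches an independently chosen bank.
    # Uniform random bank choice is an illustrative assumption only.
    def conflict_rate(n_banks=4, trials=100_000, seed=1):
        rng = random.Random(seed)
        hits = sum(rng.randrange(n_banks) == rng.randrange(n_banks)
                   for _ in range(trials))
        return hits / trials

    print(conflict_rate())   # ~0.25: uniform traffic collides 1/n_banks of the time
    ```

    Pinning wavefronts to banks (rather than drawing both accesses uniformly) is exactly what would push the collision rate below this naive 1/n_banks baseline.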
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Whether an architecture needs a banking cache to avoid conflicts is not the question that was being asked when the idea for a statically or dynamically populated cache or register tier was originally mooted. The reduced contention was a possible bonus.
    The question was whether driving the bit lines of 64KB of SRAM on every access was energetically more expensive than driving them for 1-2KB for a subset of accesses. If not 64KB, would it start to appeal at 128KB?
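    A crude way to frame that question, assuming access energy scales roughly linearly with the capacity of the array actually driven — a first-order assumption; real bit-line energy depends heavily on banking and geometry:

    ```python
    # Toy energy model: a hit drives only the small operand tier; a miss
    # drives the small tier and then the full register file. Units are
    # arbitrary (1 unit per KB driven); all numbers are illustrative.
    def avg_energy(hit_rate, small_kb=2.0, big_kb=64.0):
        e_hit = small_kb
        e_miss = small_kb + big_kb
        return hit_rate * e_hit + (1 - hit_rate) * e_miss

    baseline = 64.0   # every access drives the full 64 KB register file
    for h in (0.5, 0.8, 0.95):
        print(h, round(avg_energy(h) / baseline, 3))
    ```

    Even this naive model only pays off at high hit rates, which is presumably why the paper leans on compile-time management of the hierarchy rather than a purely reactive cache.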

    As one potential data point for the direction register files may take:
    TSMC's 16nm FF+ process is apparently shifting to 512 bits per line, versus the current 256-bit SRAM scheme.
    Whether that 256 has something to do with the 256 physical registers in GCN, or it's a happy coincidence, I do not know. I feel that a design like a GPU, which tries to take the storage per mm of its process to its limits, may have more than coincidence to thank for that correlation.
     
  13. gamervivek

    Regular

    Joined:
    Sep 13, 2008
    Messages:
    805
    Likes Received:
    320
    Location:
    india
  14. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    417
    Likes Received:
    381
    I thought GCN had a distributed register file (256 4-byte registers x 4 interleaved set per lane) from day one. As far as I know, lanes are isolated from each other, as reflected by the ISA design, and all the cross-lane operations are done through the LDS network (some without the need of allocation). With these, I don't see why the register file would be a huge collection of 2048-bit registers in hardware.
     
  15. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    #1395 CarstenS, Jun 3, 2015
    Last edited: Jun 3, 2015
    Lightman and silent_guy like this.
  16. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
  17. Spyhawk

    Newcomer

    Joined:
    Oct 31, 2007
    Messages:
    76
    Likes Received:
    1
    Well, it's the biggest GPU ever made by either AMD or ATI to date... from what I've seen at several different websites, this thing is easily at least 600 mm²+.
     
  18. Spyhawk

    Newcomer

    Joined:
    Oct 31, 2007
    Messages:
    76
    Likes Received:
    1
    iMacmatician likes this.
  19. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,455
    Likes Received:
    471
  20. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria