AMD: Pirate Islands (R* 3** series) Speculation/Rumor Thread

Discussion in 'Architecture and Products' started by iMacmatician, Apr 10, 2014.

  1. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Yes, the CPU<->GPU coherence is not yet perfect. However, we have not seen a problem with that, since we sidestepped it completely by moving the whole graphics engine to the GPU. We don't need tight communication between the rendering and the game logic (physics, AI, etc). Obviously, for general purpose computing it would be a great improvement to get a shared CPU+GPU L3 cache and good cache coherence between the units (without needing to flush caches frequently or go through a slower-bandwidth bus). With "units" I mean both the CPU and the various parts of the GPU (such as the front end, vector units, scalar unit, etc). More coherence = fewer flushes needed = fewer stalls.

    On the GPU side, atomics go directly to L2. That could be improved to make some cases faster. However, in practice we use LDS atomics locally and then perform one global atomic to synchronize. This greatly reduces the number of global atomics you need. GCN LDS is highly optimized for atomics. GCN also has super fast cross-lane operations, allowing the developer to sidestep the need for atomics (and LDS accesses) in many common cases. Unfortunately only OpenCL 2.0 exposes this feature on PC (subgroup operations, see here: http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/). I am still sad that this feature didn't get included in DirectX 12 :(
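
    A minimal HLSL compute shader sketch of that local-then-global pattern (the buffer name and the per-thread test are made up for illustration, not taken from an actual engine):

    ```hlsl
    // "LDS atomics locally, then one global atomic to synchronize."
    RWByteAddressBuffer gCounter : register(u0);   // global counter at byte offset 0

    groupshared uint localCount;                   // lives in LDS

    [numthreads(64, 1, 1)]
    void CSMain(uint3 dtid : SV_DispatchThreadID, uint gtid : SV_GroupIndex)
    {
        if (gtid == 0)
            localCount = 0;
        GroupMemoryBarrierWithGroupSync();

        // Placeholder per-thread predicate; real code would do the actual work here.
        if ((dtid.x & 3) == 0)
            InterlockedAdd(localCount, 1);         // LDS atomic, cheap on GCN

        GroupMemoryBarrierWithGroupSync();

        // A single global (L2) atomic per thread group instead of one per thread.
        if (gtid == 0)
        {
            uint previous;
            gCounter.InterlockedAdd(0, localCount, previous);
        }
    }
    ```

    With 64-thread groups this turns up to 64 L2 atomics into a single one per group, which is the reduction described above.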

    I would also love to see ordered atomics in DirectX some day :)
    Heh, I was going to say that I want a full float instruction set and a full LDS read/write instruction set... but that I don't need memory write support :)

    But I see some important use cases for scalar unit memory writes, especially if the scalar unit cache is not going to be coherent with the SIMD caches.

    I am not a hardware engineer, so I don't know that much about the hardware-level power saving mechanisms. Those are often fully transparent to the programmer. I just try to get maximum performance out of the hardware. This is why I suggest things that allow the programmer to write new algorithms that perform faster. All kinds of hardware optimizations that allow shutting down unneeded transistors are obviously very much welcome.

    Register caching sounds like a good idea. The majority of the registers are just holding data that is not needed right now, as GPUs don't have register renaming or register spilling to the stack. A register cache should allow a considerably larger register file (a little bit farther away) while keeping performance and power consumption similar. This would definitely help GCN.
    16-bit fields are also sufficient for unit vector processing (big parts of the lighting formulas) and for color calculations (almost all post processing). No need to waste double the amount of bits if you don't need them.
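
    As a rough illustration (mine, not from the post), HLSL's minimum-precision types added around D3D 11.1 are the closest thing a PC shader could use for this: the driver is free to execute them at 16 bits where the hardware supports it.

    ```hlsl
    // Sketch: color math like this tone-mapping curve rarely needs more than 16 bits.
    min16float3 ToneMapReinhard(min16float3 hdrColor)
    {
        return hdrColor / (hdrColor + min16float3(1.0, 1.0, 1.0));
    }
    ```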
     
  2. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    3dilettante and a few other hardware-oriented guys (sorry, forgot who) were discussing the GCN tessellator and geometry shader design in another thread a year or two ago. The conclusion of that discussion was that the load balancing needs improvements (the hull, domain and vertex shaders have load balancing issues because of the data passing between these shader stages). Those guys could pop in and give their in-depth knowledge. I have not programmed that much for DX10 Radeons (as consoles skipped DX10). GCN seems to share a lot of geometry processing design with the 5000/6000 series Radeons.

    I have programmed a Radeon 5870 at home, and that GPU is super slow in many things that GCN handles without any problems. For example, dynamic indexing of a buffer (SRV) in a shader makes my test code 6x slower compared to constant buffer indexing (even for very small, cache-friendly data sets). Needless to say, I really like some of the improvements in GCN.
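
    For reference, a stripped-down sketch of the two access patterns being compared (illustrative declarations, not the actual test code):

    ```hlsl
    // Same dynamic index, two storage classes. On the Radeon 5870 the SRV path
    // measured roughly 6x slower; GCN handles both without trouble.
    Buffer<float4> srvData : register(t0);      // shader resource view (SRV)

    cbuffer IndexedData : register(b0)
    {
        float4 cbData[1024];                    // constant buffer array
    };

    float4 FetchFromSRV(uint i) { return srvData[i]; }
    float4 FetchFromCB(uint i)  { return cbData[i]; }
    ```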
     
    TheAlSpark likes this.
  3. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    702
    Likes Received:
    588
    Location:
    55°38′33″ N, 37°28′37″ E
    Call it "ALU" or "simple core" - as long as each of them still has its own register memory, it's a matter of classification.

    Intel's approach may be more "pure", but it is necessarily so to give developers the ability to program against the familiar x64/AVX-512 instruction set. On the other hand, a Xeon Phi 7100 costs $4000 and gives you ~1.2 TFLOPS consuming 300 W, while Hawaii (R9 290X) is $500 and ~6 TFLOPS for the same 300 W, at least theoretically.

    It's only been 25 years since I first saw similar theoretical papers explaining how RISC/VLIW is so much better than CISC for improving ILP :)
     
  4. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    A better compiler would reduce the VGPR usage; no hardware modifications required. If the scalar unit were more robust and the compiler even better, scalar offloading would additionally reduce the VGPR usage a bit. I would also want to see full support for 16-bit integer and 16-bit float types (to halve the RF usage for these types of data). When these cheap improvements (HW-wise) are exhausted, I would love to see a bigger RF size per work item. A register cache would be a good solution to allow a bigger register file with no performance or power issues. But I don't think we need a 4x bigger RF if we get all these other improvements first. In that case, I would be perfectly happy with a 2x larger RF per work item.
    Logically, GCN SIMT is 64 wide. Waves are 64 wide, and branching and memory waiting happen at wave granularity. The SIMD execution units, however, are 16 wide, so one instruction is executed over 4 cycles (in a pipelined round-robin manner). AMD has good papers and presentations about the GCN execution model available.
    Older AMD GPUs did this as well. The jump instruction could check the execution mask and, if the mask was filled with zero bits, jump over the code completely. Xbox 360 HLSL even had some high-level constructs to control whether you wanted only a jump or a jump + predication. See here: https://msdn.microsoft.com/en-us/library/bb313975(v=xnagamestudio.31).aspx
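
    On PC, the closest standard HLSL equivalents are the [branch] and [flatten] attributes; a minimal sketch (illustrative function, not taken from those docs):

    ```hlsl
    // [branch] asks for a real jump: if the execution mask is all zeros for the
    // wave, the whole block is skipped. [flatten] would force predication instead,
    // evaluating both paths and selecting per lane.
    float4 ApplyFog(float4 color, float fogFactor, bool fogEnabled)
    {
        [branch]
        if (fogEnabled)
            color.rgb = lerp(color.rgb, float3(0.5, 0.6, 0.7), fogFactor);

        return color;
    }
    ```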
     
    TheAlSpark likes this.
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,803
    Likes Received:
    2,064
    Location:
    Germany
    Multiple, in fact. That was their mantra from mid-2011 on.
     
  6. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    702
    Likes Received:
    588
    Location:
    55°38′33″ N, 37°28′37″ E
    So this again boils down to AMD's conscious underengineering approach to driver performance and native code generator optimisations...


    Sorry, I don't think it's a good idea from a hardware designer's point of view.

    While registers and cache memory serve the same purpose, i.e. provide fast transistor-based memory to offset slow capacitor-based RAM, they belong to very different parts of the microarchitecture and are implemented with very different design techniques.

    Registers are part of the instruction set architecture: they are much closer to the ALU and are directly accessible from the microcode, so they are "hard-coded" as transistor logic. Caches are a part of the external memory controller, accessible through the memory address bus only, designed in blocks of "cells", and they are transparent to the programmer and not really a part of the ISA.

    While it may initially seem attractive for some reasons to implement a "register file" as a continuous block of fast SRAM cache and allocate registers on demand, the utter complexity of such an implementation will kill any benefits, more so than just adding more registers to the ISA or implementing a common pool/file with register renaming. Not to mention that registers give you very fast 1-clock access, while even the fastest L1 caches require a few clocks.
     
    TheAlSpark and Grall like this.
  7. AlexV

    AlexV Heteroscedasticitate
    Moderator Veteran

    Joined:
    Mar 15, 2005
    Messages:
    2,528
    Likes Received:
    107
    The "call it whatever you want to, it's OK, naming is academic anyway" attitude is an issue (and part of why more mature fields have issues taking the whole GPGPU story seriously). The terms are not substitutable, and the difference goes back more than 25 years - really, it's a solved issue. I fail to see what is relevant in the pricing of respective cards, and counting FLOPs is also misleading. They are in different markets - FirePro is comparatively priced. The discriminant is not whether an ALU has or does not have some SRAM attached (sidenote, even the way in which the RF is partitioned, which is not per-lane independent, should be an indication). Amusingly, if the most solid illusion of "an ALU is a core" is provided by Intel's Gen, but that's a pretty...eccentric piece of hardware to begin with. Finally, if you have time parsing the linked presentation might be useful. Especially since in some regards Glew is more supportive of your views than mine. Since we're moving the discussion in a different direction, I wonder whether or not I should fork this (B3D denizens, opinions?).
     
    CarstenS likes this.
  8. SimBy

    Regular Newcomer

    Joined:
    Jun 21, 2008
    Messages:
    502
    Likes Received:
    135
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,803
    Likes Received:
    2,064
    Location:
    Germany
    Cores is the new MEGAhurtz (since ca. 2006).
     
  10. Nemo

    Newcomer

    Joined:
    Sep 15, 2012
    Messages:
    125
    Likes Received:
    23
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,803
    Likes Received:
    2,064
    Location:
    Germany
    OC result? At least you cannot buy those from SK-Hynix right now.
     
  12. Nemo

    Newcomer

    Joined:
    Sep 15, 2012
    Messages:
    125
    Likes Received:
    23
    A custom version of HBM by AMD & SK Hynix, I think.
     
  13. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    546
    Likes Received:
    182
    Hynix's public part catalog has been outdated or just plain wrong before. In the short term, there is one company that is going to buy HBM from them, so they have no need to keep very accurate catalogs.

    However, isn't the reported clock of the first-gen HBM going to be the non-DDR clock, or 500ish MHz?
     
  14. HKS

    HKS
    Newcomer

    Joined:
    Apr 26, 2007
    Messages:
    31
    Likes Received:
    14
    Location:
    Norway
    Or a fake.
    Why would AMD confirm to several news sites that 4 GB is the maximum if it isn't true?
     
  15. SimBy

    Regular Newcomer

    Joined:
    Jun 21, 2008
    Messages:
    502
    Likes Received:
    135
    It is most likely fake. As for AMD feeding BS to news sites to mislead the competition - that's not completely unlikely.
     
  16. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,172
    Location:
    La-la land
    Outright lying to the media would be very much frowned upon pretty much universally, I would think.

    Better to just say "no comment" if you don't want to confirm the amount of RAM your new unannounced graphics card has.
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,803
    Likes Received:
    2,064
    Location:
    Germany
    Not very convincing, IMO. Maybe their catalogues have been wrong, but there's HBM listed in there right now, albeit at 1 Gbps (500 MHz). The slower 0.8 Gbps HBM got removed recently, but they forgot to add the more interesting higher speed grade? I am not convinced.
     
  18. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,499
    Likes Received:
    919
    Did AMD actually go on record saying this?
     
  19. SimBy

    Regular Newcomer

    Joined:
    Jun 21, 2008
    Messages:
    502
    Likes Received:
    135
    I'm not sure they ever explicitly said it's gonna be 4 GB, period. AFAIK they kinda bullshitted around the question, implying it's gonna be 4 GB.
     