AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    308
    Likes Received:
    80
So each Compute Unit has 64 threads (SPs) but can be split into two for two issues of 32 SPs each, and then each workgroup has some "local data share" block that does... something with cache/IO/whatever. Alright then. So the higher level does look like those odd one-off mobile Vegas with 20 CUs for MacBooks; it's just that the naming conventions threw me off. Thanks!
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    Just a bunch of things that occurred to me in the order I skimmed the slides (no worries if not covered):
    Is there a clear count of the number of shader engines? The diagram seems to have the GPU divided into two, but could it be four, going by the way the CU arrays are arranged in blocks of 4 with their own rasterizer and primitive blocks?
    What specifically is in the purview of the primitive unit, rasterizer, and geometry processor? What's been "centralized" in Navi versus how shader engines each had a geometry processor?
    The GCN CU had a branch/message block that the RDNA diagram didn't include. Any mention of that in the briefing, or just artistic oversight?
    Was there more detail on what changed with the LDS now that it's apparently spanning two CUs?
    Perhaps too esoteric, but was there any discussion about GCN instructions that might have been removed or changed due to the SIMD changes (ex. DPP has various 16-wide strides, and LLVM does mention possibly discarding some instructions for handling branches or skipping instructions (VSKIP, fork and join instructions))?
    Did they mention register bank conflicts or a register reuse cache in the briefing?
    The slide on the RDNA SIMD Unit states "up to 20 wave controllers". Does this mean up to 20 wavefronts could be assigned to a single SIMD?
    Details on the L0 and L1 caches? Is the L0 just the old L1 with a new name? How many accesses or how many WGPs can the L1 service per clock?
    Is this a write-through hierarchy from L0 to L1 to L2? Did AMD outline possible changes in how it handles cache updates or memory ordering?
    The centralized geometry processor has 4 prim units, but those are distributed across shader engines?
    What does it mean by uniformly handling vertex reuse, primitive assembly, index reset, etc.? Does that mean that single geometry processor block does all of that, or is it responsible for farming out surface/vertex shaders across the die?
    It uniformly distributes pre and post tessellation work, so it controls where the hull and domain shaders go? Where's the tessellation part located?
    Async Compute Tunneling: what makes it able to drive down the amount of lower-priority work so completely? Is it able to context-switch existing waves out, or if not, what made GCN less effective in draining wavefronts?

    Even though node gains are usually more modest than marketing would purport, AMD's slide attributes ~15% of the 50% improvement to the node, which translates to a single-digit percentage gain credited to the jump from GF's 14nm. It seems AMD ate up most of the possible improvement by staying well past the point of diminishing returns for circuit performance versus power consumption.
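To put numbers on that slide's breakdown (reading the ~50% perf/W figure and the ~15% node share off the slide, so treat both as approximate marketing values):

```python
# Rough sanity check of the slide's perf/W breakdown. Both inputs are
# approximate figures as read from AMD's slide, not measured values.
total_gain = 0.50   # claimed overall perf/W improvement
node_share = 0.15   # fraction of that gain attributed to the 7nm node

node_contribution = total_gain * node_share
print(f"Node-attributed gain: ~{node_contribution:.1%}")  # ~7.5%, single digit
```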

    There may be some elements similar to the claims, though some like a register destination cache are mentioned in the LLVM commits and not in the slide deck. I'm not sure how much the RDNA SIMD layout matches. For one, the ALU arrays are described in terms of being narrower than SIMD16, with SIMD8 being mentioned as an example. The patent's suggestion of a superscalar or long instruction word encoding to help capture unused register cycles isn't mentioned, nor is there a clear instance of an instruction being split across one type of SIMD block at the same time as another--the SFU in the slide deck appears to operate in parallel and independently of the main ALUs. There appears to be a mode that promotes a SIMD32 issue into a two-cycle form, not one that allows for two instructions to issue simultaneously on the same SIMD.
    The register file and register destination cache figure more heavily in a separate register file patent that is less dependent on the SIMD organization, and there is brief mention of elements from it like the register cache and register bank conflicts. The super SIMD patent didn't necessarily require that the banking was visible to software, and it seemed to regard the registers as being addressed as rows stretching across all four banks, rather than each bank's row being a different register ID.

    The LLVM code changes include latency figures that indicate that the overall process of reading operands from the register file and issuing an instruction is 1 cycle longer than it used to be, regardless of the addressing mode.
    AMD's slides don't mention bank conflicts based on the register ID, though the LLVM changes do. AMD's instruction issue slide doesn't appear to be using conflicting register IDs, however.

    The traditional GCN microarchitecture matched issue latency to execution and forwarding latency, whereas Navi's implementation has removed that issue limitation. GCN's long-held promise that near-zero thought had to be dedicated to ALU hazards and forwarding has been abandoned. This, plus the extra vector register read cycle, may point to streamlining some of the internal pipelining needed to get everything flowing in the 4-cycle cadence, and perhaps a sacrifice in circuit depth per stage to get better clock speed with a 5-cycle latency pipeline.
    The most immediate upshot seems to be that the scalar unit's much quicker execution and forwarding latency are no longer dependent on the vector path's cadence. If not crossing domains, it might allow for stretches of setup code to run chains of serial scalar code without that stall.
    The LLVM speed model is interesting in that it has mostly the same latency numbers as prior generations, just multiplied in cycle count by 4--aside from the vector ALU path having an extra register file cycle. In that regard, there's an updated instruction issue and sequencing element to the uarch, but it's plausible that many of the pipeline paths haven't diverged from the 4-cycle execution and forwarding pattern.
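A purely illustrative sketch of that "same latencies, multiplied by 4" reading of the LLVM model — the instruction classes here are made up for the example, and the extra vector register-file cycle is an inference from the commits, not a documented figure:

```python
# Illustrative only: converting latencies expressed in GCN's 4-cycle issue
# slots into RDNA's per-clock counts, per the reading of the LLVM changes
# above. Instruction class names are hypothetical.
gcn_latency_slots = {"valu_simple": 1, "valu_double": 2}  # in 4-cycle slots

def rdna_latency(op, slots):
    cycles = slots * 4            # same latency, now counted per clock
    if op.startswith("valu"):
        cycles += 1               # assumed extra vector register read stage
    return cycles

for op, slots in gcn_latency_slots.items():
    print(op, rdna_latency(op, slots))   # valu_simple -> 5 cycles
```

The single-slot case landing on 5 cycles lines up with the 5-cycle vector pipeline mentioned above.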

    It does seem like a big factor is the ability for a single workgroup to leverage 4x the bandwidth that would have been possible in the past with single-CU workgroup allocation. It seems like it might help versus the competition, since Volta and Turing did upgrade their cache bandwidth. If the LDS is shared, I wonder if this also means the workgroup barriers that used to be per-CU are also shared. The LLVM changes do mention there are subtleties to accessing two L0 caches, given the memory hierarchy's weak consistency.
    The slide with the caches and LDS seems a little ambiguous as to how much the LDS has been upgraded, and whether it might have tweaks to let a workgroup shuffle data a little more efficiently between its two sides.
    Without putting much thought into it, I wonder if some of the more complex merged shader stages like primitive shaders might benefit from this. One half of a workgroup could start the culling process and use the LDS to hand non-culled vertices to the vertex-processing half of the same shader, rather than having to switch back and forth between shader phases.

    The cache slide shows just how much bandwidth there is internal to the cache hierarchy, which I hope people pay heed to when planning a chiplet-based future. The infinity fabric is still at the far end, handing data from one channel to the nearest L2 slice, mostly. The L2/L1 fabric carries more than twice the bandwidth, and a single dual-CU set of data paths could generate more than half of the bandwidth of the whole GDDR6 subsystem--and each L1 has 5 of them.
    Trying to take any of the internal groupings out of the die is even more expensive than it was with GCN.
    Some of my earlier questions about how heavily loaded the L2/L1 fabric was appear to be answered by the addition of the L1, and clients like the RBEs and rasterizer hang off the L1 rather than the L2.
    It doesn't seem like Navi has entirely dispensed with the old way of distributing work to the shader engines and rasterizers, since the geometry block has arrows to them that do not go through a cache.

    With respect to the lower-latency hierarchy:
    Per a GDC 2018 presentation on engine optimizations (https://gpuopen.com/gdc-2018-presentations/, one by Timothy Lottes), a GCN vector L1 hit has a latency of ~114 cycles. So I guess RDNA dropping its cache hit latency down to 90 cycles is better than nothing. Going by the reverse-engineering of Volta (https://arxiv.org/pdf/1804.06826.pdf), AMD is about three times slower rather than about four times slower at the L1. GPU L1s aren't speed demons, obviously, and Volta appears to have latencies close to GCN's at the L2 and beyond.
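The ratios behind the "three times rather than four times" remark, assuming a ~28-cycle Volta L1 hit latency as reported in the cited arXiv paper (treat all three numbers as approximate):

```python
# L1-hit latency comparison. GCN and RDNA figures are from the sources
# cited above; Volta's ~28 cycles is taken from the arXiv reverse-
# engineering paper and is approximate.
gcn_l1, rdna_l0, volta_l1 = 114, 90, 28

print(f"GCN vs Volta:  {gcn_l1 / volta_l1:.1f}x slower")   # ~4.1x
print(f"RDNA vs Volta: {rdna_l0 / volta_l1:.1f}x slower")  # ~3.2x
```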
     
    AlBran and Lightman like this.
  3. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    142
    Likes Received:
    77
    Sounds kind of like Bulldozer to me :-D

     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    For the most part, if you look at prior GCN, it would have most of those features. GCN shared its front end and scalar caches between up to 4 CUs.
    The LDS spanning two CUs is new, since it's storage and a software-visible address range that the hardware isn't isolating to a single CU like it once did.

    One thing I'm curious about is how the GCN CU went from an apparently single scheduler block to there being schedulers per SIMD.
    I suspect that while GCN did have a fair chunk of scheduling hardware that rotated between SIMDs, the various SIMDs and other blocks may have had some minor sequencing to get them through to the next issue cycle.

    What scheduling or issue capability still physically connects the SIMDs in RDNA, since the more they autonomously decide on their execution path and scheduling, the more they appear like a core?
     
  5. rSkip

    Newcomer

    Joined:
    Jan 10, 2012
    Messages:
    8
    Likes Received:
    14
    Location:
    Shanghai
    SIMD & Wave execution:
    GCN: CU has 4 x SIMD16, Wave64 execute on SIMD16 x 4cycles.
    RDNA: CU has 2 x SIMD32, Wave32 execute on SIMD32 x 1cycles.

    LDS:
    GCN: 10 Wave64 on Each SIMD16, 2560 threads per CU. 2560 threads (1CU) share 64KB LDS.
    RDNA: 20 Wave32 on Each SIMD32, 1280 threads per CU. 2560 threads (2CU) share 64KB LDS.

    Shared cache:
    GCN: 4 schedulers & 4 scalar units (4CU) share I$, K$
    RDNA: 4 schedulers & 4 scalar units (2CU) share I$, K$
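    Working through the occupancy and LDS numbers above (taking the figures as stated; real limits also depend on register and LDS allocation per wave):

```python
# Occupancy and LDS-per-thread implied by the figures above.
def stats(waves_per_simd, wave_size, simds_per_cu, lds_kb, cus_sharing_lds):
    threads_per_cu = waves_per_simd * wave_size * simds_per_cu
    threads_on_lds = threads_per_cu * cus_sharing_lds
    return threads_per_cu, lds_kb * 1024 / threads_on_lds  # bytes/thread

gcn  = stats(10, 64, 4, 64, 1)   # 4x SIMD16, Wave64 over 4 cycles
rdna = stats(20, 32, 2, 64, 2)   # 2x SIMD32, Wave32 in 1 cycle
print(gcn)    # (2560, 25.6) -- threads per CU, LDS bytes per resident thread
print(rdna)   # (1280, 25.6)
```

    Interestingly, the LDS bytes available per resident thread come out the same either way; what changes is how many threads sit on one LDS allocation domain.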
     
    DeeJayBump, Cat Merc, iamw and 7 others like this.
  6. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,099
    Likes Received:
    533
    Location:
    France
    I'm waiting for reviews. But the best-case-scenario benchmarks they showed were "meh", and the prices are the same story. Looks like a good competitor against the 1080 non-Ti... They're late as usual. But reviews will tell; I hope I'm wrong.

    "Gaming hz", love this one..
     
  7. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    12,047
    Likes Received:
    8,163
    Location:
    Cleveland
    Let's leave wishes, desires, and talk of other products out of this thread, and get back to AMD's Navi...
     
  8. Ryan Smith

    Regular Subscriber

    Joined:
    Mar 26, 2010
    Messages:
    605
    Likes Received:
    1,020
    Location:
    PCIe x16_1
    Most of what you're asking is beyond what I was briefed on and is beyond my own expertise. But I'll answer what I can.
    Yes. There are 2 shader engines.
    I believe a lot of this is stylistic, but a lot of work has gone into improving their work (re)distribution. It's something I need to look more into.

    A new feature called Priority Tunneling has been added. Notably, this is not context switching. But it does allow the AWS to go to the top of the execution pipeline and block any new work being issued, so that it can be drained and a compute workload started immediately thereafter.
     
    AlBran and Lightman like this.
  9. Globalisateur

    Globalisateur Globby
    Veteran Regular

    Joined:
    Nov 6, 2013
    Messages:
    2,778
    Likes Received:
    1,542
    Location:
    France
    The Radeon 5700 has 4 SEs with 10 workgroup processors each, 40 CUs in total, and 4 CUs deactivated. How do they deactivate 4 CUs in the 5700 if the CUs are grouped by 2 with shared cache? Can they deactivate only one CU in a Workgroup Processor? If yes, what happens with the shared cache?
     
  10. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,283
    Likes Received:
    224
    It has 2 shader engines.
     
    Globalisateur likes this.
  11. Globalisateur

    Globalisateur Globby
    Veteran Regular

    Joined:
    Nov 6, 2013
    Messages:
    2,778
    Likes Received:
    1,542
    Location:
    France
    Thanks! How did I miss this? :roll:
     
  12. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    336
    Likes Received:
    294
    It's not a "dual compute unit" to start with. It's 4 SIMD units with a native wave size of 32 merged into one compute unit. And while the slides don't state it, it appears reasonable to assume that everything right of the LDS isn't actually bound to a specific SIMD unit / pair, but shared for the whole CU.

    Looks like an artistic choice, as the 2x32 slice was apparently easier to compare to the previous 4x16 configuration than the full 4x32 configuration. Take everything on slide 13 x2, and you have the real numbers for RDNA. This factor is then represented in slide 20.

    I must assume that the "2x registers" and "2x ALU" are limited to the VGPRs / vector units, and refer to each subgroup of 32/64 threads being strictly local to a SIMD unit. Scalar registers and scalar instructions may not have been kept to a single scalar unit, but rather fully replicated across up to all 4 SIMD groups.
     
    BRiT likes this.
  13. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    308
    Likes Received:
    80
    And I thought mine were a nightmare to respond to... :p
    Anyway, the whole thing seems to be extremely hierarchical now, perhaps for cost reasons, though maybe they're hoping for chiplets at some point. Compute units issue two 32-thread wavefronts. The data share now sits between two of these compute units. Five of those pairs are in each upper block, which shares an L1 cache, rasterizer, etc. Two of those blocks make up a "shader engine", which is separate from the geometry processor etc. What exactly constitutes a "Shader Engine" then isn't clear to me either, unless the L2 cache and memory controllers are accessible by the whole "Shader Engine" instead of by half. But the diagrams do appear accurate, so what you see is indeed what you seem to get.

    Perfectly true; on current finFET nodes, power efficiency always advances far more than any available frequency increase. I'm not exactly sure of the physics here, but some tipping point in the finFET gate structure makes power draw go exponential at around the same frequency regardless of feature size. I'd expect the upcoming Zen 2 mobile/Navi 20 CU parts to be far, far more power efficient even than mobile Vega was. So Intel's "we can match AMD in mobile GPU!" claim isn't going to last long at all.

    This is why I'd guess at no GPU chiplets anytime soon. The CPU chiplets AMD uses with Zen 2 have zero direct interconnects; it's all Infinity Fabric, and that bandwidth just isn't enough for a GPU.

    In fact, cache problems in modern architectures remind me of the rocket equation, with cache playing the role of fuel: cache grows exponentially while logic grows linearly, so eventually cache is just going to dominate die space altogether versus logic. Then at some point not far after, you can't make improvements at all, as for every N logic improvements you make you might need 2N, or N^2, or whatever bigger cache just to feed new instructions to the logic. Other vendors aren't immune from this either; Nvidia has an ever-growing cache-versus-logic problem as well, and from what I recall a huge reason Apple's CPUs are so fast is all the work on their memory/cache systems.

    Some big, very different-looking change is going to need to happen with regard to accessing memory if computers are going to keep getting faster. I can hardly imagine what you'd need if graphene or some other 2D material replaces silicon and suddenly you can clock to 100 GHz or more.
     
    Ike Turner likes this.
  14. ttnuagmada

    Joined:
    Jun 12, 2019
    Messages:
    1
    Likes Received:
    0
    Anyone have any insight as to why the transistor count nearly doubled over Polaris?
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,069
    Likes Received:
    2,739
    Location:
    Well within 3d
    So it would seem that in prior GPUs the high-priority queue could arbitrate for most--but not all--of the workgroup launch slots that would become available during the lifetime of a high-priority task? Then the AWS can more completely monopolize the shader engines.

    If by "right of the LDS" you mean the texture blocks and L0, I think there is some evidence that those are not shared.
    The LLVM changes specifically point out that in workgroup processor mode the two halves of a WGP will not see a consistent view of memory, because the L0 is per-CU and each half may hold possibly inconsistent versions of data that the other half's L0 cannot see. Cache invalidation or some other kind of synchronization is necessary to get correct behavior out of vector memory accesses in that mode.
    There may also be some other subtleties to the hardware IDs and resource management that recognize the CUs separately. Everything to the left of the LDS seems to be more independent already, so a wavefront running independently on a SIMD may behave much the same whether it's independent of the SIMD next to it or of a SIMD in the next CU. Whether there are operations, side effects, or hardware settings that might have a more immediate effect within a CU isn't clear at this point.
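The hazard from the two per-CU L0s can be pictured with a toy model — this is an analogy for the weak-consistency issue described above, not a model of the actual hardware, and the class and method names are invented for the sketch:

```python
# Toy model: two per-CU L0 caches over one shared backing store, showing
# why explicit invalidation is needed in workgroup processor mode.
# Purely illustrative; not real cache hardware.
class L0:
    def __init__(self, backing):
        self.backing, self.lines = backing, {}
    def read(self, addr):
        if addr not in self.lines:          # miss: fill from backing store
            self.lines[addr] = self.backing[addr]
        return self.lines[addr]             # hit: possibly stale data
    def write(self, addr, val):             # write-through to backing
        self.lines[addr] = val
        self.backing[addr] = val
    def invalidate(self):
        self.lines.clear()

mem = {0x100: 1}
cu0, cu1 = L0(mem), L0(mem)     # the two halves of one WGP
cu1.read(0x100)                  # CU1 caches the old value
cu0.write(0x100, 2)              # CU0 writes through its own L0
stale = cu1.read(0x100)          # still 1: CU1's L0 was never told
cu1.invalidate()                 # the synchronization step
fresh = cu1.read(0x100)          # now 2
print(stale, fresh)              # 1 2
```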

    All nodes have an inflection point where power consumption goes super-linear, though finFETs may have a more pronounced rise past it.
    AMD and others have warned that certain circuit parameters are not improving much, such as wire resistance and capacitance.
    The wire component is not governed by the gate type, but finFETs do complicate the latter.
    Zen 2's designers commented that even the modest clock gains for the new core did take special effort to achieve in the face of poorer scaling of some facets of circuit performance. GCN's clock speeds are still far from the realm of those CPUs, but it's likely working with transistors sized for higher density and deals with a pipeline design that has more layers of logic and more distance for signals to travel versus CPU cores that tune things more narrowly.

    The challenge is data movement: both moving enough of it and moving it as short a distance as practical. In that regard, logic can easily scale demand without concerning itself with how it can be fed efficiently.
    GPUs don't quite hit the cache levels of CPUs because they focus on a particularly mathematically dense set of workloads, and also have higher demands in terms of raw bandwidth versus CPU caches whose hit rates are driven more by latency.

    More L2, additional L1s, more scalar hardware, more features in the CU, new memory type, more command processor and geometry hardware, more ROPs.
    Higher clock targets can mean more transistors, as most of Vega's transistor gains over Fury were credited to buffers and wire-delay improvements rather than extra features.
    The new node may have favored new implementations of logic blocks that added to the transistor count versus 14nm.
     
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,777
    Likes Received:
    2,021
    Location:
    Germany
    2 SEs, 4 arrays (not pictured); data can in fact transition between arrays, but it has to stay in the same SE.
     
    Digidi likes this.
  17. Urian

    Regular

    Joined:
    Aug 23, 2003
    Messages:
    621
    Likes Received:
    55
    Am I the only one who sees an almost Tile Renderer?
     
  18. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    142
    Likes Received:
    77
    To be precise, yes. There are 2 Shader Engines, with 4 clusters total (each Shader Engine has two); each SE contains 10 Workgroup Processors, each WGP contains two CUs, and one CU has 64 SPs.

    ??
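    The totals that hierarchy implies (for the Navi 10 configuration described above):

```python
# Totals implied by the hierarchy: 2 SEs x 10 WGPs x 2 CUs x 64 SPs.
shader_engines = 2
wgps_per_se = 10
cus_per_wgp = 2
sps_per_cu = 64

cus = shader_engines * wgps_per_se * cus_per_wgp
print(cus, cus * sps_per_cu)   # 40 2560
```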
     
  19. Pinstripe

    Newcomer

    Joined:
    Feb 24, 2013
    Messages:
    53
    Likes Received:
    22
    Do we know anything about DX12.1 feature levels? Is it still all Tier levels up, like Vega, or is there any regression? We know VRS won't be supported; anything else?
     
  20. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    1,459
    Likes Received:
    636


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.