AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

  1. firstminion

    Newcomer

    Joined:
    Aug 7, 2013
    Messages:
    217
    Likes Received:
    46
    Lightman likes this.
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    It simplifies what the software has to do in order to support the multi-chip GPU. The cache invalidation process is heavyweight enough that I'm not sure about the extent of the delay that could be added by another GPU performing the same invalidation. The microcode processors and separate hierarchies wouldn't normally be trying to check externally, so there would be additional synchronization on top of that.

    Zen's coherent caches and generalized fabric make it generally the case that execution will proceed equivalently regardless of which CCX or chip its processes may span. It's true that it doesn't pretend to be one core, but that's where the graphics context differs. A GPU has a heavier context and abstraction, since it is actually a much less cohesive mass of processors.


    The Infinity Fabric itself is agnostic to the topology chosen for an implementation. If the topology that moves ROPs off-chip differs from the more recently discussed systems with constrained connectivity to the stacks, it can prevent general-purpose memory traffic from being blockaded by the mostly unidirectional traffic of the export bus. The challenge of taking a broad bus off-chip is that the orders-of-magnitude worse power and physical constraints reveal which costs the design used to treat as effectively "free", or which elements of the design are not prepared to deal with arbitration involving a sea of formerly invisible competing clients if it all goes over a single general-purpose connection.


    GPUs generally tolerate latency in the data path to DRAM. GCN's wait counts give some idea of the relative weights the architecture assigns to the vector path to heavily contended external memory versus the bespoke internal paths it places hardwired limits on. I would expect a change in the costs and behavior of exports to require re-evaluating the bits devoted to export counts. Vega has shifted things substantially in favor of tolerating vector memory stalls, given the extra bits it gave to VMCNT.
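
    For a sense of the bit budgets involved, here is a small sketch using the wait-count field widths as I recall them from the public ISA documents (pre-Vega VMCNT at 4 bits, Vega at 6 bits, EXPCNT at 3 bits); treat the exact widths as my recollection rather than a definitive reference:

```c
/* Illustrative only: s_waitcnt field widths as I understand them from the
 * public GCN ISA docs. The point is the asymmetry: far more outstanding
 * vector memory operations can be tracked than outstanding exports. */
#include <stdio.h>

static unsigned max_outstanding(unsigned bits) { return (1u << bits) - 1u; }

int main(void) {
    printf("pre-Vega VMCNT (4 bits): up to %u outstanding vector memory ops\n",
           max_outstanding(4));
    printf("Vega VMCNT     (6 bits): up to %u outstanding vector memory ops\n",
           max_outstanding(6));
    printf("EXPCNT         (3 bits): up to %u outstanding exports\n",
           max_outstanding(3));
    printf("LGKMCNT        (4 bits): up to %u outstanding LDS/GDS/scalar ops\n",
           max_outstanding(4));
    return 0;
}
```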

    AMD has shown a desire to push the ratio of intra-PIM bandwidth to external bandwidth as high as it can, since the peak bandwidth, power efficiency, and ability to adjust the implementation details internally are greater. How extreme would the ratio have to be in order to deeply cut into the external path's effectiveness? The logic base die would likely be working generally shorter lengths and feature dimensions 30-100x finer than the interposer (5-7nm vs ~65nm?), and the bump pitches are orders of magnitude worse than that.
    The vertical path to the DRAM, required power/ground, and die area lost to the vertical interfacing is linked to the limits of the external connectivity of a JEDEC-defined DRAM standard.

    The most recent discussion centered on solutions without interposers--which for this use case are even worse-off, but that aside I am fine with 2.5D and/or 3D integration and manufacturing.
    If there is a specific form that I have doubts about, it is the use of un-elaborated 2.5D silicon interposers specified for hosting HBM stacks--and in some ways my jaundiced opinion is specifically so for AMD.
    I get why that was the choice, but it's a solution that provides an optimization in one area--trace density, and causes significant de-optimizations across many axes. The various other interesting directions that an interposer implementation could take are left un-exercised, and the tech isn't sufficiently capable to do much more than what it accomplishes.
    All the while, we're seeing the progression of solutions that avoid, remove, or minimize the amount of silicon used becoming good enough to supplant it for HBM connectivity.

    The spartan 2.5D AMD uses is insufficient for its purposes, which its various proposals tacitly admit. Its supposed doubling down with chiplets and active interposers has a much more serious risk profile and a long time horizon, however.

    There's still latency. The shader engine front end pipeline and workgroup/wavefront arbitration path are significantly less parallel. Their interactions with more of the traditional graphics abstractions and command processor elements relative to CU-dominated compute means there are interactions with elements with low concurrency and wide scope.
    The ROPs traditionally were meant to brute-force their high utilization of the DRAM bus with their assembly-line handling of import/export of tiles, which is less about having the capacity to handle unpredictable latency than it is having a generally consistent stream coming over the export bus.

    The L1s are rather small and thrash-prone, too much so to insulate the CUs from anything more than on-die L2 latency and the current scatter/gather behavior. Possible corner cases: lookup tables/textures, distant entries in a hierarchical detail structure, compression metadata, etc. The coalescing behavior of the L1 is also rather limited, last I checked. Adjacent accesses within a given wave instruction could be coalesced, but even mildly out-of-order yet still equivalently coalesce-able addresses might translate into separate L2 hits.
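
    As an illustration of that coalescing limitation, here is a toy model (my own merge rule, not AMD's documented behavior): an in-order-only coalescer turns a shuffled but still single-cache-line access pattern into many separate requests, where an ideal coalescer would issue just one.

```c
/* Toy model of the coalescing behavior described above -- not AMD's
 * documented hardware rule, just an illustration of how an in-order-only
 * merge rule turns a shuffled (but still single-cache-line) access
 * pattern into many separate requests. */
#include <stdio.h>

#define LANES 16
#define LINE  64u

/* naive rule: lane i+1 joins the current request only if its address is
 * exactly the next word after lane i's address */
static int naive_requests(const unsigned *a, int n) {
    int reqs = 1;
    for (int i = 1; i < n; ++i)
        if (a[i] != a[i - 1] + 4u)
            ++reqs;
    return reqs;
}

/* ideal rule: one request per distinct 64-byte line, order ignored */
static int unique_lines(const unsigned *a, int n) {
    int count = 0;
    for (int i = 0; i < n; ++i) {
        int seen = 0;
        for (int j = 0; j < i; ++j)
            if (a[j] / LINE == a[i] / LINE) { seen = 1; break; }
        if (!seen) ++count;
    }
    return count;
}

int main(void) {
    unsigned in_order[LANES], shuffled[LANES];
    for (int i = 0; i < LANES; ++i) {
        in_order[i] = 0x1000u + 4u * (unsigned)i;        /* lane i -> word i      */
        shuffled[i] = 0x1000u + 4u * (unsigned)(i ^ 1);  /* adjacent pairs swapped */
    }
    printf("in-order: %d naive request(s), %d unique line(s)\n",
           naive_requests(in_order, LANES), unique_lines(in_order, LANES));
    printf("shuffled: %d naive request(s), %d unique line(s)\n",
           naive_requests(shuffled, LANES), unique_lines(shuffled, LANES));
    return 0;
}
```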

    Atomics, although they could be presumed to be part of the L2 if it's being placed in the PIM.
    Any sufficient level of incoherent or poorly cache-aligned gather/scatter is going to inflate the amount of data going in and out versus the registers passed to the VM pipeline.
    Sufficiently complex filtering, given the automatic stride calculation and expansion of a few base addresses to multiple neighbors/LODs. Complex or new formats/compression techniques could switch out with the PIM base rather than fiddling with the CU.
    While we're at it, possibly any vector writes that do not completely align with and fully overwrite a cache line--given GCN's current write through and eviction policies.
    Writes in general? Just parity in the CU coupled with a read-only cache could allow ECC to be an HPC PIM function.

    Other elements need to be weighed as to whether they make sense in terms of bandwidth demand, whether they are generally useful to be tied to the memory any implementation needs, whether it makes sense for them to be duplicated, and the space premium for a PIM's base die.

    Still, putting the L2 at a remove over another connection makes the limited buffering capacity of the L1 less effective, and the costs of the connection would encourage some additional capacity. The shared instruction and scalar caches may need some extra capacity or backing store, given that vector memory handling is the main focus of GCN's latency hiding.

    One relative exception could be a video display engine or media processor, which might benefit from proximity in otherwise idle scenarios. That might point to a specific media/IO chiplet with a more modest buffer. Virtualization/security co-processors might do better separate.


    The natural tendency is to stripe addresses across channels to provide more uniform access behavior and higher bandwidth utilization. An EPYC-like MCM solution that keeps the usual physical association between L2 slices and their nearest channels would stripe a roughly equal number of accesses to local and remote L2s, as the sketch below illustrates.
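
    A minimal sketch of that striping, with made-up parameters (8 channels interleaved at 256-byte granularity, half of them local to the requesting die): a contiguous stream of cache-line accesses ends up split roughly evenly between local and remote channels.

```c
/* Minimal sketch of address striping across channels, with made-up
 * parameters: 8 channels interleaved at 256-byte granularity, channels
 * 0-3 "local" to the requesting die, 4-7 "remote". */
#include <stdio.h>

#define CHANNELS    8
#define STRIPE      256u   /* interleave granularity in bytes (assumed) */
#define LOCAL_CHANS 4      /* channels attached to the requesting die   */

static unsigned channel_of(unsigned long long addr) {
    return (unsigned)((addr / STRIPE) % CHANNELS);
}

int main(void) {
    unsigned long long local = 0, remote = 0;
    /* walk a 1 MiB contiguous buffer in 64-byte cache-line steps */
    for (unsigned long long addr = 0; addr < (1ull << 20); addr += 64) {
        if (channel_of(addr) < LOCAL_CHANS) ++local; else ++remote;
    }
    printf("local accesses:  %llu\nremote accesses: %llu\n", local, remote);
    return 0;
}
```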
     
  3. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    859
    Likes Received:
    262
    Well, thread migration sucks, always. There is no processor where this is free or even cheap.
    Threaded contention of the cache-lines mapping the same memory region sucks even more. There is only slow coherence over time, which is automatic, true, but very expensive.

    In the end, a multi-threaded core needs to be spoken to in a very different manner than a single core - even ignoring the change of the algorithm.

    You rightly say that this stuff is way worse for GPUs, even if you treat it as a multi-"core"/"rasterizer" GPU ... but only if you insist that the algorithm's optimal minimum should stay at the same spot - that is, you don't make different assumptions of optimality and you don't change the algorithm; you run it basically in emulation mode.
    Few software engineers do that. When the architecture offers you another, more optimal minimum, you rewrite your algorithms. The kernels of high-performing algorithms have permutations for even minor architecture changes.

    Who would want this transparent behaviour? And who would see a chance to solve the problem faster with the very same amount of traditional GPU resources distributed over more disparate "cores", even after compensating for some degree of coherency overhead? I would believe the latter fraction is larger.
     
    Jawed likes this.
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    So finally, in Vega, AMD made all graphics blocks clients of L2. AMD took its sweet time making this change. So, yes, it is clearly something you do cautiously and PIM with remote L2s adds another layer of complexity.

    We could ask why AMD has stayed away from this for so long when it's been a key part of NVidia's GPUs since, erm ... Fuzzy memory alert. A long time... Was NVidia's adoption of an L2 based architecture the fundamental reason it killed AMD with Maxwell onwards? Allowing NVidia to add fixed-function algorithms to the GPU underwritten by having paths to L2 with enough bandwidth across the entire chip. Distributed, load-balancing rasterisation (relies upon geometry flowing easily), delta colour compression, tiled rasterisation. All these things rely upon L2 being at the heart of the architecture, as far as I can tell.

    And performance per watt, despite the "disadvantages" (whatever they are) of L2 instead of dedicated paths and buffers seen in traditional GPUs, was significantly improved. L2 might be difficult, but done right it appears to have been a cornerstone of NVidia's architecture.

    Vega just looks like a half-broken first attempt with many "anomalies" in bandwidth-sensitive scenarios and awesomely bad performance per watt.

    I'm thinking about compute algorithms here, rather than graphics. The issue I see is that while it might be possible to have say 4:1 intra-PIM:extra-PIM, compute algorithms running outside of PIM (on ALUs which are too hot for a stack of memory) may not tolerate that. So taking Vega as an example, let's say the ROPs have 400GB/s to themselves and the rest of the chip has to share 100GB/s. That will surely hurt compute, but might be completely fine for graphics.
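
    Putting that split into rough numbers, under my own assumption of about 480 GB/s of total stack bandwidth and the framing above in which the PIM-resident clients and the external link divide the stack's total between them:

```c
/* Quick arithmetic for the scenario above, with an assumed ~480 GB/s of
 * total stack bandwidth: how the external share shrinks as the
 * intra-PIM : external ratio grows. */
#include <stdio.h>

int main(void) {
    const double total_gbs = 480.0;        /* assumed total stack bandwidth */
    const int    ratios[]  = { 2, 4, 8 };  /* intra-PIM : external          */
    for (int i = 0; i < 3; ++i) {
        int r = ratios[i];
        printf("%d:1 -> intra-PIM %.0f GB/s, external %.0f GB/s\n",
               r, total_gbs * r / (r + 1), total_gbs / (r + 1));
    }
    return 0;
}
```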

    I think you see this the same way, so we're in general agreement. I don't have any concrete arguments for or against the practicalities at any given ratio.

    One could observe that the logic die in a PIM actually has far greater bandwidth available to it from the stack of DRAM. In HBM the extra-module bandwidth is a function of the interface size, the page sizes, refresh timings, bank timings, burst sizes, turn-around latencies, buffer configurations and on and on and on. What if L2 inside PIM gets a 2x boost in effective bandwidth? Could it be even more? If so, then the extra-PIM bandwidth may be rather more generous than we're expecting.

    Those functions aren't latency sensitive though.

    ROPs do very little work on the fragment data they receive and spend very little time doing that work (deciding which fragments to write, blending/updating MSAA and depth/stencil).

    There's a queue on the input and that's it: you get a maximum fillrate defined by the pixel format and rasterisation rate. Depending on the format/content of the fragment data, the processing time will vary, but there are only so many combinations. There is a maximum rate at which fragments can arrive (since the rasteriser is the bottleneck), and for a lot of the time fragments will be arriving more slowly and erratically. For a game at given graphics settings it's common for the frame rate to vary by an order of magnitude.
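
    To put rough numbers on that ceiling, here is a small worked example with illustrative figures of my own (64 ROPs at 1.5 GHz, and the simplifying assumption of one pixel per ROP per clock, which real hardware does not sustain for the wider formats):

```c
/* Worked example: peak fill rate times bytes per pixel gives the memory
 * traffic the ROPs could demand at their ceiling, before blending reads.
 * Figures are illustrative, not a specific product's specification. */
#include <stdio.h>

int main(void) {
    const double rops       = 64.0;               /* illustrative ROP count */
    const double clock_ghz  = 1.5;                /* illustrative clock     */
    const double gpix_per_s = rops * clock_ghz;   /* peak fill rate, Gpix/s */

    /* assumes one pixel per ROP per clock for every format */
    const struct { const char *fmt; double bytes; } fmts[] = {
        { "RGBA8  ", 4.0 },
        { "RGBA16F", 8.0 },
    };
    for (int i = 0; i < 2; ++i) {
        double write_gbs = gpix_per_s * fmts[i].bytes;
        printf("%s: %.0f Gpix/s -> %.0f GB/s of writes (more with blending reads)\n",
               fmts[i].fmt, gpix_per_s, write_gbs);
    }
    return 0;
}
```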

    Obviously some parts of a frame are entirely fillrate limited, like shadow buffer rendering. But ROPs and the paths to them have been sized according to an overall balance which includes a compromise over the range of fragment:vertex ratios and available off-die bandwidth.

    Yes, precisely. L1s are more "cache by name" than "cache by nature". At least for TEX. But for writes there's more value. It's really hard to find anything really concrete.

    https://gpuopen.com/gdc2017-advanced-shader-programming-on-gcn/

    There's some slides early in that deck that talk about throughputs and lifetimes. But I can't find much to enrich our discussion from there.

    They're definitely allowed to be bigger...

    AMD has recently been increasing instruction cache sizes. Constant cache is another major performance enhancer...

    Naively, it's very tempting to characterise a chiplet approach as consisting of three types:
    • an interface chiplet - PCI Express, pixel output, video and audio engines, top-level hardware scheduling
    • shader engine chiplets - CU with TEX
    • PIMs - ROPs, BR
    To be honest it looks too simple and too complex at the same time. Too simple: ignoring practicalities, it seems to follow how GPUs are architected. Too complex: inter-chiplet bandwidth is non-trivial even with an interposer and PIM is a whole new manufacturing problem.

    So, apart from observing in this posting that PIM functions (ROPs, say) might actually gain bandwidth from being on the "right" side of the HBM interface (close to the memory stack, not close to the CUs), I'm afraid to say it still all seems like wishful thinking.
     
    Malo likes this.
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Sigh, seems I've gone full circle right back to the start of the thread. So, erm, literally no progress.
     
  6. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Maybe part of the issue compared to Nvidia comes back to how well implemented cache coherency is *shrug*; context here being for tiled rather than CPU-GPU.
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    It's a matter of degree.
    Thread migration is comparatively expensive, although for many general CPU workloads the CCX or thread migration penalties are minor, and for latency-sensitive games the penalties are measured in tens of percent. Individual access penalties could be in the hundreds of cycles.
    The GPU situation is the lack of migration, data corruption/instability, or penalties that can go hundreds of thousands of cycles or more.

    For the graphics domain elements in question, there is no coherence or correctness ever. Without going to a higher driver/OS function, they're either wrong or physically incapable of contending.

    At issue here is that CPU semantics have well-defined behaviors and rules for multiple processors. At the hardware level, there is an implementation of both single-processor and multi-processor function as parts of the hardware-enforced model presented to software and developers.

    Outside of the cluster that is SLI/Xfire, the GPU's graphics context presents an abstraction that hews more closely to the single entity model (albeit with a much larger unit context), with very poorly defined or missing behaviors for the multi-chip case in-hardware, or a punt to the driver/OS/developer levels for basic consistency resolution.

    These products are intended to be sold to a customer base that would have almost purely legacy software relative to an architecture that mandates explicit multi-adapter as the basis for any future improvement, and the reality of the graphics portion is that this is entrusting the architecture to the work ethic of the likes of EA and Ubisoft, or the average level of competence of software on Steam.

    Even if this could be relied upon, the architecture's multi-peer model would need to be well defined and communicated to the community at large. The CPU vendors gave lead times sometimes amounting to years for notable changes in topology or semantics, and at least within ISA families the rules are mostly consistent with differences that don't manifest without taking some low-level approaches around the established system utilities or libraries.

    Those who've seen what the industry does if there's even a remote amount of fragility, or who have software written prior to the introduction of said product, would want the transparent behaviour.
    Game production managers would prefer the former. Tim Sweeney has on a number of occasions cited productivity as the limiter, going so far as to state a willingness to trade performance for productivity in development.
    Getting even to a decent level of multi-core coding has taken decades at this point, with all those elements that make it as transparent as possible in architectures vastly more robust than GPUs have.

    I'd be interested to see AMD try at it, but at present I think the abstraction pushed by the GPU is too high-level, large, and complex for it to cleanly define multi-peer behavior or to make it even as approachable as CPU multi-threading.

    Fermi introduced the read/write cache hierarchy and L2. It didn't massively change Nvidia's client fortunes, but might have had long-term economic effects with compute. Maxwell's significant change seems to stem from Nvidia's pulling techniques from its mobile lines into the mainline, with bandwidth and power efficiency gains. Some of those techniques turned out to be early versions of DCC and tile-based rendering, which did heavily leverage the L2 in ways AMD did not prior to Vega.
    Other elements are probably a change in the relative level of physical and architectural optimization and iteration.

    For Nvidia, it does seem like many of those do rely heavily on the L2. Does geometry flow through the L2?
    With Vega, I'm not sure how much it uses the L2 specifically, as it is still possible that DCC is in a similar place for spill/fill from the ROP caches or in the path to the memory controller, and it's not clear if the bin heuristics in the Linux drivers reflect graphics-specific hardware limits versus the L2. The code seems concerned with shader engine and associated hardware limitations, while in theory L2 could be more flexible in relation to those units.

    Vega's relatively lower level of success in leveraging the techniques Nvidia has running through its L2 may mean some of AMD's methods don't use it the way Nvidia does. Perhaps if AMD did, some of its problems wouldn't be happening.

    It could be a half-attempt, given some of its anomalies seem to align with not disturbing the old paths.

    One thing I noted when comparing the earlier TOP-PIM concepts versus AMD's exascale chiplets is that there may have been a change in AMD's expectations of processing in memory.
    TOP-PIM posited up to 160 GB/s external bandwidth versus the 640 GB/s internal to the stack. That was the most optimistic ratio between external and internal bandwidth.
    AMD's exascale APU didn't seem to cite separate figures for external and internal bandwidth, and also seems to expect at most a doubling of external bandwidth from HBM2 in the next 5+ years.

    One possibility is that TOP-PIM's estimate may have been too generous on the top end, and that either way HBM2 or its next generation version would have been close enough at that point.
    Additionally, TOP-PIM is conceptually closer to Hybrid Memory Cube in what it proposed for the logic layer and memory interface. The latest version's external bandwidth is 480 GB/s in aggregate, but the DRAM stack can only be driven to 320 GB/s, which is perhaps ominously what TOP-PIM had as a pessimistic trendline.
    This also aligns with an Nvidia presentation on bandwidth scaling, where the current and near-future DRAM standards have so well-optimized interface power draw that DRAM array and device ceilings dominate and scale linearly with bandwidth.
    Even if a PIM could quadruple HBM2/3/4's external bandwidth internally, unless something is done about the DRAM itself there may be limited gains permitted in-stack because the arrays are saturated.
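
    Restating the figures cited above as a quick calculation (the numbers are those from the post, taken at face value):

```c
/* Recap of the bandwidth figures cited above: TOP-PIM's optimistic
 * internal:external ratio, and how HMC's aggregate link bandwidth
 * overshoots what its DRAM arrays can actually deliver. */
#include <stdio.h>

int main(void) {
    /* TOP-PIM figures as cited above */
    const double toppim_internal = 640.0, toppim_external = 160.0;   /* GB/s */
    printf("TOP-PIM internal:external = %.0f:1\n",
           toppim_internal / toppim_external);

    /* latest-generation HMC figures as cited above */
    const double hmc_links = 480.0, hmc_arrays = 320.0;              /* GB/s */
    printf("HMC links:arrays = %.1f:1 (%.0f GB/s of link bandwidth stranded)\n",
           hmc_links / hmc_arrays, hmc_links - hmc_arrays);
    return 0;
}
```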

    I was thinking of DICE's method for triangle culling in compute, which cited the risks of not quashing empty indirect draw calls (a sketch of the quashing idea follows below), since the command would be submitted to the command processor, which would then sit on it while reading the indirect draw count from memory. The whole GPU could not hide more than a handful of those.
    Graphics context state rolls are another event developers are told to avoid due to the small number of concurrent contexts and the latency it would incur to drain the pipeline and load another.
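
    For concreteness, here is a host-side sketch of the "quash empty indirect draws" idea. In the actual technique the compaction runs in a compute shader on the GPU; the argument layout below follows Vulkan's indexed indirect draw struct, which is simply my choice of API for illustration.

```c
/* Host-side illustration of quashing empty indirect draws. The layout
 * matches VkDrawIndexedIndirectCommand. Every empty entry that is NOT
 * filtered out still costs the command processor a round trip to memory
 * just to discover there is nothing to draw. */
#include <stdio.h>
#include <stdint.h>

typedef struct {            /* matches VkDrawIndexedIndirectCommand */
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
} DrawArgs;

/* keep only draws that will actually emit work; returns the new count */
static uint32_t compact_draws(DrawArgs *args, uint32_t n) {
    uint32_t out = 0;
    for (uint32_t i = 0; i < n; ++i)
        if (args[i].indexCount != 0 && args[i].instanceCount != 0)
            args[out++] = args[i];
    return out;
}

int main(void) {
    DrawArgs batch[4] = {
        { 3000, 1, 0,    0, 0 },
        {    0, 1, 3000, 0, 0 },   /* fully culled -> would stall the CP */
        { 1200, 1, 3000, 0, 0 },
        {    0, 1, 4200, 0, 0 },   /* fully culled -> would stall the CP */
    };
    uint32_t survivors = compact_draws(batch, 4);
    printf("submitting %u of 4 indirect draws\n", survivors);
    return 0;
}
```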

    I recall Vega reduced the maximum number of CUs that could share an instruction front end to allow for higher clocks. That's a modest increase per-CU in a maximal shader array, though I think the instruction cache is still the same size.

    Going by AMD's exascale concept, items like PCIe might be built into the interposer as well.
    The active interposer would significantly help with the area cost of going with chiplets, since the NOC would be mostly moved into the lower die.

    There could be ancillary blocks whose purposes aren't fully disclosed that might go in one or the other chiplet. For example, AMD's claims on various functions in GPUs like context switching, compression, or load balancing very frequently allow for the possibility of yet another processing block. It might be decided on a case-by-case basis, weighing each block by which hardware it is most tightly linked to.
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I remember L2 being key in the tessellation data flow.

    https://techreport.com/review/30328/amd-radeon-rx-480-graphics-card-reviewed/2

    So, not an increase in cache size, but a tweak to how the cache is used.
     
    CarstenS likes this.
  9. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    Actually, just like he said, they reduced the number of CUs that can share the same instruction & constant cache from 4 to 3, while the cache itself is the same size as before.
     
  10. _cat

    Newcomer

    Joined:
    Oct 29, 2015
    Messages:
    12
    Likes Received:
    8
    Yes, post-transform geometry stays on-chip in the L2.
    I am not allowed to post links yet, but I'll try:
    https://www.techpowerup.com/img/17-03-01/f34e39b49c7c.jpg
     
    Jawed, Lightman and pharma like this.
  11. _cat

    Newcomer

    Joined:
    Oct 29, 2015
    Messages:
    12
    Likes Received:
    8
    First, I'm a non-pro; most of you are pros.

    My concern is that GCN as a whole doesn't allow more than 4 shader engines, and because of that it will never have the front-end performance - mostly at geometry, world transform, culling etc. - that Nvidia has had since Maxwell or even Kepler.

    Some synthetic tests show that on Kepler the triangle rate with about 50% of triangles culled is nearly as high as the 100%-cull rate, while Tahiti only manages about +10% over its 0%-cull rate (miles away from the 100%-cull value).

    So if the initial performance for the vertices and the 3D scene is too low, or the culling performance is too low, or both take too long, the first result is added latency - and I would not want to add unnecessary latency with the very first action I take.
     
    Grall likes this.
  12. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    And there is a bit of emphasis on the cache coherency in that slide, which is why I wonder if AMD's solution is just not as flexible or efficient in this context.
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The Vega white paper has some interesting wording on the topic (from the primitive shader section):

    Vega maintains and enlarges the pre-existing parameter cache, but adds L2 caching. I'm not sure from the wording if this is an automatic spill/fill or an either/or proposition given that the driver is apparently making a decision on which one to use. Nvidia's slide seems to point towards just committing to one thing.
     
    AlBran and Grall like this.
  14. _cat

    Newcomer

    Joined:
    Oct 29, 2015
    Messages:
    12
    Likes Received:
    8
    *the geometry engine can now use the on-chip L2 cache*
    Maybe it's me, but that doesn't sound like it always does ... but it can.

    *graphics driver to choose the optimal path*

    All modern CUDA capable cards (Fermi architecture and later) have a fully coherent L2 Cache.
    http://supercomputingblog.com/cuda/cuda-memory-and-cache-architecture/
    Vega is AMD's first try.
     
    Jawed likes this.
  15. _cat

    Newcomer

    Joined:
    Oct 29, 2015
    Messages:
    12
    Likes Received:
    8
    Fermi launched 7 years ago; Nvidia has learned a bit in that time.
     
  16. pixeljetstream

    Newcomer

    Joined:
    Dec 7, 2013
    Messages:
    30
    Likes Received:
    60
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The part of the Nvidia presentation I was thinking of was the attribute and vertex shader portions running in the L1, then the next slide has triangles distributed via crossbar to the raster units.
     
  18. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,435
    Likes Received:
    263
  19. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,930
    Likes Received:
    1,626

    http://pixeljetstream.blogspot.de/2015/02/life-of-triangle-nvidias-logical.html

    Edit: @pixeljetstream, welcome and thanks for posting interesting info on your website. It also seems you have contributed quite a bit to GTC as well.
    http://on-demand-gtc.gputechconf.co...&sessionYear=&sessionFormat=&submit=&select=+
     
    #319 pharma, Dec 20, 2017
    Last edited: Dec 20, 2017
    CSI PC and Jawed like this.
  20. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA