AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Discussion in 'Architecture and Products' started by ToTTenTranz, Sep 20, 2016.

  1. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    84
    Likes Received:
    20

    Thank you.

But there is still the actual data we do have in the slides. Also, during the presentation, he references Infinity Fabric 2.0 and lets us know about the low latencies achieved, then later quietly drops that they also have unified memory, and hints that later in 2020 we will be seeing more of this fabric's abilities (xGMI, or layers of fabric on chiplet designs?). Additionally, I am sure there have been further advancements/refinements to the HBCC touching the fabric.

AMD's goal is heterogeneous computing. Unified memory (of any sort) is their goal.



    Back to gaming.
The problem with SLI/Crossfire in gaming was not the PCI Express bus and two video cards. It was that each GPU had its own memory pool. The games' issues came from alternate-frame rendering, because there was no unified memory between the GPUs.

Now, what if all the GPUs in a chiplet design (not an MCM) use the same memory pool, rather than having their own separate pools, at normal memory bandwidth and speeds? What if there is a dedicated fabric/substrate (perhaps even a unified L2 cache?) - then how is IF's bandwidth a problem? Yes, latencies matter, but they can also be tailored to the application.




theory: if you sandwiched two GPUs back to back with IF in the middle, arbitrating...
     
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,982
    Likes Received:
    2,428
    Location:
    Well within 3d
I didn't mention latencies; I was discussing bandwidth and correct behavior. GPUs typically can tolerate longer latency, although there are certain elements of the architectures that tolerate it less the further you go upstream from the pixel shaders.

    The latencies are likely low relative to the prior routing through PCIe and the host, which could have prohibitive amounts of latency. The presentation on MI50 gave a 60-70ns latency for a GPU to GPU link, but I have some questions about whether that is the whole story.
From https://gpuopen.com/gdc-2018-presentations/, there is a link to the slides for the Engine Optimization Hot Lap that discusses the latency of the GCN memory subsystem, where the L1 takes 114 cycles, the L2 takes 190, and a miss to attached memory is 350. For a chip operating in the Vega clock ranges, the xGMI latency is on the same order as the L1 cache. If the cache hierarchy is involved, an access would generally go through most of the miss path from L1 to L2 to memory, where the last stage may be routed to an external link.
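As a rough back-of-envelope (the 1.5-1.8 GHz clock range is an assumption for Vega-class parts, not a disclosed figure), the quoted link latency converts to roughly 90-125 cycles, i.e. in the same neighborhood as the L1:

```python
# Convert the quoted xGMI latency into core clocks at assumed Vega-class
# frequencies, and compare against the GCN cache latencies from the
# Engine Optimization Hot Lap slides (L1 = 114, L2 = 190, memory = 350 cycles).
for clock_ghz in (1.5, 1.8):          # assumed core clocks, not official numbers
    for xgmi_ns in (60, 70):          # GPU-to-GPU link latency from the MI50 talk
        cycles = xgmi_ns * clock_ghz  # ns * GHz = cycles
        print(f"{xgmi_ns} ns at {clock_ghz} GHz ~= {cycles:.0f} cycles "
              f"(vs. L1 hit 114, L2 hit 190, memory miss 350)")
```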
    I've made allowance elsewhere for Vega 20 introducing changes in order to support the physically disjoint memory, but one possible way this can be worked into the subsystem is if it's managed by the HBCC with data migration and/or load bypassing many parts of the hierarchy. That's somewhat fine for specific instances of compute but less so for less structured graphics contexts.

    In that case, there is a difference between unified memory addressing and coherent/consistent accesses to it. Enforcement of consistent data or control behavior tends to be a heavy cost and a source of bugs.

    With regards to the cache hierarchy, I await details on how it handles the unified memory. That's a separate question from how the units that do not fully participate in the cache hierarchy do.
    The fabric can speed up parts of the process, but cannot change the internals of what it connects to. Changes to those internals may be significant enough to be reserved for a more significant architectural change, and the more complicated graphics parts are what can prevent the transparent gaming MCM GPU.

    The compute part of that is the key. For HSA, unified memory for compute is not new and didn't require IF.
    It is a necessary part of an MCM graphics solution, but not sufficient on its own to meet the criteria RTG gave for MCM gaming.

    Why have more than one GPU on the same chip?

Each L2 slice is able to move 64B to and from the L1s per clock. For a chip like Vega 20, the aggregate is several times the memory bandwidth, and because GCN's architecture actively encourages L1 and L2 transfers, it's not something that can be reduced.
    The area for each L2 interface would be on the order of Vega 10's memory bus, twice over because it's duplex unlike a memory channel. Power-wise, it's potentially lower per bit if stacked, but there would be tens of thousands of signals.
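A rough sketch of that aggregate, assuming a Vega 20-like 16 L2 slices and a ~1.7 GHz clock (both assumptions used purely for illustration):

```python
# Back-of-envelope for aggregate L1<->L2 bandwidth vs. local HBM bandwidth.
# The 64 B/clk per slice per direction figure is from the text above; the
# slice count, clock, and ~1 TB/s HBM2 figure are assumed ballpark values.
slices = 16            # assumed L2 slice count for a Vega 20-class chip
bytes_per_clk = 64     # per slice, each direction
clock_hz = 1.7e9       # assumed core clock
hbm_bw = 1.0e12        # ~1 TB/s local HBM2 bandwidth (approximate)

l2_one_way = slices * bytes_per_clk * clock_hz       # ~1.7 TB/s each direction
print(f"L2 aggregate, one direction: {l2_one_way / 1e12:.1f} TB/s")
print(f"Both directions:             {2 * l2_one_way / 1e12:.1f} TB/s")
print(f"Ratio to local HBM:          {2 * l2_one_way / hbm_bw:.1f}x")
```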

    If you want accesses in the unified memory of multiple GPUs to not throttle heavily, the connection between GPUs must have a significant fraction of the local bandwidth of each GPU. For Nvidia's research in MCM compute, reducing it down to something like "only" 1/2 or 2/3 the bandwidth (significant fractions of a TB/s or more) needs significant package engineering, a new cache architecture, and an existing write-back hierarchy. Best results come with software being written to take this into account.
    Link bandwidth for xGMI 2.0 isn't up to that range yet, and that range was evaluated for the more readily optimized compute workloads.
     
    DmitryKo, pharma and DavidGraham like this.
  3. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    84
    Likes Received:
    20

    Thank you once again.

Yeah, that's fine for comparison, but no need to concern yourself with AMD's GCN, because it is not the future of AMD. And if they designed their new architecture so that (let's say) the L2 can be unified across as many GPUs/cores as are connected (or at some other level of the hierarchy), where is the problem you describe? (Or the caches themselves are much larger and hold more?)

I am visualizing something like several HBCCs within the memory hierarchy, mitigating at each level.



Additionally, you keep addressing bandwidth issues as if this were an edge-to-edge MCM. I do not consider AMD's "chiplet" design an MCM. It is a sandwich design and not necessarily an edge-to-edge design (i.e. die sitting next to die), so the interconnects are not necessarily edge to edge, but per layer (or plane). Again, two GPUs don't have to be side by side; they can be stacked on top of each other, sharing the same pinouts with the fabric in the middle (otherwise two mirrored GPUs would have to be made to get exact pinout alignment).

I don't know how far AMD has taken IF 2.0, but yes, I do believe you can get past "but cannot change the internals of what it connects to".

That idea is the whole point of AMD's chiplet/fabric design. It is not a dumb interconnect; it can manage its own data, interposing, etc., and have different layering of each.



For gaming, would you rather have 4096 cores @ 1,500 MHz, or 8192 cores @ 900 MHz?
     
    DmitryKo likes this.
  4. Rys

    Rys AMD RTG
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,129
    Likes Received:
    1,302
    Location:
    Beyond3D HQ
It's definitely an MCM. You don't have to redefine the terminology everyone has understood and used for a very long time now (well over 20 years in mass-market semiconductors).
     
  5. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    84
    Likes Received:
    20
    Yeah just like a Motorola MicroTAC is a smartphone...
     
  6. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    589
    Likes Received:
    386
    Location:
    55°38′33″ N, 37°28′37″ E
    Nvidia was testing both memory-bound and computation-bound workloads with their theoretical design - these were running non-realtime on their supercomputer simulator. The research paper cites these command-line CUDA workloads simply because these would be much easier to script and profile in this environment.

    All graphics tasks are still either compute-bound or memory bound (or both) - so profiling real games vs memory-bound CUDA tasks would not really change major design decisions about overcoming inefficiencies of far memory accesses.

    Yes, transparent NUMA access would require a redesign of the memory/cache controller to include write-back logic and an inter-processor cache coherency protocol. This would probably require breaking changes to the memory subsystem and register file, which would warrant a new microarchitecture and a different instruction set.

However, AMD should be in a much better position to successfully implement these changes - they have in-house engineers with several decades of experience on multi-socket x86 processors with a coherent inter-processor link (HyperTransport), they have a working PCIe 4.0/xGMI implementation in Vega 20, they have 128 PCIe lanes in the EPYC Milan (Zen 3) processor (and PCIe 4.0 too?!), and they lead several industry consortia that develop cache-coherent non-uniform memory access (ccNUMA) protocols such as Gen-Z, Cache Coherent Interconnect for Accelerators (CCIX), Coherent Accelerator Processor Interface (OpenCAPI), etc.

    ... and then in the real world, we have the "The AMD Execution Thread" saga.

Well, there were rumors that AMD diverted engineering resources from Vega to Navi, which should be either a completely new architecture or the last, heavily revised iteration of GCN. It would be a shame if these rumors landed in the "AMD Execution Thread" again...

    One more time. We are talking about multi-socket non-uniform memory access. You have a memory-bound application which needs to access this very slow far memory - so you have to design your caches and inter-die links to provide as much performance as you can within your external pin, die size, and thermal power budgets.

How exactly would profiling real-world games lead you to completely different engineering decisions compared to profiling memory-bound CUDA tasks?

    Yep, it's rather non-uniform memory access, where each GPU would access the other GPU's local memory through inter-processor links.

    Exactly.
     
    #5806 DmitryKo, Jan 5, 2019
    Last edited: Jan 13, 2019
    Malo likes this.
  7. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    59
    Likes Received:
    39
    I believe Wang explained it well here :
    https://www.pcgamesn.com/amd-navi-monolithic-gpu-design
     
  8. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    589
    Likes Received:
    386
    Location:
    55°38′33″ N, 37°28′37″ E
No. He said that game developers avoid coding for explicit multi-GPU - as obviously very few gamers have more than one video card, unlike professional/blockchain/AI users - and that given proper OS support, an MCM GPU with ccNUMA memory is very possible...

...which could be presented as a single video card to the developer, according to the Nvidia research papers above, and thus the chicken-and-egg problem would be solved.
     
  9. Rys

    Rys AMD RTG
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Messages:
    4,129
    Likes Received:
    1,302
    Location:
    Beyond3D HQ
    No, not like that.
     
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,982
    Likes Received:
    2,428
    Location:
    Well within 3d
    Going back to the original claim, it was that transparent multi-GPU was stopped solely by the lack of unified memory, and that--given what we know now--there's nothing technically blocking this.
    My response was that unified memory wasn't the only obstacle, and what we currently know has architectural edges that remain a problem.

    If the answer to the "what we know now" question is that something new fixes these concerns with unknown methods, then fine. However, one of the elements I was discussing was that there is a set of hardware elements in the graphics pipeline that do not engage with the L2--either in part or fully. Whatever the L2 does wouldn't matter to them, so there's some other assumption built into this scenario not being discussed.

Aside from that, just saying an L2 can be unified across an unbounded number of clients, with no timeline and no clear description of where they physically sit, is too fuzzy to know where the bottlenecks or problems may be.
    The most probable improvements for various elements like silicon process, packaging tech, and architecture have some limits in what we can expect in terms of scaling in the near and mid-term.

If you mean something like AMD's Rome, it's an MCM in the standard sense.
    The involvement of passive and then active interposers in some of the future proposals can be called 2.5D or 3D integration at least by vendors in order to differentiate them from the more standard MCMs that are placed on more conventional substrates, but aside from the marketing distinction they are modules with more than one chip.

    I don't know if I follow what you mean by edge to edge in this case.

Thermal problems and wire routing are an immediate concern. If you mean literally stacked one on top of the other, there are massive thermal issues and significant limitations in how to get enough wires to the internals for power and data. These chips already have a massive pinout, and whatever is in the stack gets pierced by very large and very disruptive TSVs.

    If the GPUs are just on opposite sides of a PCB, it's a significantly more complicated thermal and mounting environment, plus a much more complex layering and wire congestion problem getting enough signals and power into the region. For all that, such a solution does not appear to be much different from having them side by side.

    Regardless of any of this, getting signals off-die has an area and power cost, regardless of the exact choice in location of the other die. Closer typically means lower cost, but taking what is currently on-die and massive in terms of signals makes even small costs significant. A global L2 that claims to be as unified for multiple chips as it was within one die is going to be wrestling with wires climbing from tens of nm in size to hundreds/thousands of nm.

    If the fabric changed what it attached to, it would hamper AMD's goal of using it as a general glue to combine IP blocks without forcing either the fabric or the IP block to change when they are combined. For these purposes, it is just an interconnect and cannot make non-coherent hardware coherent.

    That seems like a hole too significant to extrapolate over. If a workload is too complex for them to evaluate absent real-time performance requirements, why is it certain that solutions that are declared as working without having complexity or responsiveness as part of their evaluation are sufficient?

    Graphics tasks can also be synchronization or fixed-function bound, among other things, but that aside I would have liked confirmation that other elements of their workloads could be matched to a game or graphics workload.
A graphics pipeline is a mixture of cooperative and hierarchical tasks, and one area where I would like some comparison is whether the cooperative thread array scheduling behavior they modeled fits. Did warp and kernel behaviors align in terms of their access patterns, dependences, and behaviors? Some of the compute workloads were characterized as being relatively clean in how they wrote to specific ranges, or as working well with a first-touch page allocation strategy.
Are graphics contexts as clean, or are there compute workloads with complex input/output behaviors that have been shown to be well handled?

Do the non-evaluated graphics workloads have similar locality behaviors? Can the behavior swing enough to hinder the L1.5 or upset the mix of local/remote accesses? Not finding significant problems with unrelated workloads is not comforting to me.

    Then there are architectural questions, such as with GCN. Does the remote-coherent memory hierarchy break elements like DCC? In that case, there would be graphics workloads that weren't limited by a resource in a monolithic context that become bottlenecked after. If there are scheduling methods or optimizations that avoid this, are they included in the CTA scheduling methods discussed?

    As a counterpoint, if the CTA scheduling was straightforward and transparent for the profiled workloads, how should I look at the explicitly exposed thread management of the geometry front end with task and mesh shaders? It might be nice if there were a profiled workload that would indicate that whatever problem those shaders address has been handled.
     
    mrcorbo and del42sa like this.
  11. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    59
    Likes Received:
    39
And you seem to fail to understand that there is a difference between HPC computing and running a 3D game. In HPC we already use multiple GPUs to run tasks in parallel, and it doesn't really matter whether those GPUs are on separate boards/PCIe or all placed on one PCB/interposer.
     
    #5811 del42sa, Jan 6, 2019
    Last edited: Jan 6, 2019
  12. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    589
    Likes Received:
    386
    Location:
    55°38′33″ N, 37°28′37″ E
    Lack of "shared memory" with "uniform memory access". Not "unified memory" or unified access.

Every part of the pipeline has to read from memory, so all data necessarily passes through the L2 cache - even fixed-function parts like hierarchical-Z compression, which probably have their own private caches not documented in the ISA document.

It's perfectly fine to apply generalisations learned over the past 35-40 years during which superscalar out-of-order microprocessors have been researched and marketed.

So far it's not even certain that ccNUMA support would require modifications to the instruction format, or a new microarchitecture to extend load-store micro-ops to the full 48-bit (or 52-bit) virtual memory space. Unless they really need a four-chip design with each chip addressing a 1TB SSD, I'd guess their current 40-bit PA architecture would be good enough for another iteration.
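For scale, a quick sketch of the address-space sizes those bit widths imply:

```python
# Address-space sizes implied by the bit widths mentioned above.
for bits in (40, 48, 52):
    print(f"{bits}-bit addressing -> {2**bits / 2**40:.0f} TiB")
# 40-bit physical addressing covers 1 TiB, which is roughly where a
# four-chip design with each chip fronting a ~1 TB SSD would run out.
```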

How would running a memory-bound task in real time be different from running it non-realtime in a simulator?

Chip developers do not really play commercial computer games in real time to verify their design decisions. They would rather work with ISVs to profile their applications, then recreate these workloads using scriptable non-realtime tasks. But honestly they don't even need to do that, since in-house programmers will provide enough guidance as to the expected bottlenecks. And they probably have enough Altera FPGA boards to test significant parts of the chip in a hardware simulator (still slow, but 'only' three orders of magnitude slower than the real thing) before the actual tape-out.

    https://rys.sommefeldt.com/post/semiconductors-from-idea-to-product/
    https://rys.sommefeldt.com/post/drivers/

    Both memory-intensive and compute-intensive workloads were evaluated - they just give more details on memory-intensive tests since this was the object of the study, and compute-intensive workloads show little variation with memory bandwidth.

    Yes, it would be nice, but unfortunately they've just stopped a little short of starting a comprehensive developer's guide to modern high-performance GPUs :)

Framebuffer and Z/W buffer access is one part that would be hard to optimize for locality. One strategy would be designating a 'master' GPU to hold a single local copy of the frame and Z/W buffers; another would be to let the graphics driver relocate physical memory pages to another GPU on demand, based on performance counters. A tile-based rendering approach also comes to mind.
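A very rough sketch of that counter-driven variant (purely illustrative; every name and threshold here is hypothetical, not an actual driver mechanism):

```python
# Hypothetical illustration of driver-side page migration in a ccNUMA MCM GPU:
# if a buffer page is touched mostly by a remote GPU, relocate it there.
MIGRATE_RATIO = 4  # remote accesses must outnumber local by this factor

def rebalance(pages):
    """pages: list of dicts with an 'owner' GPU id and per-GPU access counters."""
    for page in pages:
        counts = page["access_counts"]            # e.g. {0: 120, 1: 900}
        hottest = max(counts, key=counts.get)     # GPU touching the page most
        if hottest != page["owner"] and counts[hottest] > MIGRATE_RATIO * counts[page["owner"]]:
            page["owner"] = hottest               # driver relocates the physical page
        for gpu in counts:                        # decay counters each interval
            counts[gpu] //= 2

pages = [{"owner": 0, "access_counts": {0: 120, 1: 900}}]
rebalance(pages)
print(pages[0]["owner"])  # -> 1
```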
     
  13. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    589
    Likes Received:
    386
    Location:
    55°38′33″ N, 37°28′37″ E
    Is that your answer to my question: "how would you optimize ccNUMA differently"?
     
  14. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    11,052
    Likes Received:
    6,733
    Location:
    Cleveland
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,982
    Likes Received:
    2,428
    Location:
    Well within 3d
    DCC's hardware path was documented as not going through the L2. The ROP caches spill into the L2, but for non-ROP clients to read that data there's a heavy synchronization event. The command processor needs to be stalled, and the ROP caches and potentially the L2 flushed. Vega is documented as still potentially needing an L2 flush in certain configurations, if there is some kind of alignment mismatch between the RBEs and L2. This potentially is related to some of the smaller variants of the design.
    Vega's geometry engine was ambiguously described as having driver-selected L2-backed streamout and/or the more traditional parameter cache.
    Other elements of the graphics pipeline are incompletely described in how they interact.
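A rough sketch of the ordering the driver has to enforce for that case (the function names are hypothetical stand-ins, not real PM4 packets or driver calls):

```python
# Illustrative only: the kind of sequence needed before a non-ROP client
# (e.g. a texture read) can safely consume data the ROPs just wrote.
def make_render_target_readable(cmd_buf, render_target):
    cmd_buf.wait_for_idle()            # stall the command processor / pipeline
    cmd_buf.flush_rop_caches()         # spill the color/depth caches into the L2
    if not render_target.rbe_l2_aligned:
        cmd_buf.flush_l2()             # Vega may still need this in some configurations
    cmd_buf.invalidate_vector_l1()     # ensure the CUs don't read stale lines
```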

DCC is the most significantly separate data path disclosed. The others may use the L2, but that isn't sufficient to say they are coherent. The CU's vector memory path is considered coherent because the L1 is inclusive of the L2 and will write through immediately. Other clients, like the ROPs, have different behaviors and require heavier flush events and stalls before their data can be accessed safely.
    These are not details a fabric like IF can change. The fabric can carry data and probes, but cannot force clients to generate them consistently.

    At present the hardware subject to these restrictions can monitor for status updates from their local pipeline elements for completion time. Historically, that did not include monitoring the readiness of hardware pipelines not directly wired to them.

    That history is after the functionality was consolidated onto one die. I was responding to a claim ambiguous as to whether this upscaled GCN L2 would be off-chip, since some of the statements made seemed to indicate they could be.
    A straightforward extrapolation of the relationship between on-die connections and solder balls is the expectation that data paths expand by several orders of magnitude in size and power, and this is before including the claim that it would then be used to connect to multiple other GPUs with just as many clients.


    Were there compute workloads profiled that had response time or QoS requirements comparable to a graphics workload? It may be possible to extrapolate partially if the testing framework kept track of the perceived timing within the simulation to determine it. It's still a jump saying that if an MCM can support an interactive workload on a characterized compute path N, it can work well with undefined hardware graphics path X.

There can be variations in the reasons why a workload is found to be bandwidth or ALU intensive. A large and uniform compute workload can have predictable behaviors and linear fetch behaviors. This can allow it to be tuned to fit within the limits of the L1.5's capacity, and can allow for wavefronts to be directed to launch where it is known they have the most locality, or to adjust their behavior to fit the locale they find themselves in.

    Is there a similar level of control over the access patterns of a given graphics wavefront to do so as accurately? The GPU architectures historically dedicated a lot of hardware to provide consistent access at high bandwidth for a significant amount of uncorrelated behavior. That created an assumption about how important placement in the GPU was relative to other resources, and for much of the architecture it wasn't important. A CU sees a very uniform environment, and a lot of the tiling behaviors and data paths strive to distribute work to take advantage of uniformity that an MCM no longer has.
     
    #5815 3dilettante, Jan 9, 2019
    Last edited: Jan 9, 2019
    AlBran likes this.
  16. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,306
    Likes Received:
    3,963
    Honest question: why aren't we discussing Radeon VII based on Vega 20 on a thread that says "(..) Vega 20 rumors and discussion"?

    Radeon VII is most probably the last Vega GPU from AMD, so if we aren't talking about that GPU here, what's the point of this thread?
    Is it to trim thread size?
     
    pharma likes this.
  17. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    11,052
    Likes Received:
    6,733
    Location:
    Cleveland
Because this one is more about rumors, and now that it's announced, it's not a rumor?

    But mostly because multi-year mega threads suck.
     
    yuri, pharma and ToTTenTranz like this.
  18. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    589
    Likes Received:
    386
    Location:
    55°38′33″ N, 37°28′37″ E
Well, an L1/L1.5/L2 hierarchy is just one of the potential designs.

According to the Vega architecture whitepaper and Vega ISA document, the HBCC is already a full-blown crossbar controller, with HBM and NVRAM memory interfaces, virtual memory paging support, and an Infinity Fabric interface to the local L2 caches.

So they might be able to move the memory controller to a separate die, connected to the L2 caches in each respective GPU chiplet through Infinity Fabric. This would require roughly the same number of chiplets in a four-GPU configuration - 1 memory/IO controller, 4 HBM2/HBM3, and 4 GPU dies. This way the L2 remains on each GPU, and the crossbar/memory controller may implement advanced coherency protocols and/or additional cache memory to mitigate possible stalls.

    Or they might be able to design sufficiently fast Infinity Fabric connections so that each separate memory controller acts like a memory channel in a global distributed crossbar/memory controller.


Let's consider the 9-die setup above (with 4 HBM2 modules, 1 memory/IO die, and 4 GPU dies). Would it really require "orders of magnitude" more in comparison to the existing Vega design?

Let's just conclude that these results are not convincing enough for you. We wouldn't really know whether these details would result in substantially different real-world performance without access to Nvidia's library of commercial-grade building blocks and expensive simulation hardware, let alone the ability to tape out a working prototype.

    Here is the full list of memory-intensive workloads Nvidia used for the two research papers above and their memory footprints (from tables 4 and 2 of respective publications).
    I fully trust you to devote sufficient time for thorough analysis of these workloads; please come back to us with complete detailed results. :twisted:
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    7,982
    Likes Received:
    2,428
    Location:
    Well within 3d
    If the discussion is about hardware that doesn't really use the caches in a consistent fashion anyway, changes to cache hierarchy don't affect them.
    The research opted to evaluate workloads and architectures that do not use those elements.

    This is outside the GPU's core area, and the IF interface isolates the sides from each other. It is designed to track references and perform data moves on occasional faults, but it doesn't implement a cache coherence protocol since it works in response to outbound accesses from the L2.

    The L2 itself is not designed to be snooped, so it wouldn't be a peer in a basic coherence protocol. If a client doesn't perform many functions expected of a coherent client, it won't perform them just because it is plugged into a bigger interconnect.
In this chiplet architecture, the IO die has 4 HBM interfaces and at least 4 links to the satellite GPU chiplets. The IO die would be large, with two sides dominated by HBM stacks and only two other sides for GPU links. Going by the presentation on Rome, a lot of the perimeter of a large IO die is taken by IF links, even though the DDR system bandwidth is 5-10X below the proposed HBM system bandwidth. Any solution trying to leverage those links needs to adjust for a load an order of magnitude larger.
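As a rough illustration of that gap (the channel and stack figures are approximate ballpark values, not official numbers):

```python
# Approximate comparison behind the "5-10X" remark: an 8-channel DDR4-3200
# system (Rome-class) vs. four HBM2 stacks at Vega 20-class speeds.
ddr_channels, ddr_bw_per_ch = 8, 25.6e9      # DDR4-3200: ~25.6 GB/s per channel
hbm_stacks, hbm_bw_per_stack = 4, 256e9      # ~256 GB/s per HBM2 stack

ddr_total = ddr_channels * ddr_bw_per_ch     # ~205 GB/s
hbm_total = hbm_stacks * hbm_bw_per_stack    # ~1 TB/s
print(f"DDR:  {ddr_total / 1e9:.0f} GB/s")
print(f"HBM2: {hbm_total / 1e9:.0f} GB/s ({hbm_total / ddr_total:.1f}x)")
```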

    Power-wise, the off-die links will consume additional power--tens of watts just to move from the IO die to the GPU die in addition to the existing HBM's power cost. There isn't discussion about how all the graphics pipeline events would be handled, since many were direct signals or internal buffer updates on-die with no corresponding memory or IF messaging, and the individual L2s would be incoherent with each other in a way that GCN does not allow architecturally.


    I was addressing an ambiguous design scenario where the L2 was possibly off-die while still serving the same role as it does when adjacent to the CUs. That's potentially 16 or more L2 slices per GPU, with each capable of 64B in each direction, with 4 or more GPUs apparently able to plug into it. Package or interposer links have their data paths in the hundreds or tens of nm, and ubumps have 40-50 microns for their pitch. 64K interposer traces, TSVs, or ubumps is more pad space than GPU.
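Roughly where a ~64K-signal figure comes from, counting data bits only (slice count as assumed earlier; clocks, control, ECC, power and ground would all add to it):

```python
# Signal count for an off-die L2 interface that keeps 64 B/clk per slice in
# each direction, shared by several attached GPUs, plus the raw bump area
# if every data bit needs a microbump at ~45 um pitch.
slices = 16                   # assumed L2 slices per GPU
bytes_per_clk = 64            # per slice, each direction
directions = 2                # duplex, unlike a memory channel
gpus = 4

wires_per_gpu = slices * bytes_per_clk * 8 * directions   # 16384 data signals
total_wires = wires_per_gpu * gpus                        # ~64K for 4 GPUs
pitch_mm = 45e-3                                          # ~45 um ubump pitch
bump_area_mm2 = total_wires * pitch_mm**2

print(f"{wires_per_gpu} signals per GPU link, {total_wires} total")
print(f"~{bump_area_mm2:.0f} mm^2 of bump field for the data bits alone")
```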

I think they pointedly avoided evaluating graphics workloads. Since they work for a vendor whose business is still heavily invested in pixel pushing, not mentioning pixels once seems like it wasn't an accident.

I am not familiar with many of those workloads, though some of the HPC loads and the initial entries I ran through Google may point to well-structured problems. Neural net training is treated at a hardware level like a lot of matrix multiplication. Stream and HPC workloads often focus on matrix math. The bitcoin-type entry is a lot of very homogeneous hashing. Which ones share which behaviors with the graphics pipeline is unclear to me. My position is that if they don't mention graphics as being applicable, I am not inclined to make the leap for them.
     
    Silent_Buddha and 3dcgi like this.
  20. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    589
    Likes Received:
    386
    Location:
    55°38′33″ N, 37°28′37″ E
The point is, larger caches and/or faster links with improved protocols may allow further unification - as happened in Vega 10 for the pixel and geometry engines, which were connected to the L2 cache.

A matter of implementation, then - and AMD are experts in cache-coherent multiprocessor systems.

But how would this be different from the current architecture?

The Vega whitepaper and Vega ISA document imply that the L2 is split between separate memory controllers/channels and interconnected through the IF/crossbar.

    What if we scale down the package to 2 HBM and 2 GPU dies and a 7nm respin of the IO die?

    Could lower voltage logic maintain signal integrity?

    Let's agree to disagree.
    I'd fully expect them to have full access to a comprehensive suite of proprietary tests used on the same simulator hardware during the course of actual GPU development. Just like I'd expect them to withhold that information from the public.
    And no, I wasn't really expecting you to actually inspect these workloads.
     