AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

w0lfram · Jan 1, 2019

3dilettante said:
Aside from that, there are decisions that have to be made with regards to what direction to devote sufficient resources, unless the argument is that it's merely a matter of devoting sufficient resources to every direction they could possibly go in.
The success or market adoption are not guaranteed, and can also leave a workable direction with sufficient resources abandoned.

AMD thus far seems to be open to a direction that is conceptually more easily implemented and lower-risk direction, and hasn't fleshed out future prospects as fully as some competing proposals. Those competing proposals have laid out more architectural changes than AMD has thought up, and yet more would be needed to go as far as AMD says they need to for MCM graphics in gaming.

That's one of many barriers to adoption. Being able to address memory locations consistently doesn't resolve the need for sufficient bandwidth to/from them, or the complexities of parts of the architecture that do not view memory in the same manner as the more straightforward vector memory path.

Infinity Fabric is an extension of coherent Hypertransport, and the way the data fabric functions includes probes and responses meant to support the kinds of shared multiprocessor cache operations of x86. As of right now, GCN caches are far more primitive and seem to only use part of the capability. There are a number of possibilities that fall short of full participation in the memory space.
The other parts of the GPU system that do not play well with the primary memory path are not helped by IF. It is a coherent fabric meant to connect coherent clients, it cannot make non-coherent clients coherent. The fabric also doesn't make the load balancing and control/data hazards of multi-GPU go away.

The HBCC with its translation and migration hardware is a coarser way of interacting with memory. One thing I do recall at least early on is that while it was more automatic in its management of resources, at least early on the driver-level management of NVLink page migration had significantly higher bandwidth. I have not seen a comparison in more modern times.
HBCC may be necessary for the kind of coarse workloads GPUs tend to be given, yet it's something not cited as being necessary on the CPU architectures that have had scalable coherent fabrics for a decade or more. The CPU domain is apparently considered robust enough to interface more directly, whereas the other domain is trying to automate a fair amount of special-handling without changing the architecture that needs hand-holding.

There may be changes in the xGMI-equipped Vega that have not been disclosed. Absent those changes, going by what we know of the GCN architecture significant portions of its memory hierarchy do not work well with non-unified memory. The GPU caches depend on an L2 for managing visibility that architecturally does not work with memory channels it is not directly linked to, and whose model doesn't work if more than one cache can hold the same data.
Elements of the graphics domain are also uncertain. DCC is incoherent within its own memory hierarchy, the Vega whitepaper discusses the geometry engine being able to stream out data using the L2, but in concert with a traditional and incoherent parameter cache. The ROPs are L2 clients in Vega, but they treat the L2 as a spill path that inconsistently receives updates from them.

Thank you.

But still there is the actual data we do have in the slides. Also, during the presentation, where he is referencing Infinity Fabric 2.0 and lets us know about achieved low latencies, then later on quietly lets it drop that they also have unified memory. That later in 2020 hinted that We will be seeing more of this Fabric's abilities in the future.. (xGMI or layers of fabric on chiplet designs...?) Additionally, I am sure there has been further advancements/refinements to the HBCC touching the fabric.

AMD's goal is heterogenous computing. Unified memory (of any sort) is their goal.

Back to gaming.
The problem with SLI/Crossfire in gaming was not the PCI Express bus & 2 video cards. It was that each GPU had it's own memory pool.... Thus, the Game's issues/problems was alter-frame-rendering, because there was no unified memory between GPUs.

Now, if all the GPUs on a chiplet (not MCM) use the same memory pool (not having their own separate pool) at normal memory bandwidth & speeds. What about if there is a dedicated fabric/substrate (perhaps even unified L2 cache.??) then how is IF's bandwidth a problem. Yes, latencies matter, but can also be tailored to application.

theory: if you sandwiched two GPU's back to back with IF in the middle, arbitrating...

3dilettante · Jan 1, 2019

w0lfram said:
But still there is the actual data we do have in the slides. Also, during the presentation, where he is referencing Infinity Fabric 2.0 and lets us know about achieved low latencies, then later on quietly lets it drop that they also have unified memory.

I didn't mention latencies, I was discussing bandwidth and correct behavior. GPUs typically can tolerate longer latency, although there are certain elements of the architectures that do not the further you go upstream from the pixel shaders.

The latencies are likely low relative to the prior routing through PCIe and the host, which could have prohibitive amounts of latency. The presentation on MI50 gave a 60-70ns latency for a GPU to GPU link, but I have some questions about whether that is the whole story.
From https://gpuopen.com/gdc-2018-presentations/, there is a link to the slides for the Engine Optimization Hot Lap that discusses the latency of the GCN memory subsystem, where the L1 takes 114 cycles, the L2 takes 190, and a miss to atttached memory is 350. For a chip operating in the Vega clock ranges, the xGMI latency is on the same order as the L1 cache. If the cache hierarchy is involved, an access would generally go through most of the miss path from L1 to L2 to memory, the last stage may be routed to an external link.
I've made allowance elsewhere for Vega 20 introducing changes in order to support the physically disjoint memory, but one possible way this can be worked into the subsystem is if it's managed by the HBCC with data migration and/or load bypassing many parts of the hierarchy. That's somewhat fine for specific instances of compute but less so for less structured graphics contexts.

In that case, there is a difference between unified memory addressing and coherent/consistent accesses to it. Enforcement of consistent data or control behavior tends to be a heavy cost and a source of bugs.

With regards to the cache hierarchy, I await details on how it handles the unified memory. That's a separate question from how the units that do not fully participate in the cache hierarchy do.
The fabric can speed up parts of the process, but cannot change the internals of what it connects to. Changes to those internals may be significant enough to be reserved for a more significant architectural change, and the more complicated graphics parts are what can prevent the transparent gaming MCM GPU.

AMD's goal is heterogenous computing. Unified memory (of any sort) is their goal.

The compute part of that is the key. For HSA, unified memory for compute is not new and didn't require IF.
It is a necessary part of an MCM graphics solution, but not sufficient on its own to meet the criteria RTG gave for MCM gaming.

Now, if all the GPUs on a chiplet (not MCM) use the same memory pool (not having their own separate pool) at normal memory bandwidth & speeds.

Why have more than one GPU on the same chip?

What about if there is a dedicated fabric/substrate (perhaps even unified L2 cache.??)

Each L2 slice is able to move 64B to and from the L1s per clock. For a chip like Vega 20 it's several times the memory bandwidth, and because GCN's architecture actively encourages L1 and L2 transfers it's not something that can be reduced.
The area for each L2 interface would be on the order of Vega 10's memory bus, twice over because it's duplex unlike a memory channel. Power-wise, it's potentially lower per bit if stacked, but there would be tens of thousands of signals.

then how is IF's bandwidth a problem. Yes, latencies matter, but can also be tailored to application.

If you want accesses in the unified memory of multiple GPUs to not throttle heavily, the connection between GPUs must have a significant fraction of the local bandwidth of each GPU. For Nvidia's research in MCM compute, reducing it down to something like "only" 1/2 or 2/3 the bandwidth (significant fractions of a TB/s or more) needs significant package engineering, a new cache architecture, and an existing write-back hierarchy. Best results come with software being written to take this into account.
Link bandwidth for xGMI 2.0 isn't up to that range yet, and that range was evaluated for the more readily optimized compute workloads.

w0lfram · Jan 5, 2019

3dilettante said:
I didn't mention latencies, I was discussing bandwidth and correct behavior. GPUs typically can tolerate longer latency, although there are certain elements of the architectures that do not the further you go upstream from the pixel shaders.

The latencies are likely low relative to the prior routing through PCIe and the host, which could have prohibitive amounts of latency. The presentation on MI50 gave a 60-70ns latency for a GPU to GPU link, but I have some questions about whether that is the whole story.

From https://gpuopen.com/gdc-2018-presentations/, there is a link to the slides for the Engine Optimization Hot Lap that discusses the latency of the GCN memory subsystem, where the L1 takes 114 cycles, the L2 takes 190, and a miss to atttached memory is 350. For a chip operating in the Vega clock ranges, the xGMI latency is on the same order as the L1 cache. If the cache hierarchy is involved, an access would generally go through most of the miss path from L1 to L2 to memory, the last stage may be routed to an external link.
I've made allowance elsewhere for Vega 20 introducing changes in order to support the physically disjoint memory, but one possible way this can be worked into the subsystem is if it's managed by the HBCC with data migration and/or load bypassing many parts of the hierarchy. That's somewhat fine for specific instances of compute but less so for less structured graphics contexts.

In that case, there is a difference between unified memory addressing and coherent/consistent accesses to it. Enforcement of consistent data or control behavior tends to be a heavy cost and a source of bugs.

With regards to the cache hierarchy, I await details on how it handles the unified memory. That's a separate question from how the units that do not fully participate in the cache hierarchy do.
The fabric can speed up parts of the process, but cannot change the internals of what it connects to. Changes to those internals may be significant enough to be reserved for a more significant architectural change, and the more complicated graphics parts are what can prevent the transparent gaming MCM GPU.

The compute part of that is the key. For HSA, unified memory for compute is not new and didn't require IF.
It is a necessary part of an MCM graphics solution, but not sufficient on its own to meet the criteria RTG gave for MCM gaming.

Why have more than one GPU on the same chip?

Each L2 slice is able to move 64B to and from the L1s per clock. For a chip like Vega 20 it's several times the memory bandwidth, and because GCN's architecture actively encourages L1 and L2 transfers it's not something that can be reduced.
The area for each L2 interface would be on the order of Vega 10's memory bus, twice over because it's duplex unlike a memory channel. Power-wise, it's potentially lower per bit if stacked, but there would be tens of thousands of signals.

If you want accesses in the unified memory of multiple GPUs to not throttle heavily, the connection between GPUs must have a significant fraction of the local bandwidth of each GPU. For Nvidia's research in MCM compute, reducing it down to something like "only" 1/2 or 2/3 the bandwidth (significant fractions of a TB/s or more) needs significant package engineering, a new cache architecture, and an existing write-back hierarchy. Best results come with software being written to take this into account.
Link bandwidth for xGMI 2.0 isn't up to that range yet, and that range was evaluated for the more readily optimized compute workloads.

Thank you once again.

Yeah for comparison, but no need to concern yourself with AMD's CGN, because it is not the future of AMD. And.. if they designed their new architecture so that (lets say) that the L2 can be unified with as many GPU/cores as connected (Or at some other hierarchy), where is the problem you describe..? (Or caches themselves are much larger and hold more?)

I am visualizing something perhaps as several HBCC's.. within the memory hierarchy, mitigating at each level.

Additionally, you keep addressing bandwidth issues as edge-to-edge MCM. I do not consider AMD's "chiplet" design as MCM. It is a sandwich design & not necessarily an edge-to-edge design (ie wafer sitting next to wafer.) So the interconnects are not necessarily edge to edge, but.... per layer (or plane). Again, two GPU don't have to be side-by-side, but stacked on top of each other... sharing the same pinouts with fabric in middle... (otherwise two mirror GPU would have to be made to have exact pinout alignment.)

I don't know how far AMD has taken IF2.0, but YES... I do b believe you can have - "but cannot change the internals of what it connects to" -

That^^ idea
Is the whole point of AMD's chiplet/fabric design. It is not a dumb interconnect, it can manage it's own data, and interposing, etc. And have different layering of each.

For gaming, would you rather have 4096 cores @ 1,500MHz- Or, have 8192 cores @ 900MHz... ?

Rys · Jan 5, 2019

w0lfram said:
I do not consider AMD's "chiplet" design as MCM.

It's definitely a MCM. You don't have to redefine the terminology everyone has understood and used for a very long time now (well over 20 years in mass-market semiconductors).

w0lfram · Jan 5, 2019

Yeah just like a Motorola MicroTAC is a smartphone...

DmitryKo · Jan 5, 2019

3dilettante said:
Past discussions centered on research into MCM GPUs running compute workloads, which is consistent with the RTG interview mentioning that compute can more readily benefit. Which graphics workloads were evaluated?

Nvidia was testing both memory-bound and computation-bound workloads with their theoretical design - these were running non-realtime on their supercomputer simulator. The research paper cites these command-line CUDA workloads simply because these would be much easier to script and profile in this environment.

All graphics tasks are still either compute-bound or memory bound (or both) - so profiling real games vs memory-bound CUDA tasks would not really change major design decisions about overcoming inefficiencies of far memory accesses.

There are additional architectural changes made to make parts of the hierarchy aware of the non-unified memory subsystem, and some leverage behaviors unique to Nvidia's execution model, such as the synchronization method for the remote L1.5 caches, which synchronize at kernel execution boundaries that also synchronize the local L1. GCN's write-through behavior is very different and would make Nvidia's choice significantly less effective.

3dilettante said:
Infinity Fabric is an extension of coherent Hypertransport, and the way the data fabric functions includes probes and responses meant to support the kinds of shared multiprocessor cache operations of x86. As of right now, GCN caches are far more primitive and seem to only use part of the capability. There are a number of possibilities that fall short of full participation in the memory space.

w0lfram said:
if they designed their new architecture so that (lets say) that the L2 can be unified with as many GPU/cores as connected (Or at some other hierarchy), where is the problem you describe..?

Yes, transparent NUMA access would require a redesign of the memory/cache controller to include write-back logic and an inter-processor cache coherency protocol. This would probably require breaking changes to the memory subsystem and register file, which would warrant a new microarchitecture and a different instruction set.

However AMD should be in a much better position for successfully implementing these changes - they have in-house engineers working for several decades on multi-socket x86 processors with a coherent inter-processor link (HyperTransport), they have a working PCIe 4.0/xGMI implementation in Vega 20, they have 128 PCIe links in the EPYC Milan (ZEN3) processor (and PCIe 4.0 too?!), and they lead several industry consortiums that develop cache-coherent non-uniform memory access (ccNUMA) protocols such as Gen-Z, Cache Coherent Interconnect for Accelerators (CCIX), Coherent Accelerator Processor Interface (OpenCAPI) etc...

... and then in the real world, we have the "The AMD Execution Thread" saga.

Rootax said:
not RTG motto for years now, it seems...

Well, there were rumors that AND diverted engineering resources from Vega to Navi, which should be either a completely new architecture or the last, heavily revised iteration of GCN . Would be a shame if these rumors would land in "AMD Execution Thread" again...

del42sa said:
But what type of workloads they did test ? HPC type and compute workloads which is in line with Wang's own words about MCM GPU.
But games and game API's is another story....
I didn't notice, that any of these studies tested such MCM approach in games.s

One more time. We are talking about multi-socket non-uniform memory access. You have a memory-bound application which needs to access this very slow far memory - so you have to design your caches and inter-die links to provide as much performance as you can within your external pin, die size, and thermal power budgets.

How exactly profiling real-world games would lead you to completely different engineering decisions comparing to profiling memory-bound CUDA tasks?

w0lfram said:
Unified memory means that both GPU use & share the same memory pool. Until now, no uarch was able to maintain that claim

Yep, it's rather non-uniform memory access, where each GPU would access the other GPU's local memory through inter-processor links.

Kaotik said:
Other than speed (which inter-chip IF won't fix either I think), how exactly NVLink can't maintain that claim? It offers cache coherent shared memory pool between 2 or more GPUs and even a CPU if you want

Exactly.

del42sa · Jan 5, 2019

DmitryKo said:
One more time. We are talking about multi-socket non-uniform memory access. You have a memory-bound application which needs to access this very slow far memory - so you have to design your caches and inter-die links to provide as much performance as you can within your external pin, die size, and thermal power budgets.

How exactly profiling real-world games would lead you to completely different engineering decisions comparing to profiling memory-bound CUDA tasks?

I believe Wang explained it well here :
https://www.pcgamesn.com/amd-navi-monolithic-gpu-design

DmitryKo · Jan 5, 2019

del42sa said:
I believe Wang explained it well here

No. He said that game developers avoid coding for explicit multi-GPU - as obviously very few gamers have more than one video card, unlike professional/blockchain/AI users - and that given proper OS support, a MCM-GPU with ccNUMA memory is very possible...

...which could be presented as a single video card to the developer, according to Nvidia research papers above, and thus the egg-chicken problem would be solved.

Rys · Jan 5, 2019

w0lfram said:
Yeah just like a Motorola MicroTAC is a smartphone...

No, not like that.

3dilettante · Jan 6, 2019

w0lfram said:
Yeah for comparison, but no need to concern yourself with AMD's CGN, because it is not the future of AMD.

Going back to the original claim, it was that transparent multi-GPU was stopped solely by the lack of unified memory, and that--given what we know now--there's nothing technically blocking this.
My response was that unified memory wasn't the only obstacle, and what we currently know has architectural edges that remain a problem.

And.. if they designed their new architecture so that (lets say) that the L2 can be unified with as many GPU/cores as connected (Or at some other hierarchy), where is the problem you describe..?

If the answer to the "what we know now" question is that something new fixes these concerns with unknown methods, then fine. However, one of the elements I was discussing was that there is a set of hardware elements in the graphics pipeline that do not engage with the L2--either in part or fully. Whatever the L2 does wouldn't matter to them, so there's some other assumption built into this scenario not being discussed.

Aside from that, just saying an L2 can be unified with an unbounded number of clients, no timeline, and without a clear description of where they are physically is too fuzzy to know where the bottlenecks or problems may be.
The most probable improvements for various elements like silicon process, packaging tech, and architecture have some limits in what we can expect in terms of scaling in the near and mid-term.

I do not consider AMD's "chiplet" design as MCM.

If you mean like AMD's Rome. It's an MCM in the standard sense.
The involvement of passive and then active interposers in some of the future proposals can be called 2.5D or 3D integration at least by vendors in order to differentiate them from the more standard MCMs that are placed on more conventional substrates, but aside from the marketing distinction they are modules with more than one chip.

It is a sandwich design & not necessarily an edge-to-edge design (ie wafer sitting next to wafer.) So the interconnects are not necessarily edge to edge, but.... per layer (or plane).

I don't know if I follow what you mean by edge to edge in this case.

Again, two GPU don't have to be side-by-side, but stacked on top of each other... sharing the same pinouts with fabric in middle... (otherwise two mirror GPU would have to be made to have exact pinout alignment.)

Thermal problems and wire routing are an immediate concern. If you mean literally stacked one on top of the other, massive thermal issues and significant limitations in terms of how to get enough wires to the internals for power and data. These chips already have a massive pinout, and whatever is in the stack gets pierced by very large and very disruptive TSVs.

If the GPUs are just on opposite sides of a PCB, it's a significantly more complicated thermal and mounting environment, plus a much more complex layering and wire congestion problem getting enough signals and power into the region. For all that, such a solution does not appear to be much different from having them side by side.

Regardless of any of this, getting signals off-die has an area and power cost, regardless of the exact choice in location of the other die. Closer typically means lower cost, but taking what is currently on-die and massive in terms of signals makes even small costs significant. A global L2 that claims to be as unified for multiple chips as it was within one die is going to be wrestling with wires climbing from tens of nm in size to hundreds/thousands of nm.

I don't know how far AMD has taken IF2.0, but YES... I do b believe you can have - "but cannot change the internals of what it connects to" -

That^^ idea
Is the whole point of AMD's chiplet/fabric design. It is not a dumb interconnect, it can manage it's own data, and interposing, etc. And have different layering of each.

If the fabric changed what it attached to, it would hamper AMD's goal of using it as a general glue to combine IP blocks without forcing either the fabric or the IP block to change when they are combined. For these purposes, it is just an interconnect and cannot make non-coherent hardware coherent.

DmitryKo said:
Nvidia was testing both memory-bound and computation-bound workloads with their theoretical design - these were running non-realtime on their supercomputer simulator. The research paper cites these command-line CUDA workloads simply because these would be much easier to script and profile in this environment.

That seems like a hole too significant to extrapolate over. If a workload is too complex for them to evaluate absent real-time performance requirements, why is it certain that solutions that are declared as working without having complexity or responsiveness as part of their evaluation are sufficient?

All graphics tasks are still either compute-bound or memory bound (or both) - so profiling real games vs memory-bound CUDA tasks would not really change major design decisions about overcoming inefficiencies of far memory accesses.

Graphics tasks can also be synchronization or fixed-function bound, among other things, but that aside I would have liked confirmation that other elements of their workloads could be matched to a game or graphics workload.
A graphics pipeline is a mixture of cooperative and hierarchical tasks, and one area I would like some comparison is whether the modeled cooperative thread array scheduling behavior they modeled fit. Did warp and kernel behaviors align in terms of their access patterns, dependences, and behaviors? Some of the compute workloads were characterized as being relatively clean in how they wrote to specific ranges or could work well with a first-touch page allocation strategy.
Are graphics contexts as clean, or which compute workloads may have complex input/output behaviors that have been found well-handled?

Do the non-evaluated graphics workloads have similar locality behaviors? Can the behavior swing enough to hinder the L1.5 or upset the mix of local/remote accesses? Not finding significant problems with unrelated workloads is not comfortable for me.

Then there are architectural questions, such as with GCN. Does the remote-coherent memory hierarchy break elements like DCC? In that case, there would be graphics workloads that weren't limited by a resource in a monolithic context that become bottlenecked after. If there are scheduling methods or optimizations that avoid this, are they included in the CTA scheduling methods discussed?

As a counterpoint, if the CTA scheduling was straightforward and transparent for the profiled workloads, how should I look at the explicitly exposed thread management of the geometry front end with task and mesh shaders? It might be nice if there were a profiled workload that would indicate that whatever problem those shaders address has been handled.

del42sa · Jan 6, 2019

DmitryKo said:
No. He said that game developers avoid coding for explicit multi-GPU - as obviously very few gamers have more than one video card, unlike professional/blockchain/AI users - and that given proper OS support, a MCM-GPU with ccNUMA memory is very possible...

...which could be presented as a single video card to the developer, according to Nvidia research papers above, and thus the egg-chicken problem would be solved.

and you seem fail to understand, that that there is difference between HPC computing and running 3D game. On HPC we already use more GPU´s to run task in parallel and it doesn´t really matter if those GPU´s are on separate boards/PCIe or if you just put them all on one PCB/Interposer.

DmitryKo · Jan 9, 2019

3dilettante said:
transparent multi-GPU was stopped solely by the lack of unified memory, and that--given what we know now--there's nothing technically blocking this.

Lack of "shared memory" with "uniform memory access". Not "unified memory" or unified access.

there is a set of hardware elements in the graphics pipeline that do not engage with the L2--either in part or fully. Whatever the L2 does wouldn't matter to them

Every part of the pipeline has to read from memory, so the all data necessarily passes through the L2 cache - even fixed function parts like hierarchical Z-compression which probably have their own private caches not documented in the ISA document.

just saying an L2 can be unified with an unbounded number of clients, no timeline, and without a clear description of where they are physically is too fuzzy to know where the bottlenecks or problems may be

It's perfectly fine to apply generalisations learned from the past 35-40 years that superscalar out-of-order microprocessors have been researched and marketed.

So far it's not even certain that ccNUMA support would require modifications to the instruction format or a new microarchitecture to extend load-store microops to full 48-bit (or 52-bit) virtual memory space. Unless they really need a four-chip design each addressing a 1TB SSD, I'd guess their current 40-bit PA architecture would be good enough for another iteration.

If a workload is too complex for them to evaluate absent real-time performance requirements, why is it certain that solutions that are declared as working without having complexity or responsiveness as part of their evaluation are sufficient?

How running a memory-bound task in real-time would be different from running it non-realtime in a simulator?

Chip developers do not really play commercial computer games in real-time to verify their design decisions. They would rather work with ISVs to profile their applications then recreate these workloads using scriptable non-realtime tasks. But honestly they don't even need to do that, since in-house programmers will make enough guidance as to the expected bottlenecks. And they probably have enough Altera FCPGA boards to test significant parts of the chip in the hardware simulator (still slow, but 'only' three orders of magnitude slower than the real thing) before actual tape-out.

https://rys.sommefeldt.com/post/semiconductors-from-idea-to-product/
https://rys.sommefeldt.com/post/drivers/

I would have liked confirmation that other elements of their workloads could be matched to a game or graphics workload.
Do the non-evaluated graphics workloads have similar locality behaviors? Not finding significant problems with unrelated workloads is not comfortable for me.

Both memory-intensive and compute-intensive workloads were evaluated - they just give more details on memory-intensive tests since this was the object of the study, and compute-intensive workloads show little variation with memory bandwidth.

Does the remote-coherent memory hierarchy break elements like DCC? In that case, there would be graphics workloads that weren't limited by a resource in a monolithic context that become bottlenecked after. If there are scheduling methods or optimizations that avoid this, are they included in the CTA scheduling methods discussed? As a counterpoint, if the CTA scheduling was straightforward and transparent for the profiled workloads, how should I look at the explicitly exposed thread management of the geometry front end with task and mesh shaders? It might be nice if there were a profiled workload that would indicate that whatever problem those shaders address has been handled.

Yes, it would be nice, but unfortunately they've just stopped a little short of starting a comprehensive developer's guide to modern high-performance GPUs

Framebuffer and Z/W buffer access is one part that would be hard to optimize for locality. One strategy would be designating a 'master'

GPU to hold a single local copy of the frame and Z/W buffers; ahother would be to

let graphics driver relocate physical memory pages to another GPU on-demand based on performance counters. Tile-based rendering approach also comes to mind.

DmitryKo · Jan 9, 2019

del42sa said:
and you seem fail to understand, that that there is difference between HPC computing and running 3D game. On HPC we already use more GPU´s to run task in parallel and it doesn´t really matter if those GPU´s are on separate boards/PCIe or if you just put them all on one PCB/Interposer.

Is that your answer to my question: "how would you optimize ccNUMA differently"?

BRiT · Jan 9, 2019

AMD Radeon VII thread: https://forum.beyond3d.com/threads/amd-radeon-vii-announcement-and-discussion.61039/

3dilettante · Jan 9, 2019

DmitryKo said:
Every part of the pipeline has to read from memory, so the all data necessarily passes through the L2 cache - even fixed function parts like hierarchical Z-compression which probably have their own private caches not documented in the ISA document.

DCC's hardware path was documented as not going through the L2. The ROP caches spill into the L2, but for non-ROP clients to read that data there's a heavy synchronization event. The command processor needs to be stalled, and the ROP caches and potentially the L2 flushed. Vega is documented as still potentially needing an L2 flush in certain configurations, if there is some kind of alignment mismatch between the RBEs and L2. This potentially is related to some of the smaller variants of the design.
Vega's geometry engine was ambiguously described as having driver-selected L2-backed streamout and/or the more traditional parameter cache.
Other elements of the graphics pipeline are incompletely described in how they interact.

DCC is the most significantly separate data path disclosed. The others may use the L2, but that isn't sufficient to say they are coherent. The CU's vector memory path is considered coherent because the L1 is inclusive of the L2 and will write through immediately. Other clients, like the ROPs have different behaviors and require heavier flush events and stalls before their data can be accessed safely.
These are not details a fabric like IF can change. The fabric can carry data and probes, but cannot force clients to generate them consistently.

At present the hardware subject to these restrictions can monitor for status updates from their local pipeline elements for completion time. Historically, that did not include monitoring the readiness of hardware pipelines not directly wired to them.

It's perfectly fine to apply generalisations learned from the past 35-40 years that superscalar out-of-order microprocessors have been researched and marketed.

That history is after the functionality was consolidated onto one die. I was responding to a claim ambiguous as to whether this upscaled GCN L2 would be off-chip, since some of the statements made seemed to indicate they could be.
A straightforward extrapolation of the relationship between on-die connections and solder balls is the expectation that data paths expand by several orders of magnitude in size and power, and this is before including the claim that it would then be used to connect to multiple other GPUs with just as many clients.

Chip developers do not really play commercial computer games in real-time to verify their design decisions.

Were there compute workloads profiled that had response time or QoS requirements comparable to a graphics workload? It may be possible to extrapolate partially if the testing framework kept track of the perceived timing within the simulation to determine it. It's still a jump saying that if an MCM can support an interactive workload on a characterized compute path N, it can work well with undefined hardware graphics path X.

Both memory-intensive and compute-intensive workloads were evaluated - they just give more details on memory-intensive tests since this was the object of the study, and compute-intensive workloads show little variation with memory bandwidth.

There can be variations in the reasons behind why a workload is is found to be bandwidth or ALU intensive. A large and uniform compute workload can have predictable behaviors and linear fetch behaviors. This can allow it to be tuned to fit within the limits of the L1.5's capacity, and can allow for wavefronts to be directed to launch where it is known they have the most locality or to adjust their behavior to fit the locale they find themselves in.

Is there a similar level of control over the access patterns of a given graphics wavefront to do so as accurately? The GPU architectures historically dedicated a lot of hardware to provide consistent access at high bandwidth for a significant amount of uncorrelated behavior. That created an assumption about how important placement in the GPU was relative to other resources, and for much of the architecture it wasn't important. A CU sees a very uniform environment, and a lot of the tiling behaviors and data paths strive to distribute work to take advantage of uniformity that an MCM no longer has.

Deleted member 13524 · Jan 10, 2019

Honest question: why aren't we discussing Radeon VII based on Vega 20 on a thread that says "(..) Vega 20 rumors and discussion"?

Radeon VII is most probably the last Vega GPU from AMD, so if we aren't talking about that GPU here, what's the point of this thread?
Is it to trim thread size?

BRiT · Jan 10, 2019

ToTTenTranz said:
Honest question: why aren't we discussing Radeon VII based on Vega 20 on a thread that says "(..) Vega 20 rumors and discussion"?

Radeon VII is most probably the last Vega GPU from AMD, so if we aren't talking about that GPU here, what's the point of this thread?
Is it to trim thread size?

Because this one is more about rumors and now that its announced, its not a rumor?

But mostly because multi-year mega threads suck.

DmitryKo · Jan 10, 2019

3dilettante said:
DCC's hardware path was documented as not going through the L2. The ROP caches spill into the L2, but for non-ROP clients to read that data there's a heavy synchronization event. The command processor needs to be stalled, and the ROP caches and potentially the L2 flushed.
At present the hardware subject to these restrictions can monitor for status updates from their local pipeline elements for completion time. Historically, that did not include monitoring the readiness of hardware pipelines not directly wired to them.

Well, L1/L1.5/L2 hierarchy is just one of potential designs.

According to the Vega architecture whitepaper and Vega ISA document, HBCC is already a full-blown cross-bar controller, with HBM memory and NVRAM memory interfaces, vritual page memory support, and Infinity Fabric interface to local L2 caches.

So they might be able to remove the memory controller to a separate die, connected with L2 caches in each respective GPU chiplet through Infinity Fabric. This would require roughly the same number of chiplets in a four-part configuration - 1 memory/IO controller, 4 HBM2/HBM3, and 4 GPU. This way L2 remains on each GPU, and the crossbar/memory controller may implement advanced coherency protocols and/or additional cache memory to mitigate possible stalls.

Or they might be able to design sufficiently fast Infinity Fabric connections so that each separate memory controller acts like a memory channel in a global distributed crossbar/memory controller.

A straightforward extrapolation of the relationship between on-die connections and solder balls is the expectation that data paths expand by several orders of magnitude in size and power, and this is before including the claim that it would then be used to connect to multiple other GPUs with just as many clients.

Let's consider the 9-die setup above (with 4 HBM2 modules, 1 memory IO, and 4 GPU). Would it really require "orders of magnitude" in comparison to existing Vega design?

It may be possible to extrapolate partially if the testing framework kept track of the perceived timing within the simulation to determine it. It's still a jump saying that if an MCM can support an interactive workload on a characterized compute path N, it can work well with undefined hardware graphics path X.
CU sees a very uniform environment, and a lot of the tiling behaviors and data paths strive to distribute work to take advantage of uniformity that an MCM no longer has.

Let's just conclude that these results are not convincing enough for you. We wouldn't really know if these details would result in a substantially different real-world performance, without access to Nvidia's library of commercial-grade building blocks and expensive simulation hardware, save for the ability to tape-out a working prototype.

Were there compute workloads profiled that had response time or QoS requirements comparable to a graphics workload?

Here is the full list of memory-intensive workloads Nvidia used for the two research papers above and their memory footprints (from tables 4 and 2 of respective publications).

Memory intensive high-parallelism workloads and memory footprints
Neural Network Convolution NN-Conv 0.5GB
Stream Triad Stream 3GB
SRAD V2 Srad-V2 96MB
Lulesh (size 150) Lulesh1 1.9GB
Shortest path SSSP 37 MB
Lulesh (size 190) Lulesh2 4.3GB
Adaptive Mesh Refinement MiniAMR 5.4GB
Kmeans clustering Kmeans 216MB
Nekbone solver (size 18) Nekbone1 1.8GB
Lulesh unstructured Lulesh3 203MB
Breadth First Search BFS 37MB
Mini Contact Solid Mechanics MnCtct 251MB
Nekbone solver (size 12) Nekbone2 287MB
Algebraic multigrid solver AMG 5.4GB
Minimum Spanning Tree MST 73MB
Computer Fluid Dynamics Euler3D CFD 25MB
Classic Molecular Dynamics CoMD 0.4GB
Sorted by sensitivity to inter GPM bandwidth
M-Intensive
C-Intensive
Lim.Parallel
Benchmark Time-weighted average CTAs Memory footprint
Rodinia-Euler3D 1008 25MB
HPC-AMG 241549 3744MB
HPC-RSBench 7813 19MB
HPC-CoMD-Ta 5724 394MB
HPC-Lulesh-Unstruct-Mesh2 4940 208MB
HPC-Lulesh 12202 578MB
HPC-Nekbone-Large 583 294MB
Lonestar-MST-Mesh 895 75MB
HPC-Lulesh-Unstruct-Mesh1 435 19MB
Lonestar-SP 75 8MB
HPC-CoMD-Wa 13691 393MB
HPC-HPGMG-UVM 10436 1571MB
Rodinia-BFS 1954 38MB
HPC-CoMD 3588 88MB
Lonestar-SSSP-Wlc 163 21MB
HPC-MCB 50001 162MB
ML-AlexNet-cudnn-Lev2 1250 832MB
Rodinia-Gaussian 2599 78MB
Other-Optix-Raytracing 3072 87MB
HPC-MiniContact-Mesh2 15423 257MB
Lonestar-MST-Graph 770 86MB
HPC-Namd2.9 3888 319MB
Lonestar-SSSP-Wln 60 21MB
HPC-MiniContact-Mesh1 250 21MB
ML-GoogLeNet-cudnn-Lev2 6272 1205MB
Rodinia-Hotspot 7396 64MB
HPC-SNAP 200 744MB
ML-AlexNet-cudnn-Lev4 1014 32MB
Rodnia-Pathfinder 4630 1570MB
Lonestar-SSSP 1046 38MB
HPC-MiniAMR 76033 2752MB
HPC-HPGMG 10506 1571MB
Lonestar-DMR 82 248MB
Rodinia-Srad 16384 98MB
Rodinia-Backprop 4096 160MB
Other-Stream-Triad 699051 3146MB
Other-Bitcoin-Crypto 60 5898MB
ML-AlexNet-ConvNet2 6075 97MB
HPC-RabbitCT 31072 524MB
ML-OverFeat-cudnn-Lev3 1800 388MB
Rodinia-Kmeans 3249 221MB

I fully trust you to devote sufficient time for thorough analysis of these workloads; please come back to us with complete detailed results.

3dilettante · Jan 10, 2019

DmitryKo said:
Well, L1/L1.5/L2 hierarchy is just one of potential designs.

If the discussion is about hardware that doesn't really use the caches in a consistent fashion anyway, changes to cache hierarchy don't affect them.
The research opted to evaluate workloads and architectures that do not use those elements.

According to the Vega architecture whitepaper and Vega ISA document, HBCC is already a full-blown cross-bar controller, with HBM memory and NVRAM memory interfaces, vritual page memory support, and Infinity Fabric interface to local L2 caches.

This is outside the GPU's core area, and the IF interface isolates the sides from each other. It is designed to track references and perform data moves on occasional faults, but it doesn't implement a cache coherence protocol since it works in response to outbound accesses from the L2.

So they might be able to remove the memory controller to a separate die, connected with L2 caches in each respective GPU chiplet through Infinity Fabric. This would require roughly the same number of chiplets in a four-part configuration - 1 memory/IO controller, 4 HBM2/HBM3, and 4 GPU. This way L2 remains on each GPU, and the crossbar/memory controller may implement advanced coherency protocols and/or additional cache memory to mitigate possible stalls.

The L2 itself is not designed to be snooped, so it wouldn't be a peer in a basic coherence protocol. If a client doesn't perform many functions expected of a coherent client, it won't perform them just because it is plugged into a bigger interconnect.
In this chiplet architecture, the IO die has 4 HBM interfaces and at least 4 links to the satellite GPU chiplets. The IO die would be large, with two sides dominated by HBM stacks, and only two other sides for GPU links. Going by the presentation on Rome, a lot of the perimeter of a large IO die is for IF links, when the DDR system bandwidth is 5-10X below the proposed HBM system bandwidth. Any solution trying to leverage those links needs to adjust for a load an order of magnitude larger.

Power-wise, the off-die links will consume additional power--tens of watts just to move from the IO die to the GPU die in addition to the existing HBM's power cost. There isn't discussion about how all the graphics pipeline events would be handled, since many were direct signals or internal buffer updates on-die with no corresponding memory or IF messaging, and the individual L2s would be incoherent with each other in a way that GCN does not allow architecturally.

Let's consider the 9-die setup above (with 4 HBM2 modules, 1 memory IO, and 4 GPU). Would it really require "orders of magnitude" in comparison to existing Vega design?

I was addressing an ambiguous design scenario where the L2 was possibly off-die while still serving the same role as it does when adjacent to the CUs. That's potentially 16 or more L2 slices per GPU, with each capable of 64B in each direction, with 4 or more GPUs apparently able to plug into it. Package or interposer links have their data paths in the hundreds or tens of nm, and ubumps have 40-50 microns for their pitch. 64K interposer traces, TSVs, or ubumps is more pad space than GPU.

Let's just conclude that these results are not convincing enough for you.

I think they pointedly avoided evaluated graphics workloads. Since they work for a vendor whose business is still heavily invested in pixel pushing, not mentioning pixels once seems like it wasn't an accident.

Here is the full list of memory-intensive workloads Nvidia used for the two research papers above and their memory footprints (from tables 4 and 2 of respective publications).

I fully trust you to devote sufficient time for thorough analysis of these workloads; please come back to us with complete detailed results.

I am not familiar with many of those workloads, though some of the HPC loads and the initial entries I ran through google may point to well-structured problems. Neural net training is treated at a hardware level like a lot of matrix multiplication. Stream and HPC workloads often focus on matrix math. The bitcoin-type entry is a lot of very homogenous hashing. Which ones share which behaviors with the graphics pipeline is unclear to me. My position is that if they don't mention graphics as being applicable, I am not inclined to make the leap for them.

DmitryKo · Jan 13, 2019

3dilettante said:
changes to cache hierarchy don't affect them

The point is, larger caches and/or faster links with improved protocols may allow further unification - like it happened in Vega10 for pixel and geometry engines which were connected to the L2 cache.

the IF interface isolates the sides from each other
it doesn't implement a cache coherence protocol
The L2 itself is not designed to be snooped

A matter for implementation then, and AMD are experts in cache coherent multiprocessor systems.

the individual L2s would be incoherent with each other in a way that GCN does not allow architecturally.

But how this would be different from the current architecture?

Vega whitepaper and Vega ISA document imply that L2 is split between separate memory controllers/channels and interconnected through IF/crossbar.

The IO die would be large, with two sides dominated by HBM stacks
Any solution trying to leverage those links needs to adjust for a load an order of magnitude larger.
the off-die links will consume additional power--tens of watts just to move from the IO die to the GPU die in addition to the existing HBM's power cost.

What if we scale down the package to 2 HBM and 2 GPU dies and a 7nm respin of the IO die?

Could lower voltage logic maintain signal integrity?

The research opted to evaluate workloads and architectures that do not use those elements.
I think they pointedly avoided evaluated graphics workloads.
if they don't mention graphics as being applicable, I am not inclined to make the leap for them.

Let's agree to disagree.
I'd fully expect them to have full access to a comprehensive suite of proprietary tests used on the same simulator hardware during the course of actual GPU development. Just like I'd expect them to withhold that information from the public.
And no, I wasn't really expecting you to actually inspect these workloads.

AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

w0lfram

3dilettante

w0lfram

Rys

Graphics @ AMD

w0lfram

DmitryKo

del42sa

DmitryKo

Rys

Graphics @ AMD

3dilettante

del42sa

DmitryKo

DmitryKo

BRiT

(>• •)>⌐■-■ (⌐■-■)

3dilettante

Deleted member 13524

Guest

BRiT

(>• •)>⌐■-■ (⌐■-■)

DmitryKo

3dilettante

DmitryKo