AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Bondrewd · Aug 11, 2019

Frenetic Pony said:
CXL seems the least interesting

Also the simplest one, which significantly lowers the chances of non-host implementations being all kinds of fucked up.

Frenetic Pony said:
Regardless, I'd think the question is how much workloads even saturate Express 3 now, let alone 4

Try plugging your fancy new 200GbE NIC into PCIe3 x16 and look at it go!

Shaklee3 · Aug 11, 2019

Bondrewd said:
Also the simplest one, which significantly lowers the chances of non-host implementations being all kinds of fucked up.

Try plugging your fancy new 200GbE NIC into PCIe3 x16 and look at it go!

I know it's not common, but I have workloads that saturate PCIe. And anyone with modern InfiniBand is doing it as well.

Also, the Mellanox ConnectX-6 supports dual 200Gbps ports, so it's already limited by the interface speed. And 200Gbps switches are being released now, with 400 coming by the end of the year. So the interconnect is again the bottleneck.

w0lfram · Aug 12, 2019

But that is not the same bottleneck that we are discussing in Navi.

I always go back to what made crossfire & SLI good & bad, and why scaling never was (and never could be) 100%. (Or the pros/cons of a dual-gpu card.) How many lanes/links did SLI have, and at what speed..? And if we double that? Triple that? What is the next bottleneck?

What stops AMD from making a multi-gpu design, like Ryzen? From my understanding, AMD's Infinity Fabric (2.0) is fast and can have many lanes (bandwidth), so what is stopping AMD from linking up all the GPUs to the same level 2 cache? There are many threads popping up asking about IF's ability to connect GPUs in some fashion, and looking at Navi's block diagram there is fabric there..

So if you extend that fabric off the chip, then RDNA can be extended indefinitely..

Shaklee3 · Aug 12, 2019

w0lfram said:
But that is not the same bottleneck that we are discussing in Navi.

I always go back to what made crossfire & SLI good & bad, and why scaling never was (and never could be) 100%. (Or the pros/cons of a dual-gpu card.) How many lanes/links did SLI have, and at what speed..? And if we double that? Triple that? What is the next bottleneck?

What stops AMD from making a multi-gpu design, like Ryzen? From my understanding, AMD's Infinity Fabric (2.0) is fast and can have many lanes (bandwidth), so what is stopping AMD from linking up all the GPUs to the same level 2 cache? There are many threads popping up asking about IF's ability to connect GPUs in some fashion, and looking at Navi's block diagram there is fabric there..

So if you extend that fabric off the chip, then RDNA can be extended indefinitely..

Yes, I agree. What's stopping them from adding more lanes is the question. At the end of the day these are just a bunch of 20-25GHz lanes stacked up. And I think it's important that the GPU-to-GPU connection continues to get faster until memory bandwidth is the limit.

Malo · Aug 12, 2019

w0lfram said:
What stops AMD from making a multi-gpu design, like Ryzen?

There is no GPU-equivalent of NUMA for starters, and mGPU is something each game developer has to support. There's extensive discussion of the problem in this thread and others.

Rootax · Aug 12, 2019

And IF links (or any links) are not free for the power envelope...

pTmdfx · Aug 12, 2019

w0lfram said:
What stops AMD from making a multi-gpu design, like Ryzen? From my understanding, AMD's Infinity Fabric (2.0) is fast and can have many lanes (bandwidth), so what is stopping AMD from linking up all the GPUs to the same level 2 cache? There are many threads popping up asking about IF's ability to connect GPUs in some fashion, and looking at Navi's block diagram there is fabric there...

The fabric connecting the GPU constituents and the L2 cache has a vastly different set of requirements to the Infinity Fabric.

The GPU linear memory is interleaved between all the memory channels at a fine granularity to take advantage of the data parallelism, maximize bandwidth and minimize conflict. The L2 cache is partitioned to match the number of memory channels, and therefore each partition is connected to every client due to the fine-grained interleaving.

Let's do some quick math. Navi has all its units bundled to four L1 caches, and all L1 caches can move 4x 64B per clock to the L2. The boost clock of Navi is apparently around 1900 MHz. So for each L1 (5 CUs), you will need at least a link of 4 ports * 64B * 1.9 GHz = 486.4 GB/s, which is close to the total bandwidth of two PCIe 5.0 x16 links. Given that Navi 10 has four L1 caches and a "Hub" client on the L1-L2 fabric, Navi 10 theoretically moves 2.375 TB of data through its L1-L2 fabric per second. If you assume the "Big Navi" is going to have 64 CUs, that's probably 4.275 TB/s, or ~17 PCIe 5.0 x16 links.
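The arithmetic in the post above can be re-derived with a short script. This is a minimal sketch, not anything the poster supplied: the port count, port width and clock are the post's own estimates, while the eight-L1 configuration for a 64 CU part and the 256 GB/s per-link figure are assumptions chosen because they reproduce the quoted 4.275 TB/s and ~17 links. The function names are hypothetical.

```python
# Figures taken from the post above; they are the poster's estimates,
# not confirmed hardware specifications.
PORTS_PER_L1 = 4        # 64-byte ports from each L1 to the L2
BYTES_PER_PORT = 64
BOOST_CLOCK_GHZ = 1.9


def l1_link_gb_per_s(ports=PORTS_PER_L1, width=BYTES_PER_PORT,
                     clock_ghz=BOOST_CLOCK_GHZ):
    """Peak bandwidth of a single L1-to-L2 link in GB/s (1e9 bytes/s)."""
    return ports * width * clock_ghz


def fabric_gb_per_s(num_l1, hub_clients=1):
    """Aggregate L1-L2 fabric bandwidth: every L1 plus the "Hub" client."""
    return (num_l1 + hub_clients) * l1_link_gb_per_s()


def links_needed(total_gb_per_s, per_link_gb_per_s):
    """How many external links of a given speed would match the fabric."""
    return total_gb_per_s / per_link_gb_per_s


navi10 = fabric_gb_per_s(num_l1=4)
# Assumption: eight L1s plus the hub for a 64 CU part. This is inferred
# because it reproduces the post's 4.275 figure; the post does not state it.
big_navi = fabric_gb_per_s(num_l1=8)

print(f"per L1:   {l1_link_gb_per_s():.1f} GB/s")
print(f"Navi 10:  {navi10:.1f} GB/s = {navi10 / 1024:.3f} TB/s (1 TB = 1024 GB)")
print(f"Big Navi: {big_navi:.1f} GB/s = {big_navi / 1024:.3f} TB/s")
# Assumption: 256 GB/s per link, back-derived from the post's "~17 links".
print(f"links at 256 GB/s each: {links_needed(big_navi, 256):.1f}")
```

Note that the TB figures only match the post when 1 TB is taken as 1024 GB. The per-link bandwidth is left as a parameter because the link count changes a lot depending on whether one direction or both directions of a link are counted.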

With this large amount of bandwidth to move, 2.5D TSV would probably be the sole viable option, and even the best on-package SerDes is going to be off the table IMO. No matter how you split the 64 CU Navi into multiple chips (equivalent half, or Rome style), you would still need to move 4.275 TB/s between the chips to maintain parity, which translates into a ~9000-bit HBM2E equivalent interface.

I have no clue if this would be all worth it. But my two cents is that given a mature process, GPU vendors would probably still prefer to build monolithic GPUs to save these precious joules at this scale of data movement.

Frenetic Pony · Aug 13, 2019

pTmdfx said:
The fabric connecting the GPU constituents and the L2 cache has a vastly different set of requirements to the Infinity Fabric.

The GPU linear memory is interleaved between all the memory channels at a fine granularity to take advantage of the data parallelism, maximize bandwidth and minimize conflict. The L2 cache is partitioned to match the number of memory channels, and therefore each partition is connected to every client due to the fine-grained interleaving.

Let's do some quick math. Navi has all its units bundled to four L1 caches, and all L1 caches can move 4x 64B per clock to the L2. The boost clock of Navi is apparently around 1900 MHz. So for each L1 (5 CUs), you will need at least a link of 4 ports * 64B * 1.9 GHz = 486.4 GB/s, which is close to the total bandwidth of two PCIe 5.0 x16 links. Given that Navi 10 has four L1 caches and a "Hub" client on the L1-L2 fabric, Navi 10 theoretically moves 2.375 TB of data through its L1-L2 fabric per second. If you assume the "Big Navi" is going to have 64 CUs, that's probably 4.275 TB/s, or ~17 PCIe 5.0 x16 links.

With this large amount of bandwidth to move, 2.5D TSV would probably be the sole viable option, and even the best on-package SerDes is going to be off the table IMO. No matter how you split the 64 CU Navi into multiple chips (equivalent half, or Rome style), you would still need to move 4.275 TB/s between the chips to maintain parity, which translates into a ~9000-bit HBM2E equivalent interface.

I have no clue if this would be all worth it. But my two cents is that given a mature process, GPU vendors would probably still prefer to build monolithic GPUs to save these precious joules at this scale of data movement.

Design costs are growing exponentially; just designing one chiplet and then multiplying it might eventually become cost effective. But it'd help if there were low-resistance interconnect materials. Graphene just as a conducting wiring material would be a huge boon here, and you'd not even need to make it semiconducting; with close to no leakage, this would make the bandwidth power costs drop dramatically in scenarios like this. Of course this assumes any chip maker would bother with this. For the present it still seems like they're all fixated on their kamikaze course of using silicon/high-k metals/etc. until the point, soonish, when no one on earth can actually afford a new fab, let alone design chips on "1 nanometer" or whatever they'll end up calling it.

Bondrewd · Aug 13, 2019

Frenetic Pony said:
Graphene just as a conducting wiring material would be a huge boon here, and you'd not even need to make it semiconducting

We're having trouble getting Co into the metal stack and you're suggesting replacing metal entirely which is uuugh.

Frenetic Pony · Aug 13, 2019

Bondrewd said:
We're having trouble getting Co into the metal stack and you're suggesting replacing metal entirely which is uuugh.

It seems the only practical way forward; otherwise progress is going to just stop completely. Every node gets exponentially more expensive at every step, from R&D to building the fab to designing the chips. Cost per transistor going down has been a thing of the past for a while; when cost per transistor starts getting more expensive with each node, customers will just stop paying for them.

Graphene conducts with almost no resistance already, and its electron mobility presents the possibility of multi-terahertz clock speeds using no more power than today's logic does. That's the equivalent of twenty straight years of Moore's Law in absolutely ideal terms: actual Moore's Law where density doubles every two years, not the modern reinterpretation of "well we're lucky it's advancing at all". When fabs alone cost 15 billion dollars and rising, and there are ever fewer cutting-edge foundries even trying to shrink gates, pretending that silicon will keep advancing feels disingenuous.

w0lfram · Aug 13, 2019

pTmdfx said:
The fabric connecting the GPU constituents and the L2 cache has a vastly different set of requirements to the Infinity Fabric.

The GPU linear memory is interleaved between all the memory channels at a fine granularity to take advantage of the data parallelism, maximize bandwidth and minimize conflict. The L2 cache is partitioned to match the number of memory channels, and therefore each partition is connected to every client due to the fine-grained interleaving.

Let's do some quick math. Navi has all its units bundled to four L1 caches, and all L1 caches can move 4x 64B per clock to the L2. The boost clock of Navi is apparently around 1900 MHz. So for each L1 (5 CUs), you will need at least a link of 4 ports * 64B * 1.9 GHz = 486.4 GB/s, which is close to the total bandwidth of two PCIe 5.0 x16 links. Given that Navi 10 has four L1 caches and a "Hub" client on the L1-L2 fabric, Navi 10 theoretically moves 2.375 TB of data through its L1-L2 fabric per second. If you assume the "Big Navi" is going to have 64 CUs, that's probably 4.275 TB/s, or ~17 PCIe 5.0 x16 links.

With this large amount of bandwidth to move, 2.5D TSV would probably be the sole viable option, and even the best on-package SerDes is going to be off the table IMO. No matter how you split the 64 CU Navi into multiple chips (equivalent half, or Rome style), you would still need to move 4.275 TB/s between the chips to maintain parity, which translates into a ~9000-bit HBM2E equivalent interface.

I have no clue if this would be all worth it. But my two cents is that given a mature process, GPU vendors would probably still prefer to build monolithic GPUs to save these precious joules at this scale of data movement.

Thank you for your reply.

But if the fabric is already handling all that, why can't another GPU be piggy-backed on the back side of it, sharing that same information. Two GPUs sandwiched with Fabric in the middle... yum!

And "laymanistically", SLI and Crossfire worked in many games, but scaled horribly (due to alternate frame rendering)... Fabric is much faster. So where is the drawback in having something like I just mentioned?

milk · Aug 13, 2019

yeah, even if Fabric doesn't scale as well as a single huge gpu, it still must scale better than two discrete cards connected through the mobo plus an extra ad hoc connector between them, no?

Ethatron · Aug 13, 2019

w0lfram said:
... but scaled horribly (due to alternate frame rendering)...

Then don't do AFR. Stereo rendering scales perfectly, even with transferring the backbuffer over PCIe. 2 slaves for scene/light-passes in half-screen and a master integrated GPU for post would be the perfect setup.

pTmdfx · Aug 14, 2019

w0lfram said:
Thank you for your reply.

But if the fabric is already handling all that, why can't another GPU be piggy-backed on the back side of it, sharing that same information. Two GPUs sandwiched with Fabric in the middle... yum!

And "laymanistically", SLI and Crossfire worked in many games, but scaled horribly (due to alternate frame rendering)... Fabric is much faster. So where is the drawback in having something like I just mentioned?

Infinity Fabric is merely a marketing umbrella term — effective performance depends on the actual application and the implementation (link width, link speed, and most importantly, topology).

It is definitely possible to build a huge GPU that has an L1-L2 fabric spanning multiple chips. What I am trying to convey is that the fabric as it is on a monolithic GPU moves a ridiculous amount of data at terabyte scale. Most importantly, all CUs have equal bisection bandwidth & latency access to all memory channels (interleaving at a 256-byte granularity). This is what sets a monolithic GPU apart from a setup of multiple GPUs, which requires implicit/explicit mGPU/AFR/SFR, whatever you call these NUMA approaches.
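To make the interleaving point concrete, here is a minimal sketch of how a linear address could map to a memory channel (and its matching L2 partition) at the 256-byte granularity mentioned above. The plain modulo mapping and the channel count of 16 are assumptions for illustration only; real GPUs may hash address bits, and the function names are hypothetical.

```python
INTERLEAVE_BYTES = 256   # granularity quoted in the post
NUM_CHANNELS = 16        # hypothetical channel count, for illustration only


def channel_for_address(addr, num_channels=NUM_CHANNELS,
                        granule=INTERLEAVE_BYTES):
    """Channel (and L2 partition) that owns the byte at this linear address.

    Assumption: simple round-robin interleave, no address hashing.
    """
    return (addr // granule) % num_channels


def channels_touched(start, length, num_channels=NUM_CHANNELS,
                     granule=INTERLEAVE_BYTES):
    """Sorted list of channels hit by a contiguous access of `length` bytes."""
    first = start // granule
    last = (start + length - 1) // granule
    return sorted({g % num_channels for g in range(first, last + 1)})


# A single contiguous 4 KiB buffer already spreads over every channel.
print(channels_touched(0, 4096))
```

Under these assumptions even a small buffer is striped across every partition, which is why each L1 needs a full-bandwidth path to every L2 partition rather than to a "local" subset.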

Going off-chip is, again, definitely possible, but it might not be practical. We are talking about 4TB/s for 64 CUs at 1.9 GHz — you either go for many high speed SerDes (cost you power), or you go for many wider links (cost you space and I/O density limitations). To (again) put it in numbers, 4 TB/s is equivalent to >48 on-package IFOP links.

3dilettante · Aug 14, 2019

pTmdfx said:
Let's do some quick math. Navi has all its units bundled to four L1 caches, and all L1 caches can move 4x 64B per clock to the L2. The boost clock of Navi is apparently around 1900 MHz. So for each L1 (5 CUs), you will need at least a link of 4 ports * 64B * 1.9 GHz = 486.4 GB/s, which is close to the total bandwidth of two PCIe 5.0 x16 links. Given that Navi 10 has four L1 caches and a "Hub" client on the L1-L2 fabric, Navi 10 theoretically moves 2.375 TB of data through its L1-L2 fabric per second. If you assume the "Big Navi" is going to have 64 CUs, that's probably 4.275 TB/s, or ~17 PCIe 5.0 x16 links.

The Navi cache diagram also has the links between cache levels denoted with bidirectional arrows. It's possible that this is 2*4*64B for one L1 to the L2, which bloats the number of connections further.
Even with 2.5D integration, the nearest example in terms of wire count and speed would be an HBM2 PHY. That's 128B worth of connections for an HBM stack, versus 4x that much just for one set of L1 to L2 connections.
The PHY alone, to get anywhere near that bandwidth, would cost a lot of area on both sides of the connection, so additional GPU chips would lose a large amount of area in the attempt.

w0lfram said:
Thank you for your reply.

But if the fabric is already handling all that, why can't another GPU be piggy-backed on the back side of it, sharing that same information. Two GPUs sandwiched with Fabric in the middle... yum!

And "laymanistically", SLI and Crossfire worked in many games, but scaled horribly (due to alternate frame rendering)... Fabric is much faster. So where is the drawback in having something like I just mentioned?

The transfer rate of the board links or PCIe bus was one potential bottleneck. Other bottlenecks include heavy synchronization overhead and frequent trips to the driver to manage the GPUs. A lot of that management can come from a graphics context that doesn't allow for multiple GPUs to be treated as a single unit, and an architecture that doesn't define correct behavior for multiple GPUs.
A GPU piggy-backing on the fabric has its own L2, internal controllers, countless buffers, and barely coherent caches. None of them know what to do with the equivalent resources in the other GPU, and unless you are doing something with duplicated contexts, like stereoscopic rendering, sharing means more frequent pipeline stalls, trips to the driver, and cache flushes--or incorrect behavior and crashes.

The infinity fabric in this case is a dumb pipe that cannot make the dumber hardware on either end smarter.
The disclosures on Navi don't really indicate many changes on this front. Maybe someday some of the secondary GFX pipeline elements could be spun into a more flexibly partitioned and distributed graphics context, but right now it seems more likely to be used for better priority management within a GPU.
Other elements like another incoherent cache layer with the L1 make things like sharing worse, and the L2 doesn't appear to be any smarter than before. The DCC everywhere feature means there's as many or more compressors, and if they're like prior generations that means multiple incoherent pipelines in a multi-chip context. There are additional controllers and pipelines elsewhere, and no mention of making them capable of functioning alongside duplicates in a multi-chip scenario.

w0lfram · Aug 14, 2019

3dilettante said:
The Navi cache diagram also has the links between cache levels denoted with bidirectional arrows. It's possible that this is 2*4*64B for one L1 to the L2, which bloats the number of connections further.
Even with 2.5D integration, the nearest example in terms of wire count and speed would be an HBM2 PHY. That's 128B worth of connections for an HBM stack, versus 4x that much just for one set of L1 to L2 connections.
The PHY alone, to get anywhere near that bandwidth, would cost a lot of area on both sides of the connection, so additional GPU chips would lose a large amount of area in the attempt.

The transfer rate of the board links or PCIe bus was one potential bottleneck. Other bottlenecks include heavy synchronization overhead and frequent trips to the driver to manage the GPUs. A lot of that management can come from a graphics context that doesn't allow for multiple GPUs to be treated as a single unit, and an architecture that doesn't define correct behavior for multiple GPUs.
A GPU piggy-backing on the fabric has its own L2, internal controllers, countless buffers, and barely coherent caches. None of them know what to do with the equivalent resources in the other GPU, and unless you are doing something with duplicated contexts, like stereoscopic rendering, sharing means more frequent pipeline stalls, trips to the driver, and cache flushes--or incorrect behavior and crashes.

The infinity fabric in this case is a dumb pipe that cannot make the dumber hardware on either end smarter.
The disclosures on Navi don't really indicate many changes on this front. Maybe someday some of the secondary GFX pipeline elements could be spun into a more flexibly partitioned and distributed graphics context, but right now it seems more likely to be used for better priority management within a GPU.
Other elements like another incoherent cache layer with the L1 make things like sharing worse, and the L2 doesn't appear to be any smarter than before. The DCC everywhere feature means there's as many or more compressors, and if they're like prior generations that means multiple incoherent pipelines in a multi-chip context. There are additional controllers and pipelines elsewhere, and no mention of making them capable of functioning alongside duplicates in a multi-chip scenario.

Thanks once again for your insightful reply.

I have the understanding that AMD's Infinity Fabric is not dumb. Part of its patent is that it does double duty as a control fabric as well, and is addressable, too.

We don't ever get to see the real Ryzen Master in AMD labs. But Infinity Fabric controls all aspects of the CPU, why can't it control mirroring L2 cache across multiple GPUs? It looks like even the two Shader Engines sit on IF, along with 64-bit memory controllers, Geometry Processor, display engine, etc all sit on the fabric..?

If you look at the block diagrams: (RDNA/NAVI slide deck, courtesy of AnandTech)

The reason I am asking, is that it seems that Infinity Fabric can be used to unify shader memory in some fashion. And if/how it correlates to AMD's new cache system with RDNA, making a dual-gpu a thing again(?).

ed: Also, could alternate frame rendering work at lightning speeds, if they don't have unified memory?

pTmdfx · Aug 15, 2019

w0lfram said:
Thanks once again for your insightful reply.

I have the understanding that AMD's Infinity Fabric is not dumb. Part of its patent is that it does double duty as a control fabric as well, and is addressable, too.

The reason I am asking, is that it seems that Infinity Fabric can be used to unify shader memory in some fashion. And if/how it correlates to AMD's new cache system with RDNA, making a dual-gpu a thing again(?).

ed: Also, could alternate frame rendering work at lightning speeds, if they don't have unified memory?

It might be odd to mention an Nvidia Research paper in an AMD thread, but its problem statement does cover well the essential context for this particular topic. The paper itself might also be a testament that such a configuration is perhaps still a novelty, since it requires either bold new ideas in cache hierarchies and memory management to overcome the loss of interconnect bandwidth, or a novel inter-chip interconnect technology that kicks the problem away. The magical Infinity Fabric bullet alone does not help mend any of these, since they are broader-scoped architectural & reality-of-life (well, physics) problems.

It is the transport plan of the country that requires an overhaul for the bold multi-chip GPU vision; the high-speed transport technologies are merely a consequence of that, i.e. an implementation detail.

3dilettante · Aug 15, 2019

w0lfram said:
Thanks once again for your insightful reply.

I have the understanding that AMD's Infinity Fabric is not dumb. Part of its patent is that it does double duty as a control fabric as well, and is addressable, too.

I may have failed to clearly indicate that I was considering the fabric's functionality in the context of graphics context execution. In this case, its function is more of a dumb pipe, since it operates at a low-level and shouldn't need to care about the specifics of what the graphics pipeline's functions or data patterns mean. AMD indicated that the fabric was meant to be a more generic and flexible interconnect, which means it doesn't alter how it operates to match the specifics of what it is joining.
The Infinity Fabric has been described as a superset of Hypertransport, and that protocol has a relatively compact set of operations related to transferring data, carrying requests, and carrying responses. Those operations and a few guarantees on ordering and timing are the job of the fabric.
It's the job of the clients on the fabric to have the in-built management or functionality to use those basic operations to construct the more complex communication and coherence methods. If those clients don't have more complex management and don't ask anything of the fabric, the fabric will function well enough doing the same simple methods that were done prior to it being added.

We don't ever get to see the real Ryzen Master in AMD labs. But Infinity Fabric controls all aspects of the CPU, why can't it control mirroring L2 cache across multiple GPUs? It looks like even the two Shader Engines sit on IF, along with 64-bit memory controllers, Geometry Processor, display engine, etc all sit on the fabric..?

I guess I'm not sure what aspects of the CPU you list as being controlled by the fabric. For Zen, the fabric stops at the L3/fabric interface, and for Vega and Navi the fabric stops at the output of the L2.

Slide 17 of the following shows how the infinity fabric connects various fabric stops, which for the CPU is called the CCM. This translates whatever is happening on the other side of the stop into packets and transactions the fabric can understand.
https://www.slideshare.net/AMD/amd-epyc-microprocessor-architecture

The diagram at the following shows the fabric is outside the GPU L2.
https://www.anandtech.com/Gallery/Album/7177#25

The reason I am asking, is that it seems that Infinity Fabric can be used to unify shader memory in some fashion. And if/how it correlates to AMD's new cache system with RDNA, making a dual-gpu a thing again(?).

ed: Also, could alternate frame rendering work at lightning speeds, if they don't have unified memory?

AMD's new GPU cache hierarchy is internal to the L2, which makes it unclear what the fabric needs to do differently.
Alternate frame rendering can work with or without unified memory, on games that are written to avoid techniques that reuse data from a prior frame.
What has made it undesirable is that various kinds of temporal or asynchronous techniques do wind up trying to transfer data between GPUs if they are alternating frames, in addition to complex driver handling, synchronization overhead, pacing issues, and latency.

Unified memory can reduce the overhead of transferring data between the memory pools. There's potentially lighter synchronization needed to manage the transfers, and unified memory usually comes with links with sufficient bandwidth to move the data.
The GPUs still have heavy-weight synchronization, pacing challenges, and driver-based handling of these self-tuning IO devices. The rest of the architectures haven't changed.
A command processor in one GPU doesn't have functionality to work with its counterpart in the other GPU. The L2 of one GPU doesn't respond to requests from the other GPU, and at the same time is not designed to request anything from another GPU. The fabric doesn't change that the clients on it have mostly the same intelligence about execution and coherence as before. Some things are faster with the fabric, but the GPUs don't act any differently--and how they act is generally broken in this use case.

w0lfram · Aug 15, 2019

3dilettante said:
A command processor in one GPU doesn't have functionality to work with its counterpart in the other GPU. The L2 of one GPU doesn't respond to requests from the other GPU, and at the same time is not designed to request anything from another GPU. The fabric doesn't change that the clients on it have mostly the same intelligence about execution and coherence as before. Some things are faster with the fabric, but the GPUs don't act any differently--and how they act is generally broken in this use case.

Thank you once again for your knowledgeable insight.

I now understand the difficulties of getting two GPUs to scale properly in gaming. I quoted the paragraph above because it is most of what I was getting at. Is it that two command processors can't ever have such functionality(?), or is there no economic reason for getting so elaborate and adding it, etc?

Lastly, will we be able to see stand-alone chiplets that aid in ray tracing, or as a graphics co-processor? (Even if there is a slight performance hit)

JoeJ · Aug 15, 2019

w0lfram said:
Lastly, will we be able to see stand-alone chiplets that aid in ray tracing, or as a graphics co-processor? (Even if there is a slight performance hit)

I proposed this a while back. It would make sense to have unique raster, RT, compute chiplets. I think it would be possible to adapt software development so communication between them is not too high (but I'm a hardware noob).
However, the cheaper solution could be to have just one type of chiplet? If that's true, I would not give up on this idea so quickly.

3dilettante said:
What has made it undesirable is that various kinds of temporal or asynchronous techniques do wind up trying to transfer data between GPUs if they are alternating frames

One option here would be to assign regions of memory to be writeable only from a certain chiplet at a time, and that region would be invisible to others. (e.g. parts of current framebuffer)
Texture / mesh / last frame data could be in regions declared as read only. This way no sync of cache would be necessary.

Not sure if that's a naive proposal and if it would help at all, but such additional hurdles would be manageable.
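The region scheme proposed above can be sketched as a small permission table. This is only an illustration of the idea, assuming a software-visible table of named regions; `Region`, `RegionTable` and all of their methods are hypothetical names, and nothing here reflects an actual driver or hardware interface.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Region:
    name: str
    read_only: bool = False      # e.g. textures, meshes, last-frame data
    owner: Optional[int] = None  # chiplet id with exclusive access, if any


class RegionTable:
    """Hypothetical bookkeeping for per-chiplet memory region permissions."""

    def __init__(self) -> None:
        self._regions: Dict[str, Region] = {}

    def add(self, region: Region) -> None:
        self._regions[region.name] = region

    def can_read(self, chiplet: int, name: str) -> bool:
        region = self._regions[name]
        # Read-only regions are visible to everyone; owned regions are
        # invisible to every chiplet except the current owner.
        return region.read_only or region.owner == chiplet

    def can_write(self, chiplet: int, name: str) -> bool:
        region = self._regions[name]
        return (not region.read_only) and region.owner == chiplet

    def hand_over(self, name: str, new_owner: int) -> None:
        # Assumption: real hardware would presumably have to write back the
        # previous owner's cached data at this point.
        region = self._regions[name]
        if region.read_only:
            raise ValueError(f"{name} is read-only and has no owner")
        region.owner = new_owner


table = RegionTable()
table.add(Region("framebuffer_top", owner=0))
table.add(Region("framebuffer_bottom", owner=1))
table.add(Region("textures", read_only=True))
```

In this sketch coherence traffic is only needed at the explicit `hand_over` step, which is the property the proposal is after; whether that holds up in a real pipeline is exactly the open question raised in the post.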
