AMD: RDNA 3 Speculation, Rumours and Discussion

It would be more in the purview of the manufacturing side, where both TSMC and Samsung have alternative developments at varying stages.

Example - https://semiwiki.com/semiconductor-...ghts-of-the-tsmc-technology-symposium-part-2/

CoWoS-L

A new chip-last offering was introduced – CoWoS-L. Like the embedded LSI interconnect bridge added to the InFO offering, a similar configuration is being added to the CoWoS assembly. The silicon interposer is replaced by an organic substrate with an embedded LSI chiplet, offering interposer-like interconnect signal density in a more cost-effective assembly.

At least to my knowledge, chiplet designs and inter-chip connectivity have been on the industry's mind for quite some time now, so the entire chain has been pursuing technologies to support that paradigm. Despite the "technical marketing", neither AMD's chiplet design for CPUs nor Intel's EMIB is some sort of out-of-nowhere surprise technology, even though you can leverage them for some mind share by being first to mass market.
 
128MB is identical to ThreadRipper's L3, make of it what you want. ;)

No, it's not identical but TOTALLY DIFFERENT architecture-wise. There is MUCH MORE to caches than just total size.
It is probably built of similar physical SRAM blocks, but those are organized and connected totally differently.

Threadripper/EPYC has 8..16 separate 8..16 MiB caches, one for each CCX.

If every core is reading the same address, it will be cached in 8..16 separate L3 caches, and each core can only use 8 (Zen 1) or 16 (Zen 2) megabytes of cache. Lots of duplication of the same data, and misses when data is in the wrong L3 cache. If all the cores are operating on the same data, the L3 hit rate in Zen/Zen 2 is much worse than the hit rate of a single 128 MiB L3 cache would be.

On RDNA2, there is 128 MiB of cache that is shared by ALL the cores.
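
To put rough numbers on the duplication point, here is a back-of-the-envelope toy model (my own sketch; the slice counts and the 80% "shared data" figure are assumptions, not measured behaviour):

```python
# Toy model: how much *distinct* data stays resident when every core hammers
# the same hot data set, split per-CCX L3s vs. one fully shared cache.

def unique_resident_mib(total_mib, num_slices, shared_fraction):
    """shared_fraction = portion of each slice holding hot data that every
    other slice also has to keep a copy of (i.e. replicated everywhere)."""
    slice_mib = total_mib / num_slices
    private = slice_mib * (1 - shared_fraction) * num_slices
    shared = slice_mib * shared_fraction   # only one copy counts as unique
    return private + shared

# Zen 2 style: 16 CCX slices x 16 MiB = 256 MiB total, 80% of it shared data:
print(unique_resident_mib(256, 16, 0.8))   # 64 MiB of unique data resident
# RDNA 2 style: one shared 128 MiB cache, nothing ever gets replicated:
print(unique_resident_mib(128, 1, 0.8))    # 128 MiB
```

So even with twice the total SRAM, the split arrangement can hold less unique data, and each core can still only hit in its own slice.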

Also, it's still unknown whether the "infinity cache" of RDNA2 is "memory-side" or "core-side" cache.
 
There is no (longer a) good way of splitting GPU onto multiple dies. All parts of the GPU need very high bandwidth to the memory and/or other parts of the GPU (much higher bandwidth than CPUs need).

Not disagreeing, just to put out some numbers:

(Navi 10) GL1 to L2:
16 channels * 2 (bidirectional) * 64 bytes/clk * 2.0 GHz (assumed) = 4 TB/s
16 * 2 * 512 = 16384 wires o_ o

Though for sure you could use SerDes, say at ~12.5 GT/s like Zeppelin's on-package IO (OPIO), which reduces the wire count by roughly 6x. But it is still approximately 2-3 HBM stacks' worth of data wires. In other words, for this to happen, it would need denser on-package I/O than the organic substrates EPYC uses.

Edit:
(Navi 10) L2 to IF/SDF/MC:
16 channels * 2 (bidirectional) * 32 bytes/clk * 1.75 GHz (presumably mclk @ 14Gbps) = ~1.8 TB/s

Navi 21 seems to have upped this from 32B/clk to 64B/clk, according to the footnote. Presumably it is for the Infinity Cache, if we assume the IC is memory-side.
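
For anyone who wants to poke at the napkin math, here is the same arithmetic in a few lines of Python (the channel counts, bus widths and clocks are the assumptions stated above, not confirmed specs):

```python
# Reproduces the bandwidth / wire-count estimates above.

def agg_bandwidth_tbps(channels, bytes_per_clk, clock_ghz, bidirectional=True):
    dirs = 2 if bidirectional else 1
    return channels * dirs * bytes_per_clk * clock_ghz / 1000  # TB/s

def parallel_wires(channels, bits_per_channel, bidirectional=True):
    dirs = 2 if bidirectional else 1
    return channels * dirs * bits_per_channel

# Navi 10, GL1 <-> L2 (assumed 2.0 GHz):
print(agg_bandwidth_tbps(16, 64, 2.0))       # ~4.1 TB/s
print(parallel_wires(16, 512))               # 16384 data wires

# Serialize at ~12.5 GT/s instead of ~2 Gb/s per wire -> roughly 6x fewer wires:
print(parallel_wires(16, 512) * 2.0 / 12.5)  # ~2600 wires, i.e. 2-3 HBM stacks' worth

# Navi 10, L2 <-> IF/SDF/MC (32 B/clk at mclk = 14 Gbps / 8 = 1.75 GHz):
print(agg_bandwidth_tbps(16, 32, 1.75))      # ~1.8 TB/s
```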

But even if they moved the memory controller, other IO and the Infinity Cache to another die below the main die, they would have a dilemma about which mfg tech to use for that chip:

eDRAM does not work at all on new mfg processes.
SRAM wants to be made with as new a mfg tech as possible to be dense.
PHYs want to be made on old mfg tech to be cheaper, as they do not scale well.

OK, theoretically there could be the option of using a very old process for the IO die and eDRAM, but that would mean being stuck with obsolete tech.
Putting eDRAM aside, they could continue to manufacture all dies on the same bleeding-edge process. So at least they still get the cost-saving part of the deal in reuse & dodging big monolithic chips, and get around the reticle limit of a single die.
 
No, it's not identical but TOTALLY DIFFERENT architecture-wise.

Sorry, I didn't phrase my comment well. I meant to imply that the size of the cache is seemingly not a show-stopper for migration into a separate I/O die, as in Threadripper it goes up to 256MB, if that was one of Jawed's concerns.
In the presentation they said they "templated" off some things of the Zen architecture's L3, which can mean anything really, but it leaves the possibility that there are architectural similarities. In the die shot you see the cache behind the fabric wall. I can only guess it might mean that a unified protocol between cache and core of CPU [core die] &| GPU [core die] &| cache [I/O die] through the fabric could be a possibility (apart from the obvious banking differences: what a CPU needs is not the same as what a GPU needs, thinking of an APU here), but more specifically that a die separation between the inside and outside at the fabric wall is a low-complexity possibility.
 
Moore's law is going in exactly the OTHER direction from "chiplets". We can afford to have MORE functionality in one die. MCMs were a good idea with the Pentium Pro in 1995, and multiple chips were a good thing in Voodoo 1 and Voodoo 2 in 1997 and 1998. Since then, Moore's law has made them mostly obsolete.
There is one trend that’s in favour of chiplets, and that is rising fixed costs for chips on leading edge processes.
AMD going with chiplets for Zen was not necessarily optimal vs. monolithic, as the performance of both Renoir and to some extent the Ryzen 3300X demonstrates. But it allowed AMD to address a wide spectrum of the x86 market, from modest self-builders, via HEDT, to big server solutions, with a single chiplet design. That not only saved them a lot of money on lithographic mask sets, but more importantly design cost and design time. Given their financial situation I don’t believe they could have done anywhere near as well with monolithic chip designs addressing all those market segments.

That doesn’t mean it’s a universal recipe for success by any means; mobile SoCs are a shining counter-example, for instance. Since we’re talking about GPUs here, they enjoy the advantage of scalable design, so producing, for instance, a half-of-everything design is way cheaper than the original new base design. That one will still cost you a pretty penny though, and the mask sets are quite costly for each distinct chip. So if you are targeting consumer products, you really want high volume on your products to amortise your fixed costs over. Depending on what the market looks like, that could be an argument for trying to make a chiplet-type design work, despite its challenges. Personally I doubt it makes all that much sense for GPUs. Nvidia and AMD seem able to cover the entire current GPU spectrum with at most three chips, which can be designed by scaling the resources of a base chip. Yield for the biggest chips could be an issue, but TSMC seems to do well with yields even on 5nm, designing with redundancy helps of course, and Nvidia and AMD would use frequency binning and cutting down functional units anyway to provide a wide range of SKUs.

Moore's law is dead by the way. Nobody told you? ;-) Packaging is where it’s at these days.
 
I'm not sure if anyone explained why they went with 128MB? Is that the sweet spot? Is more actually mo better? Also I'm guessing SRAM shrinks pretty damn well with node shrink, much better than memory interface. So pretty damn forward looking too.

128MB is ~ how much you need to fit the framebuffers for most demanding games at 4k.
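
For a rough sense of that arithmetic, here is a back-of-envelope calculation with an example set of render targets (my own made-up set, not a measured game):

```python
# 4K render target sizes and how quickly a handful of them reaches ~128 MB.

W, H = 3840, 2160
MiB = 1024 * 1024

def target_mib(bytes_per_pixel):
    return W * H * bytes_per_pixel / MiB

for name, bpp in {"RGBA8": 4, "D32": 4, "RG16F": 4, "RGBA16F": 8}.items():
    print(f"{name:8s} {target_mib(bpp):5.1f} MiB")   # 31.6 MiB or 63.3 MiB each

# Example working set: HDR color + depth + normals + motion vectors
working_set = target_mib(8) + target_mib(4) + target_mib(4) + target_mib(4)
print(f"~{working_set:.0f} MiB")   # ~158 MiB -- in the ballpark of 128 MB,
# and every extra full-screen target adds another ~32-64 MiB.
```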

No "chiplets", unless they move into 3D packaging with the memory controller/IO die below and GPU die above.

And AFAIK AMD does not have access to any EMIB-like packaging technology.

TSMC has been pushing hard on multiple different 3D packaging technologies, one of which is very similar to EMIB. TSMC's own site about them is sadly totally incomprehensible buzzword bingo; I personally much prefer the far more readable, reasonably recent Anandtech article about them. It's important to note that these things are now actually being pushed into real, reasonably high-volume products. Nothing as high-volume as GPUs yet, but no longer just really low-volume one-off products either.

I think it's a good bet that both AMD CPUs and GPUs will end up using something from this set of technologies, it's just a question of timing. I think on the first generation that they actually use it, there will be only one or two products doing it, so they don't have to bet an entire product cycle on relatively unproven tech. Then once it's proven the next gen products are all in.

But even if they moved the memory controller, other IO and the Infinity Cache to another die below the main die, they would have a dilemma about which mfg tech to use for that chip:

eDRAM does not work at all on new mfg processes.
SRAM wants to be made with as new a mfg tech as possible to be dense.
PHYs want to be made on old mfg tech to be cheaper, as they do not scale well.

If they are using any kind of 3D integration, they are going to use some HBM derivative and stack the memory on it too, essentially removing this issue from the equation.

Moore's law is going in exactly the OTHER direction from "chiplets". We can afford to have MORE functionality in one die. MCMs were a good idea with the Pentium Pro in 1995, and multiple chips were a good thing in Voodoo 1 and Voodoo 2 in 1997 and 1998. Since then, Moore's law has made them mostly obsolete.

Everyone in the industry disagrees with you here. Look at what AMD, Intel, and TSMC are talking about, everyone is moving towards disaggregating things from one chip to many small ones. The driving force behind this is that everyone apparently believes that yields on large chips will be worse in the future than they are now.

I agree that the hard part of splitting a rasterizing GPU is the ROPs. Every simple solution results in massive traffic needed essentially to ensure coherency in them. The solution is either to do it late in the pipeline with massive inter-chiplet bandwidth provided by some kind of 3D packaging system, which is conceptually easy but technically challenging, or some kind of binning and data movement in the early parts of the pipeline to get the right pixels to the right pipelines before you shade them, which is easier on the hardware but much harder on the software, especially for providing good performance on older games.
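
As a toy illustration of the "bin early" idea (my own sketch, not any vendor's actual scheme): carve the screen into coarse tiles, give each chiplet a fixed set of tiles, and route every triangle to the chiplet(s) whose tiles its screen-space bounding box touches.

```python
TILE = 64           # assumed coarse tile size in pixels
NUM_CHIPLETS = 4    # assumed chiplet count

def owner_of_tile(tx, ty):
    # Fixed repeating ownership pattern to spread load across chiplets.
    return (tx + ty * 3) % NUM_CHIPLETS

def chiplets_for_triangle(verts):
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    tx0, tx1 = int(min(xs)) // TILE, int(max(xs)) // TILE
    ty0, ty1 = int(min(ys)) // TILE, int(max(ys)) // TILE
    return {owner_of_tile(tx, ty)
            for tx in range(tx0, tx1 + 1)
            for ty in range(ty0, ty1 + 1)}

# A small triangle stays on one chiplet, a screen-wide one is broadcast to all:
print(chiplets_for_triangle([(10, 10), (40, 10), (10, 40)]))   # {0}
print(chiplets_for_triangle([(0, 0), (3800, 0), (0, 2100)]))   # {0, 1, 2, 3}
```

The hardware cost of something like this is small, but as said above the pain shows up in software: load balancing, triangles spanning many tiles, and keeping older games fast.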

I do want to note that all of this goes away once everything goes RT. The ideal RT accelerator can consist of many small chiplets that basically don't even need to know of each other's existence. It's okay for the uppermost levels of the acceleration structure to be duplicated in every cache, as they are relatively small anyway -- good trees widen quickly. Assuming that the scene is partitioned into contiguous chunks for each RT chiplet, the cache usage partitions exactly the right way -- any primitive or part of the acceleration structure that is needed to draw a pixel is most likely needed to draw a pixel right next to it on screen, and least likely needed to draw a pixel far away from it.
 
I don't know about Moore's law, but the 3080 is on average not even 2x faster than my 3.5-year-old 1080 Ti. That's despite the chip sizes and power draws getting bigger and new GPUs being more expensive. If it's roughly 4 years to double GPU performance, that is just sad. And if this gets even slower, it's even more sad.

That said, I expect RDNA 3 vs. Nvidia to be very interesting. Both players are on the ball and that could bring out the best in both sides. Maybe the RDNA 3 upgrade cycle will be the best we have seen in a long time, or perhaps doubling of perf will now happen every 5 years :(
 
It's worth remembering that Tahiti launched with 264GB/s 8 years ago. So we're approaching 3x the performance per unit bandwidth...
 
128MB is ~ how much you need to fit the framebuffers for most demanding games at 4k.

Today, sure. My concern is tomorrow. UE4 already has a "next gen only" TAA update, and I'd have to check it, but based on other TAA advances I bet it already uses at least one extra 8-bit target, if not more.

Now for the competition: likely these same buffers will be pulled a lot, choking the 30XX series for bandwidth even if RDNA2 gets choked by limited cache. Ampere gaming just doesn't have enough bandwidth to go around. But I still suspect we'll see edge cases where buffers overrun 128MB without pulling a ton of traffic.

Anyway, chiplets make a hell of a lot of sense for GPUs. Yields are pretty terrible for both of the new console APUs, and "Big" Navi is even larger. Chiplets have advantages across the board. You only need to tape out one or two relatively small chiplet designs, and with design costs and times growing exponentially with each new node, that alone is a huge cost and time saver. And by saving time you can also update faster: if your compute or I/O die is updated, you don't necessarily need to wait for the other die(s) to be updated to put out a new product; we can see this shipping already with Zen processor generations. Then of course the yields go up mightily with tiny chiplets, as do binning possibilities. You can even scale beyond reticle limits, something Nvidia is no doubt already hotly anticipating, as the interconnects for this are already in R&D at TSMC. And it probably opens up Samsung a lot more as competition to TSMC: even if yields are worse in comparison, you're getting far more usable silicon thanks to tiny chiplets anyway, which could make Samsung a lot more attractive. Not to mention that with better yields, going wider than ever is even more affordable, which could make it sensible even for mobile designs. Who needs high-clocked chips when it's relatively cheap to throw 50% more silicon at the problem while getting better power efficiency?
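
On the yield point specifically, a standard Poisson defect model with made-up numbers (not actual TSMC/Samsung/AMD data) shows why several small dies can beat one big one:

```python
import math

def die_yield(area_mm2, defects_per_mm2):
    # Poisson model: probability that a die of this area has zero killer defects.
    return math.exp(-defects_per_mm2 * area_mm2)

D0 = 0.001   # assumed 0.1 defects/cm^2 = 0.001 defects/mm^2
big, small = 520, 130   # one big monolithic die vs. one of four chiplets, mm^2

print(f"{big} mm^2 monolithic yield: {die_yield(big, D0):.0%}")    # ~59%
print(f"{small} mm^2 chiplet yield:  {die_yield(small, D0):.0%}")  # ~88%
# This ignores packaging yield, harvesting/redundancy, and the extra area
# spent on cross-die links, so treat it as directional, not a cost model.
```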
 
Moore's law is dead by the way. Nobody told you? ;-) Packaging is where it’s at these days.

Moore's law is very far from dead. It has only slightly slowed down; density is doubling every ~3 years instead of every 2 years.

Navi 21 has about 2.1 times as many transistors as Vega 10, which was released about 38 months earlier.
Nvidia's GA100 has about 2.6 times as many transistors as GV100, which was released 3 years earlier - so Nvidia has roughly kept the pace of Moore's law.

TSMC N7 is 2.5 times more dense than GF "14nm" which came about 3 years earlier.

And Moore's law has NEVER said anything about clock speed or single-thread performance.
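
For reference, the implied doubling times from those transistor-count ratios (just arithmetic on the figures above):

```python
import math

def doubling_time_years(ratio, years_elapsed):
    return years_elapsed * math.log(2) / math.log(ratio)

print(doubling_time_years(2.1, 38 / 12))  # Vega 10 -> Navi 21: ~3.0 years
print(doubling_time_years(2.6, 3.0))      # GV100  -> GA100:   ~2.2 years
```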


 
Moore's law is very far from dead. It has only slightly slowed down; density is doubling every ~3 years instead of every 2 years.

Navi 21 has about 2.1 times as many transistors as Vega 10, which was released about 38 months earlier.
Nvidia's GA100 has about 2.6 times as many transistors as GV100, which was released 3 years earlier - so Nvidia has roughly kept the pace of Moore's law.

TSMC N7 is 2.5 times more dense than GF "14nm" which came about 3 years earlier.

And Moore's law has NEVER said anything about clock speed or single-thread performance.


Slowing down = dead, if it was about exponential growth in the first place. The main reason Moore's Law is dead isn't that we aren't getting any improvement any more; it's that we will never be back to the days when just moving to a new node and adding more transistors automatically made much, much better ICs every generation. We don't get to pack 2x the number of faster transistors in the same space without worrying about power and cost, like in the 70s, 80s and 90s. That whole paradigm enabled general-purpose ICs to dominate the market, because designing specialized hardware didn't make financial sense when Intel was coming in with a new CPU that did pretty much everything twice as fast every 2 years. Nowadays you don't get that. We are going to need specialized hardware to do specialized things, and this has been going on for the past 10 years.

Another thing to note is that cost per transistor has exploded even when moving to denser nodes. Electrical characteristics of transistors on the newer nodes are just plain not scaling. 5nm is barely an improvement over 7nm in practical terms such as power, performance and cost. Who cares if it's denser if it doesn't gain anything else? The physical size of the chip has never been the big thing about new nodes.

Here is Moore's original quote:
The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.

We pretty much stopped scaling like this at 28nm. Even moving to 14nm was expensive, and the cost of ICs over the years has been going up. Wafer cost on the newer nodes is going up almost as fast as transistor density. It took 2 extra years before 14nm was able to overtake 22nm for Intel, and because of Intel's insistence on scaling, they are still on 14nm after trying to move beyond it for 6 years. Every new node is just more band-aids put on top to make it work, making things more expensive. New tech is helping, but all of it is temporary. None of the new tech (EUV, GAAFET, TSV, stacking, etc.) is going to solve the problem long term, since none of them enable the exponential scaling that used to come basically for free. They will all hit limits within a few generations of use. The requirements for new tech to keep progressing are getting higher and higher, and this is what really defines post-Moore's-Law vs. what used to happen during Moore's Law back in the day.

The law is long dead at this point unless you keep changing the definition like some people like to do.
 
Yes, you are missing a lot here:

1) Where to put the ROPs?
2) The control logic should be close enough to the cores.
3) This kind of architecture would mean L3 caches near the dies and an L4 cache on the memory controller die. But this would be VERY inefficient in terms of hit rate per total cache, for multiple reasons:
a) Multiple cores operating on the same triangle or on nearby triangles will access the same area of the framebuffer, which wants to be in ONE cache, not multiple split ones.
b) An L4 cache of similar size to the L3 cache would be quite useless. Unless it's a victim cache, either you hit both or you hit neither.
4) What manufacturing tech to use for the IO die? SRAM cache wants to be made on NEW dense mfg tech, PHYs with old cheap mfg tech. The big gain in Zen 2 Matisse comes from the old mfg tech of the IO die, which does not make sense if you have SRAM L3 cache there.

Moore's law is going in exactly the OTHER direction from "chiplets". We can afford to have MORE functionality in one die. MCMs were a good idea with the Pentium Pro in 1995, and multiple chips were a good thing in Voodoo 1 and Voodoo 2 in 1997 and 1998. Since then, Moore's law has made them mostly obsolete.


Thanks for the info, I'm no GPU expert but my answers...

1) On each chiplet, next to the shading/RT engines, as they are now.
2) OK, this is the part I know least about; I'm pretty good with modern CPU tech, but fall down when we get to the nitty-gritty of breaking up GPU workloads and distributing them among units.
3) Sorry, no L4 in the IO die, just the memory controller - you've already got a low-latency, high-BW cache close to the units; adding a bit of latency to go to actual memory isn't going to kill you
a) yeah, I'm not sure about this? What about just accepting that some pixels will be duplicated during rendering?
b) as before, no L4 cache, L3 on each GPU chiplet, then talk to a big fat IO die which is the memory controller
4) Whatever is best for the chosen memory interface, likely similar to what AMD does now, and do it on an older process that is better for high pin count IO

I understand Moore's law, but I also understand modern reticle limits; if you want a GPU that is 10x to 100x the power of a 3090 or 6900, it's not going to happen on a single die, Moore's law or not.

:)
 
I don't know about Moore's law, but the 3080 is on average not even 2x faster than my 3.5-year-old 1080 Ti. That's despite the chip sizes and power draws getting bigger and new GPUs being more expensive. If it's roughly 4 years to double GPU performance, that is just sad. And if this gets even slower, it's even more sad.

That said, I expect RDNA 3 vs. Nvidia to be very interesting. Both players are on the ball and that could bring out the best in both sides. Maybe the RDNA 3 upgrade cycle will be the best we have seen in a long time, or perhaps doubling of perf will now happen every 5 years :(

Moore's law has never been about performance. It has always been about the number of transistors per chip. And that 3080 has 2.33 times as many transistors as your 1080 Ti, and it came about 3.5 years later. That's about a doubling per 3 years, only a slight slowdown of Moore's law.

And about performance: That 3080 is something like 10x faster than your 1080ti when performing ray tracing, or when calculating neural networks.

Despite having 2.33 times more transistors it's not even 2x faster when performing rasterization on old games because
1) the performance has been increased elsewhere than in rasterization performance
2) those old games are also bottlenecked by CPU and memory bandwidth
3) lots of those transistors are used for more cache to make the memory less of a bottleneck.
 
Thanks for the info, I'm no GPU expert but my answers...

1) On each chiplet, next to the shading/RT engines, as they are now.

... so you would have multiple separate clusters of ROPs and long latency between them.

Guess what happens when you render two overlapping triangles, one on one die and the other on another die, at almost the same time?

You either have cache coherency between your L3 caches (which always adds a LOT of traffic between your dies, and you then take a considerable performance hit in this situation), or you are not cache-coherent and render this situation incorrectly.

Or you impose lots of limitations so that they cannot render the same area of the screen (or the same area of a non-screen buffer), and you add a lot of overhead and inefficiency from those limitations.

a) yeah, I'm not sure about this? What about just accepting that some pixels will be duplicated during rendering?

A1) In the original post I was replying to, it was claimed that there are cache advantages to this. I'm pointing out that in reality it's the opposite: there are cache disadvantages from this.
A2) The bigger problem is with writes. The latest write to that pixel has to be the one that ends up as the final value of that pixel.

You cannot just "accept" that you have a wrong color value.

4) Whatever is best for the chosen memory interface, likely similar to what AMD does now, and do it on an older process that is better for high pin count IO

The point is: there are (manufacturing) technology AND market situation (WSA) specific REASONS for AMD's Zen 2/Zen 3 MCMs ("chiplets"). There are no such reasons for MCM in a GPU generally.

MCM is not the future. It's the PAST, AND a niche solution for certain individual chips. There is nothing in RDNA3 that makes it a good candidate for an MCM with multiple logic chips, neither for technical nor for marketing reasons.

I understand Moore's law, but I also understand modern reticle limits; if you want a GPU that is 10x to 100x the power of a 3090 or 6900, it's not going to happen on a single die, Moore's law or not.

:)

No. It seems you do not understand the economics of microchip manufacturing.

If you want to make a GPU that consumers can afford, you are not limited by the reticle size, and you will be even less limited by it in the future.

The reticle size is not decreasing with new mfg processes.

But the cost per die area is increasing with new mfg processes.

This means that the cost of a chip that is reticle-limited is increasing.

Also, power density is increasing. We are getting even more performance out of the same die area while consuming even more power.
If we want to be able to keep our dies cool enough to keep running them at high speeds, putting them into the same package is not helping; it's hurting.


MCMs generally only make sense if you want to build as powerful an enterprise-class system as possible inside one package.


If you want to make a consumer-priced chip that has 10x-100x the performance of a 3090 or 6900, splitting the logic into multiple dies and having lots of communication overhead and more latency will not help you. When you can afford the die area needed for that performance, you can also afford that area inside a single logic die and have none of these problems.
 
Slowing down = dead, if it was about exponential growth in the first place. The main reason Moore's Law is dead isn't that we aren't getting any improvement any more; it's that we will never be back to the days when just moving to a new node and adding more transistors automatically made much, much better ICs every generation. We don't get to pack 2x the number of faster transistors in the same space without worrying about power and cost, like in the 70s, 80s and 90s. That whole paradigm enabled general-purpose ICs to dominate the market, because designing specialized hardware didn't make financial sense when Intel was coming in with a new CPU that did pretty much everything twice as fast every 2 years. Nowadays you don't get that. We are going to need specialized hardware to do specialized things, and this has been going on for the past 10 years.

Another thing to note is that cost per transistor has exploded even when moving to denser nodes. Electrical characteristics of transistors on the newer nodes are just plain not scaling. 5nm is barely an improvement over 7nm in practical terms such as power, performance and cost. Who cares if it's denser if it doesn't gain anything else? The physical size of the chip has never been the big thing about new nodes.

And all these power/area and cost/area increases just mean that MCMs ("chiplets") make even LESS SENSE. An MCM with lots of die area is VERY expensive to make and very hard to cool. The amount of chip area that we can effectively cool inside one package, and which consumers can afford, is getting SMALLER. So it's EASIER to make that as a monolithic die.

Here is Moore's original quote:


We pretty much stopped scaling like this at 28nm. Even moving to 14nm was expensive, and the cost of ICs over the years has been going up. Wafer cost on the newer nodes is going up almost as fast as transistor density. It took 2 extra years before 14nm was able to overtake 22nm for Intel, and because of Intel's insistence on scaling, they are still on 14nm after trying to move beyond it for 6 years. Every new node is just more band-aids put on top to make it work, making things more expensive. New tech is helping, but all of it is temporary. None of the new tech (EUV, GAAFET, TSV, stacking, etc.) is going to solve the problem long term, since none of them enable the exponential scaling that used to come basically for free. They will all hit limits within a few generations of use. The requirements for new tech to keep progressing are getting higher and higher, and this is what really defines post-Moore's-Law vs. what used to happen during Moore's Law back in the day.

This quote is not from Moores original paper. This quote is written about 50 years later than the original paper.

Here is the original Moore's law paper:

https://newsroom.intel.com/wp-content/uploads/sites/11/2018/05/moores-law-electronics.pdf


The law is long dead at this point unless you keep changing the definition like some people like to do.

The definition was changed already 45 years ago; originally Moore was talking about a doubling of transistor density EVERY year.

So, Moore's law was either dead already 45 years ago, OR it has slowed down twice: first 45 years ago and now again.

But it cannot honestly both survive one slowdown and be declared dead on the second slowdown.
 
... so you would have multiple separate clusters of ROPs and long latency between them.

Guess what happens when you render two overlapping triangles, one on one die and the other on another die, at almost the same time?

You either have cache coherency between your L3 caches (which always adds a LOT of traffic between your dies, and you then take a considerable performance hit in this situation), or you are not cache-coherent and render this situation incorrectly.

Or you impose lots of limitations so that they cannot render the same area of the screen (or the same area of a non-screen buffer), and you add a lot of overhead and inefficiency from those limitations.
That's already the case. ROPs don't access arbitrary render target locations, they're each assigned to RT tiles in a repeating pattern.
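
A minimal sketch of what "assigned to RT tiles in a repeating pattern" means in practice (tile size and ROP count are made up here, not Navi's actual layout):

```python
TILE = 32        # assumed screen-tile size in pixels
NUM_ROPS = 16    # assumed number of ROP partitions

def rop_for_pixel(x, y):
    tx, ty = x // TILE, y // TILE
    # Interleave tiles across ROP partitions in a fixed repeating pattern,
    # so neighbouring tiles land on different ROPs and the load stays balanced.
    return (tx + ty * 7) % NUM_ROPS

print(rop_for_pixel(100, 100))   # every pixel of a given tile hits one fixed ROP
```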
 
That's already the case. ROPs don't access arbitrary render target locations, they're each assigned to RT tiles in a repeating pattern.

Yes, and this works because all ROPs are on a single chip and there is essentially a massive crossbar (or some other N-to-M switch) between them and the shader array. Any lane of any SIMD pipe can output a ROP op that can end up at any ROP. Getting this done in a distributed GPU without blowing up the power budget is really hard.

Unlike hkultala, I strongly feel that a "chiplet GPU" is the future and that it's going to happen. However, this is not something you can handwave away. This is IMHO the only actually really hard part of making a multi-chip GPU work. Everything else is comparatively easy, this is the part that you have to solve to get it work.

The solutions I've offered are either waiting for TSMC to get fancy enough 3d packaging/interconnect tech done, or somehow sorting and binning at the early part of the pipeline.
 
Any lane of any SIMD pipe can output a ROP op that can end up at any ROP. Getting this done in a distributed GPU without blowing up the power budget is really hard.

Aren't render target tiles assigned at the rasterizer and shader array level? Why would a shader array spit out pixels for a tile owned by a ROP in a different array?
 