AMD: RDNA 3 Speculation, Rumours and Discussion

manux · Feb 10, 2021

Chiplet for gaming workload requires explicit developer support to group work in a way that minimizes cross chip traffic. Same reason why SLI is dead. Moving data between chips eats a ton of power if it's done at same speed chip operates internally. Chiplets will happen for gaming once developers are on board for it. One could claim nvidia's server solutions are "kind of chiplets" as the gpu's are connected together via nvlink and share same cache coherent memory space. But even in that space the requirement is to be aware of penalty of chip to chip communication. However people are willing to take that into account and optimize to be able to solve bigger problems.

Chiplet's for cpu's are easier as there are independent workloads. Though even there we have seen issues communicating between chips causing performance degradation and needing to optimize to minimize hopping between chips.

Bondrewd · Feb 10, 2021

manux said:
Chiplet for consumer requires explicit developer support to group work in a way that minimizes cross chip traffic

Or you just build an overkill d2d interconnect that uses fancy adv packaging.

manux said:
One could claim nvidia's server solutions are "kind of chiplets" as the gpu's are connected together via nvlink and share same cache coherent memory space. But even in that space the requirement is to be aware of penalty of chip to chip communication

It does but NVLink is slow by GPU standards.
A100 NVLink b/w total is around 600GB/s with A100 main DRAM b/w being over twice that number.
And that's DRAM.

manux said:
Chiplet's for cpu's are easier as there are independent workloads.

Fuck no, the sheer amount of CC magick going into stuff like Rome or SPR is amazing.
CPU CC is a nasty, bulky and hard abstraction we all deal with cuz we have to.

manux · Feb 10, 2021

Bondrewd said:
Or you just build an overkill d2d interconnect that uses fancy adv packaging.

It does but NVLink is slow by GPU standards.

Fuck no, the sheer amount of CC magick going into stuff like Rome or SPR is amazing.

If I don't remember all wrong the nvlink bandwidth in latest nvidia servers is something like 300GB/s down and 300GB/s up. Combined 600GB/s. It's pretty decent nvlink speed in and out of single chip. Single link isn't that fast but there is multiple nvlinks per chip and flexibility on how to route the traffic between chips using one or more of the links.

Bondrewd · Feb 10, 2021

manux said:
If I don't remember all wrong the nvlink bandwidth in latest nvidia servers is something like 300GB/s down and 300GB/s up. Combined 600GB/s. It's pretty decent nvlink speed in and out of single chip.

Yeah but A100 alone has over twice the b/w to main DRAM.
And its L2 bandwidth per each NUCA segment is something unholy.

DegustatoR · Feb 10, 2021

Bondrewd said:
Oh but they mostly did.

Yeah "mostly" as in "not at all".
As I've said, let's wait and see. It's not a given that a chiplet design on N5 would be better than a traditional one on N7 for example.

manux · Feb 10, 2021

I remember why I don't participate on these threads. I'll take my coat and go wait to see what frontier is all about. This easily could be another gddr6x moment for some people.

Bondrewd · Feb 10, 2021

DegustatoR said:
Yeah "mostly" as in "not at all".

Man that's a lotta seething from you.

DegustatoR said:
As I've said, let's wait and see

Of course.
It's always funny with you.

DegustatoR said:
It's not a given that a chiplet design on N5 would be better than a traditional one on N7 for example.

Of course it would be, even a single tile would be quite a lot more mean.
N5(p) is also some more speed, and everyone loves speed.

Lower end N3x's are single dies for mobile anyway.
So ugh I dunno what's even your point.

manux said:
I'll take my coat and go wait to see what frontier is all about

It's very simple, Trento and 4 MI200s doing a thing.
Add some Slingshot juice and ta-da!

DegustatoR · Feb 10, 2021

Bondrewd said:
Lower end N3x's are single dies for mobile anyway.
So ugh I dunno what's even your point.

That's pretty obvious tbh

Bondrewd · Feb 10, 2021

DegustatoR said:
That's pretty obvious tbh

Yeah the packaging volumes can be a bit too soulcrushing given the target volume.

Rootax · Feb 11, 2021

manux said:
Chiplet for gaming workload requires explicit developer support to group work in a way that minimizes cross chip traffic. Same reason why SLI is dead. Moving data between chips eats a ton of power if it's done at same speed chip operates internally. Chiplets will happen for gaming once developers are on board for it. One could claim nvidia's server solutions are "kind of chiplets" as the gpu's are connected together via nvlink and share same cache coherent memory space. But even in that space the requirement is to be aware of penalty of chip to chip communication. However people are willing to take that into account and optimize to be able to solve bigger problems.

Chiplet's for cpu's are easier as there are independent workloads. Though even there we have seen issues communicating between chips causing performance degradation and needing to optimize to minimize hopping between chips.

Hu the big thing is with rdna3&co is that devs don't have to do a thing about it anymore ?

manux · Feb 11, 2021

Rootax said:
Hu the big thing is with rdna3&co is that devs don't have to do a thing about it anymore ?

It's unlikely that would be true. The bandwidth needed to communicate between chips is huge(terabytes/s). That huge bandwidth translates into extra power consumption. Developer interaction is needed to make the workload such that each core can work mostly independently avoiding the bandwidth(power) cost. It's different with cpu's as it's much easier to find threads with independent workloads. Despite this even cpu's see issues when threads jump between chips,... And hence we need os that is aware of chiplets and keeps work in same chip, game engines need to be optimized etc.

GPU chiplets in gaming context only make sense after the biggest possible standalone chip is not enough anymore. And this implies developers are then on board to optimize. GPU chiplets in AI/HPC on the other hand make a lot of sense. Those workloads already are being distributed and being optimized for multi gpu/chip use cases.

tunafish · Feb 11, 2021

manux said:
It's unlikely that would be true. The bandwidth needed to communicate between chips is huge(terabytes/s). That huge bandwidth translates into extra power consumption.

I do not think you are fully appreciating the things that new, exotic packaging can get us. Chiplet GPUs using Zen-style packaging (separate dies on an organic substrate) are complete non-starters for reasons you have outlined. (You state that it's not possible without getting developers onboard, frankly I do not think getting them onboard is possible.)

The reason people are repeatedly proposing them is that the entire industry is right now falling over themselves trying to produce better ways to connect chiplets to each other, some of which have energy/transferred bit comparable to moving data between two structures on the same chip. So yes, we are literally proposing GPU chiplets with terabytes/s interconnect bandwidth, using some of the new packaging/integration technologies that are just maturing.

HLJ · Feb 11, 2021

You always run in the the physical universe:
Compute: Cheap
Moving data: EXPENSIVE!!!

manux · Feb 11, 2021

tunafish said:
I do not think you are fully appreciating the things that new, exotic packaging can get us. Chiplet GPUs using Zen-style packaging (separate dies on an organic substrate) are complete non-starters for reasons you have outlined. (You state that it's not possible without getting developers onboard, frankly I do not think getting them onboard is possible.)

The reason people are repeatedly proposing them is that the entire industry is right now falling over themselves trying to produce better ways to connect chiplets to each other, some of which have energy/transferred bit comparable to moving data between two structures on the same chip. So yes, we are literally proposing GPU chiplets with terabytes/s interconnect bandwidth, using some of the new packaging/integration technologies that are just maturing.

I would be curious to see numbers for power consumption transferring something like 1TB/s constantly between chips would take. Maybe you have numbers to share? Last real numbers I saw were high enough to make this approach not work. If the in between chip bandwidth doesn't match internal bandwidth that will create issues unless workload is optimized to take the lesser bandwidth and potentially higher latency into account. In essence the chiplet would not work as single big chip, rather there would be communication bottleneck in between.`

edit. Honestly, 1TB/s is probably not enough. Probably one would want to have at least same speed for chip to chip communication as is speed for infinity cache. Closer to 2TB/s read and 2TB/s write would be realistic want for rdna3 to start making the interconnect fast enough to not be major bottle neck.

Rootax · Feb 11, 2021

manux said:
It's unlikely that would be true. The bandwidth needed to communicate between chips is huge(terabytes/s). That huge bandwidth translates into extra power consumption. Developer interaction is needed to make the workload such that each core can work mostly independently avoiding the bandwidth(power) cost. It's different with cpu's as it's much easier to find threads with independent workloads. Despite this even cpu's see issues when threads jump between chips,... And hence we need os that is aware of chiplets and keeps work in same chip, game engines need to be optimized etc.

GPU chiplets in gaming context only make sense after the biggest possible standalone chip is not enough anymore. And this implies developers are then on board to optimize. GPU chiplets in AI/HPC on the other hand make a lot of sense. Those workloads already are being distributed and being optimized for multi gpu/chip use cases.

So what the point of the chiplet buzz if it's another Fury MAXX ?

That's why it's on paper a big deal. Same with what Imagination Tech is doing for exemple (on paper). If it's not, all the patents and buzz would have been BS, I don't see that happening...

Bondrewd · Feb 11, 2021

manux said:
The bandwidth needed to communicate between chips is huge(terabytes/s).

YES.
Now imagine the sheer amount of cache traffic a 96c Genoa part would do internally running AVX512 on a pair 512b FMAs per each core.

Rootax said:
So what the point of the chiplet buzz if it's another Fury MAXX ?

It's not buzz, the entire industry is gearing up for it.
x86 duo, TSMC, EDA vendors, you name them.
Big and mean d2d is here to come and stay.

https://www.servethehome.com/wp-con...l-Architecture-Day-2020-Packaging-AIB-2.0.jpg
Stuff like this.

w0lfram · Feb 11, 2021

AMD has been working with heterogeneous compute and unifying lvl 3 cache..

A newer HBCC using a new Infinity Fabric is all that is needed.

HLJ · Feb 11, 2021

People always want to break the laws of the universe...but he universe don't give a F....
Some reading, because I can see people with more brand bias than physical knowledge have started being silly:
Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs (nvidia.com)
hpca18-xor.pdf (niladrish.org)
Kestor.pdf (pnl.gov)
Data movement is overtaking computation as the most dominant cost of a... | Download Scientific Diagram (researchgate.net)

Gaming is a piss poor application for chiplets (due to frametimes being so vital)...other workloads are better suited for chiplets, but the fact remains:

You want to move your data as LITTLE as possible both on-chip and off chip.

Nebuchadnezzar · Feb 11, 2021

HLJ said:
People always want to break the laws of the universe...but he universe don't give a F....
Some reading, because I can see people with more brand bias than physical knowledge have started being silly:
Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs (nvidia.com)
hpca18-xor.pdf (niladrish.org)
Kestor.pdf (pnl.gov)
Data movement is overtaking computation as the most dominant cost of a... | Download Scientific Diagram (researchgate.net)

Gaming is a piss poor application for chiplets (due to frametimes being so vital)...other workloads are better suited for chiplets, but the fact remains:

You want to move your data as LITTLE as possible both on-chip and off chip.

Those papers are completely irrelevant. In the next year we'll have packaging and d2d interconnect methods that use 1/10th* the energy per bit of what's been used traditionally.

I don't remember where, but AMD/ had given an interview where they said the new technologies are allowing them unprecedented bandwidth that wasn't possible before.

* AMD last quoted 2pJ/bit for their IFOP SerDes: https://www.slideshare.net/AMD/isscc-2018-zeppelin-an-soc-for-multichip-architectures

http://www.guc-asic.com/en-global/news/pressDetail/glink

Error-free communication between dies with full duplex 0.7 Tbps traffic per 1 mm of beachfront, consuming just 0.25 pJ/bit (0.25W per 1 Tbps of full duplex traffic) was demonstrated.

Next generation GLink IP supporting 1.3 Tbps error-free full duplex traffic per 1 mm of beachfront with the same 0.25 pJ/bit power consumption is already available using TSMC 5nm process. Following generation of GLink supporting 2.7 Tbps/mm error-free full duplex traffic with the same 0.25 pJ/bit power consumption using TSMC 5nm and 3nm process will be available during 2021.

Those are the kind of numbers we're at today.

If Imagination says they can do segregated GPUs with just one wire between them, scaling perfectly well in graphics, I don't see why AMD couldn't.

manux · Feb 11, 2021

Rootax said:
So what the point of the chiplet buzz if it's another Fury MAXX ?

That's why it's on paper a big deal. Same with what Imagination Tech is doing for exemple (on paper). If it's not, all the patents and buzz would have been BS, I don't see that happening...

AI/HPC workloads in datacenter/professional use. There is huge incentive there to crunch data that doesn't either fit in one gpu or reasonable compute time requires multiple gpu's. Datacenter full of these types of things: https://www.nvidia.com/en-us/data-center/hgx/ John Carmack for example got one of these for ai work he is pursuing: https://www.nvidia.com/en-us/data-center/dgx-station-a100/

CPU's as chiplets make sense in consumer world as the nature of cpu tasks is such that cores can often operate independently of each other. Though it's not always the case and we did see perf issues initially(os scheduler update, better implementation in zen3, cp2077 issues requiring chiplet specific tuning etc).

AMD won multiple big supercomputer deals. My guess is chiplets go there first. Perhaps also as prosumer devices as they could be tempting in various non gaming use cases like research university/companies do. Perhaps chiplet goes to frontier or el capitan: https://www.amd.com/en/products/exascale-era

AMD: RDNA 3 Speculation, Rumours and Discussion

Similar threads